python - Creating a DataFrame from an RDD while specifying DateType() in the schema

Reposted by: 行者123  Updated: 2023-11-28 22:12:14

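I am creating a DataFrame from an RDD, and one of the values is a date. I don't know how to specify DateType() in the schema.

Let me illustrate the problem at hand.

One way to load the date into the DataFrame is to first declare it as a string and then convert it to a proper date with the to_date() function.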

from pyspark.sql.types import Row, StructType, StructField, StringType, IntegerType, DateType
from pyspark.sql.functions import col, to_date
values=sc.parallelize([(3,'2012-02-02'),(5,'2018-08-08')])
rdd= values.map(lambda t: Row(A=t[0],date=t[1]))

# Importing date as String in Schema
schema = StructType([StructField('A', IntegerType(), True), StructField('date', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

# Finally converting the string into date using to_date() function.
df = df.withColumn('date',to_date(col('date'), 'yyyy-MM-dd'))
df.show()
+---+----------+
|  A|      date|
+---+----------+
|  3|2012-02-02|
|  5|2018-08-08|
+---+----------+

df.printSchema()
root
 |-- A: integer (nullable = true)
 |-- date: date (nullable = true)

Is there a way to use DateType() in the schema and avoid converting the string to a date explicitly?

Something like this:

values=sc.parallelize([(3,'2012-02-02'),(5,'2018-08-08')])
rdd= values.map(lambda t: Row(A=t[0],date=t[1]))
# Somewhere we would need to specify date format 'yyyy-MM-dd' too, don't know where though.
schema = StructType([StructField('A', IntegerType(), True), StructField('date', DateType(), True)])

Update: Following @user10465355's suggestion, the code below works:

import datetime
schema = StructType([
  StructField('A', IntegerType(), True),
  StructField('date', DateType(), True)
])
rdd= values.map(lambda t: Row(A=t[0],date=datetime.datetime.strptime(t[1], "%Y-%m-%d")))
df = sqlContext.createDataFrame(rdd, schema)
df.show()
+---+----------+
|  A|      date|
+---+----------+
|  3|2012-02-02|
|  5|2018-08-08|
+---+----------+
df.printSchema()
root
 |-- A: integer (nullable = true)
 |-- date: date (nullable = true)

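Best answer

Long story short, a schema used with an RDD of external objects is not meant to be used that way: the declared types should reflect the actual state of the data, not the desired one.

In other words, to allow: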

schema = StructType([
  StructField('A', IntegerType(), True),
  StructField('date', DateType(), True)
])

the data for the date field should use datetime.date. So, for example, with your RDD[Tuple[int, str]]:

import datetime

spark.createDataFrame(
    # Since values from the question are just two element tuples
    # we can use mapValues to transform the "value"
    # but in general case you'll need map
    values.mapValues(datetime.date.fromisoformat),
    schema
)
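For the general case mentioned in the comment above, where the date is not simply the value of a two-element tuple, a plain function passed to map can do the parsing. The following is a minimal sketch rather than part of the original answer: the helper name parse_date_field and its index and fmt parameters are hypothetical. It calls .date() so that the field holds a datetime.date, as the schema expects, rather than the datetime.datetime that strptime returns.

```python
import datetime


def parse_date_field(record, index=1, fmt="%Y-%m-%d"):
    """Return a copy of `record` with the string at `index` parsed into a datetime.date."""
    fields = list(record)
    fields[index] = datetime.datetime.strptime(fields[index], fmt).date()
    return tuple(fields)


# Plain-Python check of the helper, using the sample data from the question.
sample = [(3, '2012-02-02'), (5, '2018-08-08')]
converted = [parse_date_field(t) for t in sample]
print(converted)

# With Spark available, the same function could be passed to the RDD:
#   spark.createDataFrame(values.map(parse_date_field), schema)
```

Because the helper takes the position of the date as a parameter, it would also cover records with more than two fields.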

The closest you can get to the desired behavior is to convert the data (RDD[Row]) with the JSON reader, using dicts:

from pyspark.sql import Row

spark.read.schema(schema).json(rdd.map(Row.asDict))

or better, with explicit JSON dumps:

import json
spark.read.schema(schema).json(rdd.map(Row.asDict).map(json.dumps))
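One reason the explicit dump is the safer variant: a Python dict rendered with str() is not strictly valid JSON, because it uses single quotes. The sketch below uses only the standard library to show the difference on one record from the question. That PySpark falls back to str() for non-string RDD elements, and that Spark's JSON reader tolerates single quotes by default, are assumptions about the library's behaviour, not something stated in the answer.

```python
import json

# What Row.asDict would give for the first row of the question's data.
record = {"A": 3, "date": "2012-02-02"}

as_repr = str(record)         # Python repr, single-quoted
as_json = json.dumps(record)  # strict JSON, double-quoted

print(as_repr)
print(as_json)

# A strict JSON parser accepts only the second form.
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False

print(repr_is_valid_json)
```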

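The JSON route is of course much more expensive than explicit casting, which, by the way, is easy to automate in simple cases like the one you describe: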

from pyspark.sql.functions import col

(spark
    .createDataFrame(values, ("a", "date"))
    .select([col(f.name).cast(f.dataType) for f in schema]))
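The same automation can also be written as SQL expression strings for DataFrame.selectExpr. This is a sketch under assumptions, not the answer's method: the (name, type) pairs below are a hand-written stand-in for the schema, using the short type names that StructField.dataType.simpleString() would return for IntegerType and DateType.

```python
# Hypothetical stand-in for the StructType above: (column name, Spark SQL type name).
fields = [("A", "int"), ("date", "date")]


def cast_expressions(fields):
    """Build one 'CAST(col AS type) AS col' expression per field."""
    return ["CAST(`{0}` AS {1}) AS `{0}`".format(name, type_name)
            for name, type_name in fields]


exprs = cast_expressions(fields)
print(exprs)

# With Spark available, these could be applied as:
#   spark.createDataFrame(values, ("A", "date")).selectExpr(*exprs)
```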

Regarding python - Creating a DataFrame from an RDD while specifying DateType() in the schema, the original question can be found on Stack Overflow: https://stackoverflow.com/questions/55038612/
