python - PySpark: TypeError: unsupported operand type(s) for +: 'datetime.datetime' and 'str'

Reposted. Author: 行者123. Updated: 2023-12-01 00:43:53

I have a DataFrame in PySpark with the following schema:

root
|-- id: string (nullable = true)
|-- date: timestamp (nullable = true)
|-- time: string (nullable = true)
|-- start: timestamp (nullable = true)
|-- end: timestamp (nullable = true)

I want to add one more column, date_time, of type timestamp:

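from datetime import datetime

from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

to_datetime_func = udf(lambda d, t: datetime.strptime(d + " " + t, "%Y-%m-%d %H:%M:%S"), TimestampType())
df = df.withColumn("date_time", to_datetime_func("date", "time"))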

This code compiles fine. However, when I run a simple filter operation that uses the date_time column, I get an error:

root
|-- id: string (nullable = true)
|-- date_time: timestamp (nullable = true)
|-- start: timestamp (nullable = true)
|-- end: timestamp (nullable = true)


from pyspark.sql import functions as func

df \
.filter(func.col("date_time") >= func.col("start")) \
.select("id","date_time","start") \
.show()

Error:

Py4JJavaError: An error occurred while calling o2966.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 30.0 failed 4 times, most recent failure: Lost task 2.3 in stage 30.0 (TID 765, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 403, in main
process()
File "/databricks/spark/python/pyspark/worker.py", line 398, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/databricks/spark/python/pyspark/serializers.py", line 365, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/databricks/spark/python/pyspark/serializers.py", line 147, in dump_stream
for obj in iterator:
File "/databricks/spark/python/pyspark/serializers.py", line 354, in _batched
for item in iterator:
File "<string>", line 1, in <lambda>
File "/databricks/spark/python/pyspark/worker.py", line 83, in <lambda>
return lambda *a: toInternal(f(*a))
File "/databricks/spark/python/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "<command-4293391875175815>", line 1, in <lambda>
TypeError: unsupported operand type(s) for +: 'datetime.datetime' and 'str'

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:490)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:444)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:638)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:299)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:383)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2076)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:223)

Update:

my_concat_func = udf(lambda d, t: datetime.strptime(d + " " + t, "%Y-%m-%d %H:%M:%S"), StringType())
df = df.withColumn("date", df["date"].cast(StringType()))
df = df.withColumn("date_time", my_concat_func("date","time"))


df.select("date","time","date_time").printSchema()

root
|-- date: string (nullable = true)
|-- time: string (nullable = true)
|-- date_time: string (nullable = true)


df.select("date","time","date_time").show()

ValueError: unconverted data remains: 03:34:26
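
Both errors can be reproduced without Spark, which may make the cause easier to see. The sketch below is plain Python with a hypothetical sample row (the date 2019-07-02 is made up; the time string is taken from the error message above). It assumes that a timestamp column reaches a Python UDF as a datetime.datetime, as the TypeError suggests, and that casting a midnight timestamp to string yields text of the form "2019-07-02 00:00:00", which is imitated here with strftime.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical sample row: `date` arrives as a datetime, `time` as a string.
d = datetime(2019, 7, 2)
t = "03:34:26"

# Original UDF body: adding a str to a datetime is not defined.
try:
    d + " " + t
    first_error = None
except TypeError as exc:
    first_error = str(exc)

# After the cast, the date string already carries a midnight time part,
# so the concatenated text contains two time parts and FMT only consumes one.
d_str = d.strftime(FMT)  # stands in for the Spark string cast (assumption)
try:
    datetime.strptime(d_str + " " + t, FMT)
    second_error = None
except ValueError as exc:
    second_error = str(exc)

print(first_error)
print(second_error)
```

In other words, the first attempt fails because of the operand types, and the second because the concatenated string is longer than the format expects.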

Best answer

Could you try this and let me know the output:

timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
df \
.filter(func.unix_timestamp('date_time', format=timeFmt) >= func.unix_timestamp('start', format=timeFmt)) \
.select("id","date_time","start") \
.show()

Edit

To get only the date and not the time:

df = df.withColumn("new_data", func.to_date(df.date, 'yyyy-MM-dd'))
df.printSchema()

df = df.withColumn("new_data", df['new_data'].cast(StringType()))
df.show(10, False)
df.printSchema()

#### Output ####
+------------------------+
|date |
+------------------------+
|2015-07-02T11:22:21.050Z|
|2016-03-20T21:00:00.000Z|
+------------------------+
root
|-- date: string (nullable = true)
|-- new_data: date (nullable = true)
+------------------------+----------+
|date |new_data |
+------------------------+----------+
|2015-07-02T11:22:21.050Z|2015-07-02|
|2016-03-20T21:00:00.000Z|2016-03-20|
+------------------------+----------+
root
|-- date: string (nullable = true)
|-- new_data: string (nullable = true)
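
The same "date only" idea can also be applied inside the Python function from the question, so that a datetime is never concatenated with a string. What follows is only a minimal sketch in plain Python, not the answerer's method: combine_date_time is a hypothetical helper name, and it assumes the time column always has the form HH:MM:SS, as the question's format string implies.

```python
from datetime import datetime


def combine_date_time(d, t):
    """Hypothetical helper: merge the date part of `d` with a time string `t`."""
    if d is None or t is None:
        return None  # both columns are nullable in the question's schema
    parsed_time = datetime.strptime(t, "%H:%M:%S").time()
    # Drop whatever time `d` carries and attach the parsed one instead.
    return datetime.combine(d.date(), parsed_time)


result = combine_date_time(datetime(2019, 7, 2), "03:34:26")
print(result)
```

A function like this could presumably be wrapped with udf(combine_date_time, TimestampType()) and applied to the original, uncast date column; that step has not been run against Spark here. Spark's built-in functions such as func.date_format, func.concat_ws and func.to_timestamp may allow the same result without a Python UDF at all.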

Regarding python - PySpark: TypeError: unsupported operand type(s) for +: 'datetime.datetime' and 'str', we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57144339/
