apache-spark - Column is not iterable in pySpark


So, we're a bit confused. In a Jupyter Notebook, we have the following DataFrame:

+--------------------+--------------+-------------+--------------------+--------+-------------------+ 
| created_at|created_at_int| screen_name| hashtags|ht_count| single_hashtag|
+--------------------+--------------+-------------+--------------------+--------+-------------------+
|2017-03-05 00:00:...| 1488672001| texanraj| [containers, cool]| 1| containers|
|2017-03-05 00:00:...| 1488672001| texanraj| [containers, cool]| 1| cool|
|2017-03-05 00:00:...| 1488672002| hubskihose|[automation, future]| 1| automation|
|2017-03-05 00:00:...| 1488672002| hubskihose|[automation, future]| 1| future|
|2017-03-05 00:00:...| 1488672002| IBMDevOps| [DevOps]| 1| devops|
|2017-03-05 00:00:...| 1488672003|SoumitraKJana|[VoiceOfWipro, Cl...| 1| voiceofwipro|
|2017-03-05 00:00:...| 1488672003|SoumitraKJana|[VoiceOfWipro, Cl...| 1| cloud|
|2017-03-05 00:00:...| 1488672003|SoumitraKJana|[VoiceOfWipro, Cl...| 1| leader|
|2017-03-05 00:00:...| 1488672003|SoumitraKJana| [Cloud, Cloud]| 1| cloud|
|2017-03-05 00:00:...| 1488672003|SoumitraKJana| [Cloud, Cloud]| 1| cloud|
|2017-03-05 00:00:...| 1488672004|SoumitraKJana|[VoiceOfWipro, Cl...| 1| voiceofwipro|
|2017-03-05 00:00:...| 1488672004|SoumitraKJana|[VoiceOfWipro, Cl...| 1| cloud|
|2017-03-05 00:00:...| 1488672004|SoumitraKJana|[VoiceOfWipro, Cl...| 1|managedfiletransfer|
|2017-03-05 00:00:...| 1488672004|SoumitraKJana|[VoiceOfWipro, Cl...| 1| asaservice|
|2017-03-05 00:00:...| 1488672004|SoumitraKJana|[VoiceOfWipro, Cl...| 1| interconnect2017|
|2017-03-05 00:00:...| 1488672004|SoumitraKJana|[VoiceOfWipro, Cl...| 1| hmi|
|2017-03-05 00:00:...| 1488672005|SoumitraKJana|[Cloud, ManagedFi...| 1| cloud|
|2017-03-05 00:00:...| 1488672005|SoumitraKJana|[Cloud, ManagedFi...| 1|managedfiletransfer|
|2017-03-05 00:00:...| 1488672005|SoumitraKJana|[Cloud, ManagedFi...| 1| asaservice|
|2017-03-05 00:00:...| 1488672005|SoumitraKJana|[Cloud, ManagedFi...| 1| interconnect2017|
+--------------------+--------------+-------------+--------------------+--------+-------------------+
only showing top 20 rows

root
|-- created_at: timestamp (nullable = true)
|-- created_at_int: integer (nullable = true)
|-- screen_name: string (nullable = true)
|-- hashtags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- ht_count: integer (nullable = true)
|-- single_hashtag: string (nullable = true)

We are trying to get per-hour counts for each hashtag. The approach we're taking is to use a Window partitioned by single_hashtag. Something like this:

# create WindowSpec
from pyspark.sql.window import Window

# rangeBetween(-3600, 3600) covers rows within one hour (in seconds)
# of the current row's created_at_int, per hashtag
hashtags_24_winspec = Window.partitionBy(hashtags_24.single_hashtag) \
    .orderBy(hashtags_24.created_at_int) \
    .rangeBetween(-3600, 3600)

However, when we try to sum the ht_count column using:

sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)

we get the following error:

Column is not iterable
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/sql/column.py", line 240, in __iter__
raise TypeError("Column is not iterable")
TypeError: Column is not iterable

The error message isn't very informative, and we're confused about which column we're actually supposed to investigate. Any ideas?

Best Answer

You are using the wrong sum: without an explicit import, this is Python's built-in sum, which tries to iterate over its argument and therefore trips Column.__iter__, raising the TypeError above. Use pyspark.sql.functions.sum instead:

from pyspark.sql.functions import sum

sum_count_over_time = sum(hashtags_24.ht_count).over(hashtags_24_winspec)
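As a quick sanity check that the corrected expression works end to end, here is a minimal sketch of attaching it to the DataFrame (the output column name ht_count_over_time is ours, for illustration):

# attach the windowed sum as a new column and inspect the result
hourly = hashtags_24.withColumn("ht_count_over_time", sum_count_over_time)
hourly.select("single_hashtag", "created_at_int", "ht_count_over_time").show()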

In practice, you may want an alias or a module import so you don't shadow the built-in sum:

from pyspark.sql.functions import sum as sql_sum

# or

import pyspark.sql.functions as F
F.sum(...)
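Putting it all together, here is a self-contained sketch using the module-import style; the SparkSession setup and the toy rows standing in for hashtags_24 are our own assumptions, included only so the snippet runs on its own:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy stand-in for hashtags_24 (illustrative data, not from the question)
hashtags_24 = spark.createDataFrame(
    [(1488672001, "containers", 1),
     (1488672002, "cloud", 1),
     (1488672003, "cloud", 1),
     (1488675700, "cloud", 1)],
    ["created_at_int", "single_hashtag", "ht_count"],
)

# rangeBetween operates on the orderBy column's values, so -3600/+3600
# means rows within one hour of the current row's timestamp
hashtags_24_winspec = (
    Window.partitionBy(hashtags_24.single_hashtag)
          .orderBy(hashtags_24.created_at_int)
          .rangeBetween(-3600, 3600)
)

hourly = hashtags_24.withColumn(
    "ht_count_over_time",
    F.sum(hashtags_24.ht_count).over(hashtags_24_winspec),
)
hourly.show()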

Regarding apache-spark - Column is not iterable in pySpark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42754922/
