
apache-spark - Selecting columns that satisfy a condition


I am running the following notebook in Zeppelin:

%spark.pyspark
l = [('user1', 33, 1.0, 'chess'), ('user2', 34, 2.0, 'tenis'), ('user3', None, None, ''), ('user4', None, 4.0, ' '), ('user5', None, 5.0, 'ski')]
df = spark.createDataFrame(l, ['name', 'age', 'ratio', 'hobby'])
df.printSchema()
df.show()

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- ratio: double (nullable = true)
|-- hobby: string (nullable = true)
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1| 33| 1.0|chess|
|user2| 34| 2.0|tenis|
|user3|null| null| |
|user4|null| 4.0| |
|user5|null| 5.0| ski|
+-----+----+-----+-----+

from pyspark.sql.functions import count

agg_df = df.select(*[(1.0 - (count(c) / count('*'))).alias(c) for c in df.columns])
agg_df.show()

+----+---+-------------------+-----+
|name|age| ratio|hobby|
+----+---+-------------------+-----+
| 0.0|0.6|0.19999999999999996| 0.0|
+----+---+-------------------+-----+

Now I only want to select the columns of agg_df whose value is < 0.35. In this case it should return ['name', 'ratio', 'hobby'].

I can't figure out how to do this. Any hints?

Best Answer

You mean values < 0.35? This should do it:

>>> [ key for (key,value) in agg_df.collect()[0].asDict().items() if value < 0.35  ]
['hobby', 'ratio', 'name']
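
If you then want to keep only those columns in the original DataFrame, here is a minimal sketch; the threshold 0.35 comes from the question, and the name cols_to_keep is just illustrative:

# columns of df whose missing-value fraction is below the threshold
cols_to_keep = [key for (key, value) in agg_df.collect()[0].asDict().items() if value < 0.35]
df.select(*cols_to_keep).show()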

Use the following udf to replace blank strings with null:
from pyspark.sql.functions import udf
# map empty or whitespace-only strings to None, leave other values unchanged
process = udf(lambda x: None if not x else (x if x.strip() else None))
df.withColumn('hobby', process(df.hobby)).show()
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1| 33| 1.0|chess|
|user2| 34| 2.0|tenis|
|user3|null| null| null|
|user4|null| 4.0| null|
|user5|null| 5.0| ski|
+-----+----+-----+-----+
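
The same cleanup can also be done with built-in column functions instead of a Python udf, which keeps the work inside the JVM; a sketch, assuming the column to clean is hobby and the name df_clean is just illustrative:

from pyspark.sql.functions import when, trim, col

# replace empty or whitespace-only hobby values with null using built-in functions
df_clean = df.withColumn('hobby', when(trim(col('hobby')) == '', None).otherwise(col('hobby')))
df_clean.show()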

On apache-spark - selecting columns that satisfy a condition, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44113436/
