dataframe - Spark DataFrame: collect() vs select()

Calling collect() on an RDD returns the entire dataset to the driver program, which can cause out-of-memory errors, so it should be avoided.

Does collect() behave the same way when called on a DataFrame? And what about select()? Does it also work the same way as collect() when called on a DataFrame?

Best Answer

Actions vs Transformations

  • Collect (Action) - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.


spark-sql doc

select(*cols) (transformation) - Projects a set of expressions and returns a new DataFrame.

Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.

df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]


Calling the select(column-name1, column-name2, ...) method on a DataFrame returns a new DataFrame that contains only the columns chosen in the select() call.

For example, suppose df has several columns, including "name" and "value" among others. Then

df2 = df.select("name","value")

means df2 will contain only two of df's columns: "name" and "value".

df2, being the result of a select, remains distributed across the executors rather than being brought to the driver (as would happen with collect()).

sql-programming-guide
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
# +-------+
# | name|
# +-------+
# |Michael|
# | Andy|
# | Justin|
# +-------+

You can run collect() on a DataFrame (spark docs):
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

spark docs

To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).

A similar question about Spark DataFrames (collect() vs select()) can be found on Stack Overflow: https://stackoverflow.com/questions/44174747/