dataframe - Spark DataFrame: collect() vs select()

Calling collect() on an RDD returns the entire dataset to the driver program, which can cause out-of-memory errors, so it should be avoided.

Does collect() behave the same way when called on a DataFrame? And what about select()? Does it also work the same way as collect() when called on a DataFrame?

Best Answer

Actions vs Transformations

  • Collect (Action) - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.


spark-sql doc

select(*cols) (transformation) - Projects a set of expressions and returns a new DataFrame.

Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.

df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]


Calling the select(column-name1, column-name2, ...) method on a DataFrame returns a new DataFrame that contains only the columns chosen in the select() call.

For example, suppose df has several columns, including "name" and "value" among others. Then

df2 = df.select("name","value")

means df2 will contain only two of df's columns: "name" and "value".

df2, being the result of a select, remains distributed across the executors rather than being brought to the driver (as would happen with collect()).

sql-programming-guide
df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
# +-------+
# | name|
# +-------+
# |Michael|
# | Andy|
# | Justin|
# +-------+

You can run collect() on a DataFrame (spark docs):
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

spark docs

To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).

A similar question about Spark DataFrames (collect() vs select()) can be found on Stack Overflow: https://stackoverflow.com/questions/44174747/