pandas - pyspark/EMR 中大型 DataFrame 上的collect() 或 toPandas()-6ren

pandas - pyspark/EMR 中大型 DataFrame 上的collect() 或 toPandas()

转载作者：行者123 更新时间：2023-12-02 05:14:03

我有一个由一台机器“c3.8xlarge”组成的 EMR 集群，在阅读了一些资源后，我了解到我必须允许相当数量的堆外内存，因为我使用的是 pyspark，所以我按如下方式配置了集群:

一名执行人:

spark.executor.内存6g
spark.executor.cores 10
spark.yarn.executor.memoryOverhead 4096

驱动程序:

spark.driver.内存21g

当我 cache() DataFrame 时，它需要大约 3.6GB 内存。

现在，当我在 DataFrame 上调用 collect() 或 toPandas() 时，进程崩溃。

我知道我将大量数据带入驱动程序，但我认为它并没有那么大，而且我无法找出崩溃的原因。

当我调用 collect() 或 toPandas() 时，我收到此错误:

Py4JJavaError: An error occurred while calling o181.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 6.0 failed 4 times, most recent failure: Lost task 5.3 in stage 6.0 (TID 110, ip-10-0-47-207.prod.eu-west-1.hs.internal, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container marked as failed: container_1511879540686_0005_01_000016 on host: ip-10-0-47-207.prod.eu-west-1.hs.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2803)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
    at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2800)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

====更新====

正如@user6910411所建议的，我已经尝试了提到的解决方案here ，在这种情况下我收到以下错误:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 2.0 failed 4 times, most recent failure: Lost task 7.3 in stage 2.0 (TID 41, ip-10-0-33-57.prod.eu-west-1.hs.internal, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 13.5 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

关于这里发生的事情有任何提示吗？

最佳答案

TL;DR我相信您严重低估了内存需求。

即使假设数据已完全缓存，存储信息也只会显示将数据带回驱动程序所需的峰值内存的一小部分。

首先 Spark SQL 使用 compressed columnar storage用于缓存。根据数据分布和压缩算法，内存大小可能比未压缩的 Pandas 输出小得多，更不用说简单的 List[Row]。后者还存储列名称，进一步增加内存使用量。
数据收集是间接的，数据同时存储在 JVM 端和 Python 端。虽然一旦数据通过套接字就可以释放 JVM 内存，但峰值内存使用量应该考虑到两者。
简单的toPandas实现首先收集Rows，then creates Pandas DataFrame locally 。这进一步增加(可能加倍)内存使用量。幸运的是，这部分已经在 master (Spark 2.3) 上得到解决，并使用 Arrow 序列化 ( SPARK-13534 - Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas ) 提供更直接的方法。
对于独立于 Apache Arrow 的可能解决方案，您可以检查 Faster and Lower memory implementation toPandas在 Apache Spark 开发人员列表中。

由于数据实际上相当大，我会考虑将其写入 Parquet 并使用 PyArrow ( Reading and Writing the Apache Parquet Format ) 直接在 Python 中读回，完全跳过所有中间阶段。

关于pandas - pyspark/EMR 中大型 DataFrame 上的collect() 或 toPandas()，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/48692365/

文章推荐： java - 在 JavaFX 中添加/删除动画

文章推荐： python - 从数据框中按索引删除行

文章推荐： java - 过滤后的组合框中的文本编辑器不显示文本

文章推荐： java - 将 sLine 字符串解析为整数时出错

sql - Postgres 中大 INSERT 后查询缓慢
我们在 RedHat 中使用 Postgres 9.2。我们有一个类似于以下的表: CREATE TABLE BULK_WI ( BULK_ID INTEGER NOT NULL, U
c - printf() 中大 float 的奇怪行为并分配给一个 int
根据我的计算，将浮点值转换为计算机存储的二进制值(符号、指数、尾数格式)，在 32 位中，1 位用于符号，8 位用于指数。所以只剩下 23 位来表示数字。所以我认为具有正确行为的浮点值范围仅为 0
mysql - 使用 InnoDB 引擎比较 MySQL 中大 'text' 类型值的最有效方法
我有一个像这样的临时表: CREATE TABLE `staging` ( `created_here_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTA
html - 列表元素符号在 Mozilla 和 IE 中的间距比在 Chrome 中大
下面是我的 HTML: Fact Sheet Facilities and Administrative (F&A) Cost Agreem
java - 为什么 .add(i, E) 在 Java ArrayList 中大 O(n)？
我想知道为什么 .add(i, E) 是 O(n) 而 .get(i) 是 O(1)？是不是因为 n 元素在插入后必须向右移动？最佳答案记住大 O 表示法显示问题的数量级而不是最佳情况解决方案..
c++ - 对于相同的 c++ 源文件，其 gcc 可执行文件在 Windows 中比在 Linux 中大 655 倍。为什么差别这么大？
我在装有 GCC 4.8.2 的 Windows 8.1、Intel i7-3517U 64 位笔记本电脑上测试这个简单的 C++ 代码。 #include using namespace std;

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

pandas - pyspark/EMR 中大型 DataFrame 上的collect() 或 toPandas()