apache-spark - Spark sql : GC overhead limit exceeded when reading parquet partitioned files

Reposted by: 行者123 · Updated: 2023-12-04 12:03:37


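I am trying to read existing Parquet files from HDFS with Spark SQL for a POC, but I am running into an OOM error.

I need to read all the partition files for a given partition date. The partitioning is as follows: date/file_dir_id

There are 1200 subfolders under the date folder.

Across all these folders there are 234769 .parquet files in total (none of them very large).

The total size of all the .parquet files is 10 GB.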

Parquet folder structure:

date
    file_dir_1
        File_1.parquet
        File_2.parquet
    file_dir_2
        File_3.parquet

When I try to read the files for a specific date, with the counts mentioned above:
sparkSession.read().schema(someSchema).parquet("hdfs_path_folder/date=2018-03-05/*"); // I get the error shown below

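Other details

Running in YARN cluster mode

Spark 2.3

4-node cluster (32 cores / 128 GB)

5 executors, 4 cores each

Increasing the driver memory or the executor memory does not help. Any suggestions on how to overcome this?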

Error details

java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Unknown Source)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source)
at java.lang.AbstractStringBuilder.append(Unknown Source)
at java.lang.StringBuffer.append(Unknown Source)
at java.net.URI.appendSchemeSpecificPart(Unknown Source)
at java.net.URI.toString(Unknown Source)
at java.net.URI.<init>(Unknown Source)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$3$$anonfun$7.apply(InMemoryFileIndex.scala:235)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$3$$anonfun$7.apply(InMemoryFileIndex.scala:228)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$3.apply(InMemoryFileIndex.scala:228)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$3.apply(InMemoryFileIndex.scala:227)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles(InMemoryFileIndex.scala:227)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles(InMemoryFileIndex.scala:273)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$1.apply(InMemoryFileIndex.scala:172)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$1.apply(InMemoryFileIndex.scala:171)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

Best answer

When Spark reads from Parquet, it internally tries to build an InMemoryFileIndex. Among the Spark jobs you will see one like this:

Listing leaf files and directories for 1200 paths:

The problem occurs because the number of paths to scan is too large.
Increasing the driver memory and cores solved it for me:

 'driver.cores': 4,
 'driver.memory': '8g'
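
As a sketch of how these two settings might be applied: the keys above appear to correspond to Spark's standard `spark.driver.cores` and `spark.driver.memory` properties. That mapping is an assumption, since the answer does not say which tool its configuration format belongs to. In a `spark-defaults.conf`-style fragment it would look like this:

```properties
# Assumed mapping of the answer's 'driver.cores' / 'driver.memory' keys
# onto standard Spark properties; the values are the ones from the answer.
spark.driver.cores   4
spark.driver.memory  8g
```

The same values can be passed on the command line as `spark-submit --driver-cores 4 --driver-memory 8g`. In YARN cluster mode the driver's memory generally has to be set at submission time, because the driver JVM is already running by the time application code builds its `SparkSession`.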

Regarding "apache-spark - Spark sql : GC overhead limit exceeded when reading parquet partitioned files", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50353559/
