gpt4 book ai didi

apache-spark - EMR 5.28 无法在 s3 上加载 Parquet 文件

转载 作者:行者123 更新时间:2023-12-04 05:11:23 27 4
gpt4 key购买 nike

在 EMR 集群 5.28.0 上,从 s3 读取 Parquet 文件失败并出现以下异常,而在 EMR 5.18.0 上同样可以正常工作。下面是 EMR 5.28.0 上的堆栈跟踪。

我什至从 spark-shell 尝试过:

sqlContext.read.load(("s3://s3_file_path/*")
df.take(5)

但失败并出现相同的异常:

Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 17, ip-x.x.x.x.ec2.internal, executor 1): **org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://somedir/somesubdir/434560/1658_1564419581.parquet, range: 0-7928, partition values: [empty row], isDataPresent: false**
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.next(AsyncFileDownloader.scala:142)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.getNextFile(FileScanRDD.scala:241)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:171)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
**Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.org$apache$spark$sql$execution$datasources$parquet$ParquetFileFormat$$isCreatedByParquetMr(ParquetFileFormat.scala:352)
at** org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:676)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildPrefetcherWithPartitionValues$1.apply(ParquetFileFormat.scala:579)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader.org$apache$spark$sql$execution$datasources$AsyncFileDownloader$$downloadFile(AsyncFileDownloader.scala:93)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:73)
at org.apache.spark.sql.execution.datasources.AsyncFileDownloader$$anonfun$initiateFilesDownload$2$$anon$1.call(AsyncFileDownloader.scala:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more

我找不到这个文档。有没有人在 EMR 5.28.0 上遇到过这个问题并且能够解决这个问题?

在 5.28 上,我能够读取由 EMR 写入 s3 的文件,但读取由 parquet-go 写入的现有 parquet 文件会抛出上述异常,而它在 EMR 5.18 上运行良好

更新:在检查 Parquet 文件时,仅适用于 5.18 的旧文件缺少统计数据

creator:            null 
file schema: parquet-go-root
timestringhr: BINARY SNAPPY DO:0 FPO:21015 SZ:1949/25676/13.17 VC:1092 ENC:RLE,BIT_PACKED,PLAIN ST:[no stats for this column]
timeseconds: INT64 SNAPPY DO:0 FPO:22964 SZ:1397/9064/6.49 VC:1092 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 1564419460, max: 1564419581, num_nulls not defined]

在 EMR 5.18 和 5.28 上工作的那些就像

creator:            parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra: org.apache.spark.sql.parquet.row.metadata = {<schema_here>}
timestringhr: BINARY SNAPPY DO:0 FPO:3988 SZ:156/152/0.97 VC:1092 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 2019-07-29 16:00:00, max: 2019-07-29 16:00:00, num_nulls: 0]
timeseconds: INT64 SNAPPY DO:0 FPO:4144 SZ:954/1424/1.49 VC:1092 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 1564419460, max: 1564419581, num_nulls: 0]

这可能会导致 NullPointerException。发现与 parquet-mr 相关的问题 https://issues.apache.org/jira/browse/PARQUET-1217。我可以尝试在类路径中包含更新版本的 parquet 或在 EMR 6 beta 上进行测试,看看是否能解决问题。

最佳答案

尝试将 created_by 值添加到页脚。我将一个 NPE 追踪到 Spark 中的 footer/created_by 检查。如果您正在使用 xitongsys/parquet-go ,请考虑一下:

var writer_version = "parquet-go version 1.0"
...

...
pw, err := writer.NewJSONWriter(schemaStr, fw, 4)
pw.Footer.CreatedBy = &writer_version

关于apache-spark - EMR 5.28 无法在 s3 上加载 Parquet 文件,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/59236709/

27 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com