gpt4 book ai didi

hadoop - 从S3解压缩

转载 作者:行者123 更新时间:2023-12-02 21:04:42 25 4
gpt4 key购买 nike

我正在使用Spark(pyspark)读取数据,由于我的一些数据现在是.gz格式,因此遇到了麻烦。

%pyspark
data = sc.textFile("s3://mybucket.file.gz")
data.first()


Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-5205743886772607083.py", line 267, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-5205743886772607083.py", line 260, in <module>
exec(code)
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/rdd.py", line 1041, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/usr/lib/spark/python/pyspark/rdd.py", line 1032, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/usr/lib/spark/python/pyspark/rdd.py", line 906, in fold
vals = self.mapPartitions(func).collect()
File "/usr/lib/spark/python/pyspark/rdd.py", line 809, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 4 times, most recent failure: Lost task 0.3 in stage 31.0 (TID 270, ip-172-16-238-231.us-west-1.compute.internal, executor 17): java.io.IOException: incorrect header check
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)

关于如何读入并解压缩有任何想法吗?

最佳答案

奇怪的是,我要做的就是从最后删除“gz”,它确实起作用了。

关于hadoop - 从S3解压缩,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/42302477/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com