python - Can the Python fastparquet module read compressed Parquet files?


Reposted by: 太空狗 | Updated: 2023-10-30 00:27:29


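Our Parquet files are stored in an AWS S3 bucket and are compressed with SNAPPY. I can read the uncompressed version of a Parquet file with the Python fastparquet module, but not the compressed version.

This is the code I use for the uncompressed file: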

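import s3fs
from fastparquet import ParquetFile
# Placeholder credentials; fastparquet opens the S3 object through s3.open
s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')
myopen = s3.open
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.parquet', open_with=myopen)
df = pf.to_pandas()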

This runs without any error, but when I try to read the Snappy-compressed version of the file:

pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet', open_with=myopen)

I get an error when calling to_pandas():

df=pf.to_pandas()

Error message:

KeyError                                  Traceback (most recent call last)
in ()
----> 1 df=pf.to_pandas()

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    293 for (name, v) in views.items()}
    294 self.read_row_group(rg, columns, categories, infile=f,
--> 295 index=index, assign=parts)
    296 start += rg.num_rows
    297 else:

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign)
    151 core.read_row_group(
    152 infile, rg, columns, categories, self.helper, self.cats,
--> 153 self.selfmade, index=index, assign=assign)
    154 if ret:
    155 return df

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign)
    300 raise RuntimeError('Going with pre-allocation!')
    301 read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 302 cats, selfmade, assign=assign)
    303
    304 for cat in cats:

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign)
    289 read_col(column, schema_helper, file, use_cat=use,
    290 selfmade=selfmade, assign=out[name],
--> 291 catdef=out[name+'-catdef'] if use else None)
    292
    293

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_col(column, schema_helper, infile, use_cat, grab_dict, selfmade, assign, catdef)
    196 dic = None
    197 if ph.type == parquet_thrift.PageType.DICTIONARY_PAGE:
--> 198 dic = np.array(read_dictionary_page(infile, schema_helper, ph, cmd))
    199 ph = read_thrift(infile, parquet_thrift.PageHeader)
    200 dic = convert(dic, se)

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_dictionary_page(file_obj, schema_helper, page_header, column_metadata)
    152 Consumes data using the plain encoding and returns an array of values.
    153 """
--> 154 raw_bytes = _read_page(file_obj, page_header, column_metadata)
    155 if column_metadata.type == parquet_thrift.Type.BYTE_ARRAY:
    156 # no faster way to read variable-length-strings?

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in _read_page(file_obj, page_header, column_metadata)
     28 """Read the data page from the given file-object and convert it to raw, uncompressed bytes (if necessary)."""
     29 raw_bytes = file_obj.read(page_header.compressed_page_size)
---> 30 raw_bytes = decompress_data(raw_bytes, column_metadata.codec)
     31
     32 assert len(raw_bytes) == page_header.uncompressed_page_size, \

/opt/conda/lib/python3.5/site-packages/fastparquet/compression.py in decompress_data(data, algorithm)
     48 def decompress_data(data, algorithm='gzip'):
     49 if isinstance(algorithm, int):
---> 50 algorithm = rev_map[algorithm]
     51 if algorithm.upper() not in decompressions:
     52 raise RuntimeError("Decompression '%s' not available. Options: %s" %

KeyError: 1

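Best answer

The error probably means that the library needed to decompress SNAPPY was not found on your system, although the error message could clearly be more helpful!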
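To make the bare `KeyError: 1` easier to read: in the Parquet format's `CompressionCodec` enum, the id 1 stands for SNAPPY, and the traceback shows the lookup `rev_map[algorithm]` failing for that id. The sketch below is only an illustration of how such an error could be turned into a readable message. `PARQUET_CODEC_NAMES` and `to_pandas_with_hint` are hypothetical names, not part of fastparquet, and the helper assumes that an integer `KeyError` raised by `to_pandas()` is a codec id, which may not hold in every case.

```python
# Codec ids as defined by the Parquet format's CompressionCodec enum.
PARQUET_CODEC_NAMES = {
    0: "UNCOMPRESSED",
    1: "SNAPPY",
    2: "GZIP",
    3: "LZO",
    4: "BROTLI",
    5: "LZ4",
    6: "ZSTD",
}


def to_pandas_with_hint(parquet_file):
    """Call parquet_file.to_pandas() and explain codec-related KeyErrors.

    Hypothetical helper for illustration; it is not part of fastparquet.
    """
    try:
        return parquet_file.to_pandas()
    except KeyError as exc:
        codec_id = exc.args[0] if exc.args else None
        if codec_id not in PARQUET_CODEC_NAMES:
            raise  # not a known codec id, so leave the error untouched
        name = PARQUET_CODEC_NAMES[codec_id]
        raise RuntimeError(
            "Parquet codec id %r (%s) has no decompressor available; "
            "the matching compression library may be missing" % (codec_id, name)
        ) from exc
```

With the `pf` object from the question, `to_pandas_with_hint(pf)` would raise a `RuntimeError` that names SNAPPY instead of the bare `KeyError: 1`.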

Depending on your system, one of the following commands may fix this for you:

conda install python-snappy

or

pip install python-snappy

If you are on Windows, the build chain may not work, and you may need to install it from here.
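After installing, a quick way to confirm that the Snappy bindings are usable from the same Python environment is a compress/decompress round trip. This is a minimal sketch, assuming the `python-snappy` package exposes the module `snappy` with `compress` and `decompress`; the function name `snappy_roundtrip_ok` is made up for this example. It relates to the fastparquet version shown in the traceback; later releases may obtain their codecs differently.

```python
def snappy_roundtrip_ok(payload: bytes = b"parquet" * 100) -> bool:
    """Return True if python-snappy imports and round-trips the payload."""
    try:
        import snappy  # module installed by the python-snappy package
    except ImportError:
        return False
    # Guard against an unrelated module that happens to be named "snappy".
    if not (hasattr(snappy, "compress") and hasattr(snappy, "decompress")):
        return False
    return snappy.decompress(snappy.compress(payload)) == payload


if __name__ == "__main__":
    if snappy_roundtrip_ok():
        print("python-snappy is available")
    else:
        print("python-snappy is missing or not usable in this environment")
```

If this reports that the library is missing even after installation, the package was probably installed into a different environment than the one running the notebook (the traceback above uses /opt/conda).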

Regarding "python - Can the Python fastparquet module read compressed Parquet files?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42234944/
