azure - How to read a Parquet file from Azure Blob into a Pandas DataFrame using server-side column projection?

Reposted. Author: 行者123. Updated: 2023-12-03 06:12:36
Following up on the question: How to read parquet files from Azure Blobs into Pandas DataFrame?

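Is it possible to perform a column projection on the Parquet file at the server level, before downloading it, to be more efficient? I.e. I would like to select only the desired columns before downloading the file.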

Currently, I connect to the Azure service only through a connection string (if that helps) and use the Python client library.

Best answer

Is it possible to perform a column projection on the parquet file at server level before downloading it to be more efficient? I.e. I would like to filter only desired columns before downloading the file.

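To get only the desired columns from a Parquet file in Azure Blob Storage, you can use the following Python code. Note that it downloads the whole blob first and then selects the columns locally with pyarrow, so the projection happens on the client rather than on the server.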

Code:

import pyarrow.parquet as pq
from azure.storage.blob import BlobServiceClient
import pandas as pd

# Set up the BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string('your connection string')

# Get a reference to the Parquet file in Azure Blob Storage
blob_container_client = blob_service_client.get_container_client('test1')
blob_client = blob_container_client.get_blob_client('samplepar.parquet')

# Define the list of columns to read from the Parquet file
columns = ['title', 'salary', 'birthdate', 'id1', 'id2']

# Note: this query string is built but never sent to the service in this snippet
columns_query = ", ".join([f"[{column}]" for column in columns])
query = f"SELECT {columns_query} FROM BlobStorage"

# Download the whole blob to a local file
with open("sample1.parquet", "wb") as file:
    blob_client.download_blob().readinto(file)

# Load the same file and keep only the requested columns that actually exist
table = pq.read_table("sample1.parquet")
available_columns = [column for column in columns if column in table.column_names]
print(available_columns)
if available_columns:
    table = table.select(available_columns)
    df = table.to_pandas()
    print(df)
else:
    print("Error: None of the specified columns are present in the Parquet file.")

Output:

['title', 'salary', 'birthdate']
                    title     salary  birthdate
0        Internal Auditor   49756.53   3/8/1971
1           Accountant IV  150280.17  1/16/1968
2     Structural Engineer  144972.51   2/1/1960
3  Senior Cost Accountant   90263.05   4/8/1997
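
Once the file is on disk, the column check from the answer can be done against the Parquet footer alone, and the selection can be pushed into the read itself so that only the wanted columns are decoded. The following is a minimal sketch of that idea, not the answer author's code; `pick_available` and `read_selected_columns` are hypothetical helper names, and the path is assumed to point at an already downloaded local file.

```python
from typing import Iterable, List


def pick_available(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the requested column names that exist, keeping the requested order."""
    present = set(available)
    return [name for name in requested if name in present]


def read_selected_columns(path: str, requested: List[str]):
    """Read only the requested (and present) columns of a local Parquet file."""
    # Imported here so pick_available can be used without pyarrow installed.
    import pyarrow.parquet as pq

    # read_schema parses the file footer only, not the column data.
    schema_names = pq.read_schema(path).names
    columns = pick_available(requested, schema_names)
    if not columns:
        raise ValueError("None of the requested columns are present in the Parquet file.")
    # columns= restricts reading and decoding to the listed columns.
    return pq.read_table(path, columns=columns).to_pandas()
```

With the file from the answer, this would be called as `df = read_selected_columns("sample1.parquet", columns)`. It saves memory and decoding time compared with reading the full table and calling `select` afterwards, but it does not reduce what is downloaded.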


[Screenshot: the downloaded file]
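
To avoid transferring the whole blob, one option is to let the Parquet reader fetch byte ranges directly from Blob Storage through an fsspec filesystem. The sketch below assumes the third-party `adlfs` package is installed alongside pandas and pyarrow; `blob_url` and `read_blob_columns` are hypothetical helper names. This is still client-side projection: the service only serves byte ranges, and the reader decides which ones to request, so how much data is actually transferred depends on the file layout and on fsspec's caching.

```python
from typing import List


def blob_url(container: str, blob_name: str) -> str:
    """Build an fsspec-style URL for a blob (adlfs registers the abfs:// scheme)."""
    return f"abfs://{container}/{blob_name.lstrip('/')}"


def read_blob_columns(connection_string: str, container: str,
                      blob_name: str, columns: List[str]):
    """Read selected columns of a Parquet blob into a pandas DataFrame."""
    # Imported lazily; this function needs pandas, pyarrow and adlfs installed.
    import pandas as pd

    return pd.read_parquet(
        blob_url(container, blob_name),
        columns=columns,
        # storage_options is forwarded to the adlfs filesystem constructor.
        storage_options={"connection_string": connection_string},
    )
```

For the blob used in the answer, a call could look like `read_blob_columns('your connection string', 'test1', 'samplepar.parquet', ['title', 'salary', 'birthdate'])`, passing only column names that exist in the file.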

Regarding "azure - How to read a Parquet file from Azure Blob into a Pandas DataFrame using server-side column projection?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76582862/
