
python - Converting .xlsx in BLOB storage to .csv with pandas, without downloading to the local machine

Reposted. Author: 行者123. Updated: 2023-12-02 07:22:54

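I am working on converting .xlsx files to .csv. I tested a Python script locally that downloads an .xlsx file from a container in blob storage, manipulates the data, saves the result as a .csv file (using pandas), and uploads it to a new container. Now I need to bring the Python script into ADF (Azure Data Factory) to build a pipeline that automates the task. I am facing two problems: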

  1. First problem: I can't figure out how to complete the task without downloading the file to my local machine.

I found these threads/tutorials, but the "azure" v5.0.0 meta-package is deprecated:

read excel files from "input" blob storage container and export to csv in "output" container with python

Tutorial: Run Python scripts through Azure Data Factory using Azure Batch

My code so far is:

from io import BytesIO

import pandas as pd
from azure.storage.blob import BlobServiceClient, ContainerClient

# Create the BlobServiceClient that is used to call the Blob service for the storage account
conn_str = 'XXXX;EndpointSuffix=core.windows.net'
blob_service_client = BlobServiceClient.from_connection_string(conn_str=conn_str)
container_name = "input"
blob_name = "prova/excel/AAA_prova1.xlsx"

# Download the .xlsx blob into memory and parse it, skipping the first 4 rows
container = ContainerClient.from_connection_string(conn_str=conn_str, container_name=container_name)
downloaded_blob = container.download_blob(blob_name)
df = pd.read_excel(BytesIO(downloaded_blob.readall()), skiprows=4)

# Write the CSV to the local disk (the local step that should be avoided)
full_path_to_file = r'C:\mypath/AAA_prova2.csv'
df.to_csv(full_path_to_file, encoding='utf-8-sig', index=False)
local_file_name = 'prova/csv/AAA_prova2.csv'  # blob names use forward slashes

# Upload the local CSV file to the blob container
blob_client = blob_service_client.get_blob_client(
    container=container_name, blob=local_file_name)
with open(full_path_to_file, "rb") as data:
    blob_client.upload_blob(data)
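  2. Second problem: with this approach I can only handle one specific blob name, but in the future I will have to parameterize the script (i.e. select only blob names that start with AAA_). I don't understand whether I have to handle this in the Python script or whether I can filter the files through ADF (i.e. add a file-filtering task before running the Python script). I couldn't find any tutorial or code snippet; any help, hint, or documentation would be much appreciated.

Edit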

I modified the code to avoid downloading to the local machine, and now it works (problem #1 solved):

from io import BytesIO

import pandas as pd
from azure.storage.blob import BlobServiceClient

filename = "excel/prova.xlsx"
container_name = "input"

blob_service_client = BlobServiceClient.from_connection_string("XXXX==;EndpointSuffix=core.windows.net")
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(filename)
streamdownloader = blob_client.download_blob()

# Download the blob into an in-memory buffer instead of a local file
stream = BytesIO()
streamdownloader.download_to_stream(stream)
stream.seek(0)  # rewind so the buffer is read from the start

df = pd.read_excel(stream, skiprows=5)

local_file_name_out = "csv/prova.csv"
container_name_out = "input"

blob_client = blob_service_client.get_blob_client(
    container=container_name_out, blob=local_file_name_out)
# to_csv returns a str when no path is given, so encode explicitly to keep the utf-8-sig BOM
csv_bytes = df.to_csv(index=False).encode('utf-8-sig')
blob_client.upload_blob(csv_bytes)
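
For the second problem (processing only blobs whose names start with AAA_), one option is to filter inside the Python script. The sketch below relies on `ContainerClient.list_blobs(name_starts_with=...)`, which filters on the full blob name on the service side. The helper name `iter_matching_blob_names` and the folder/prefix values are assumptions made for illustration, not part of the original question. On the ADF side, a Get Metadata activity followed by a Filter activity may be an alternative; that route is not shown here.

```python
from typing import Iterator


def iter_matching_blob_names(container_client, folder: str, prefix: str) -> Iterator[str]:
    """Yield the names of .xlsx blobs in `folder` whose file name starts with `prefix`.

    `container_client` is expected to behave like azure.storage.blob.ContainerClient.
    This helper is hypothetical and only illustrates the filtering step.
    """
    # list_blobs matches on the full blob name, so the virtual folder
    # and the file-name prefix are joined before the call.
    full_prefix = f"{folder.rstrip('/')}/{prefix}" if folder else prefix
    for blob in container_client.list_blobs(name_starts_with=full_prefix):
        # Skip anything under the prefix that is not an Excel workbook.
        if blob.name.lower().endswith(".xlsx"):
            yield blob.name
```

With the `container_client` from the code above, `iter_matching_blob_names(container_client, "prova/excel", "AAA_")` would give the blob names to feed into the download, convert, and upload steps one at a time.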

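Best answer

Use Azure Functions, with the Python 3.8 version of the runtime. Wait for a blob trigger fired by the Excel file, then do the processing, reusing most of your code for the final step.

Note the split that removes the .xlsx from the file name.

This is what I ended up with: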

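from io import BytesIO

import pandas as pd

# Fragment from inside the blob-triggered function. uploadedxlsx is the trigger's
# input blob; its name is expected to look like "<container>/Received/<file>.xlsx".
# account_name and blob_service_client are assumed to be defined earlier.
source_blob = f"https://{account_name}.blob.core.windows.net/{uploadedxlsx.name}"
file_name = uploadedxlsx.name.split("/")[2]
container_name = "container"
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(f"Received/{file_name}")
streamdownloader = blob_client.download_blob()

# Download the blob into an in-memory buffer instead of a local file
stream = BytesIO()
streamdownloader.download_to_stream(stream)
stream.seek(0)  # rewind so the buffer is read from the start

df = pd.read_excel(stream)

# Drop the extension (this keeps only the text before the first dot)
file_name_t = file_name.split(".")[0]

local_file_name_out = f"Converted/{file_name_t}.csv"
container_name_out = "out_container"

blob_client = blob_service_client.get_blob_client(
    container=container_name_out, blob=local_file_name_out)
# to_csv returns a str when no path is given, so encode explicitly to keep the utf-8-sig BOM
csv_bytes = df.to_csv(index=False).encode('utf-8-sig')
blob_client.upload_blob(csv_bytes)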
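
The split on "." used to drop the extension keeps only the text before the first dot, so a name such as report.v2.xlsx would become report. A minimal sketch of a more tolerant way to build the output blob name is shown below, using only the standard library. The helper name `csv_blob_name` is hypothetical, and the default "Converted" folder is taken from the answer's code.

```python
from pathlib import PurePosixPath


def csv_blob_name(xlsx_blob_name: str, out_folder: str = "Converted") -> str:
    """Map an input blob name to the output .csv blob name.

    Blob names use forward slashes, so PurePosixPath splits them the same
    way on every operating system. Only the last extension is removed.
    """
    stem = PurePosixPath(xlsx_blob_name).stem
    return f"{out_folder}/{stem}.csv"


print(csv_blob_name("container/Received/report.v2.xlsx"))
```

This also removes the dependency on the file name sitting at a fixed position in the path, which the split("/")[2] call assumes.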

Regarding "python - Converting .xlsx in BLOB storage to .csv with pandas, without downloading to the local machine", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61859634/
