
azure - How can I efficiently download an entire directory from Azure Data Lake via the Python API?

Reposted. Author: 行者123. Updated: 2023-12-02 07:56:06

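I have some data in Azure Data Lake (a few GB) spread across many files of about 2 MB each. I want to write a download script that fetches the complete directory. So far I have been trying something similar to the tutorial: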

import os
from azure.storage.filedatalake import DataLakeServiceClient

# Connect once and reuse the clients for every file
service_client = DataLakeServiceClient.from_connection_string(azure_connection_string)
file_system_client = service_client.get_file_system_client(file_system="my-file-system")
parent_directory_client = file_system_client.get_directory_client("my-directory")

for file_path in azure_all_files:
    # Download the whole file into memory
    file_client = parent_directory_client.get_file_client(file_path)
    download = file_client.download_file()
    downloaded_bytes = download.readall()

    # Write it to the matching local path
    target_path = os.path.join(self.local_data_directory, file_path)
    with open(target_path, 'wb') as file:
        file.write(downloaded_bytes)

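But this is very slow: about 1 minute per file, i.e. 30 seconds per MB (and no, it is not my internet connection). What am I missing here? Is the Python API the wrong tool? Are some of the calls above redundant? Can this be parallelized?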

Best answer

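I think we can use the ADLDownloader class in the azure.datalake.store package to improve the download rate. It launches multiple threads for efficient downloading, with a chunk size assigned to each thread. The remote path can be a single file, a directory of files, or a glob pattern. An example is here.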

The pseudocode is as follows:

from azure.datalake.store import core, lib
from azure.datalake.store.multithread import ADLDownloader

tenant_id = '<your Azure AD tenant id>'
username = '<your username in AAD>'
password = '<your password>'
store_name = '<your ADL name>'
token = lib.auth(tenant_id, username, password)
# Alternatively, register an app and use its client_id and client_secret to get the token.
# If you want to use this code in an application, I recommend authenticating as a client:
# client_id = '<client id of your app registered in Azure AD, like xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx>'
# client_secret = '<your client secret>'
# token = lib.auth(tenant_id, client_id=client_id, client_secret=client_secret)
adl = core.AzureDLFileSystem(token, store_name=store_name)
# file_path is the remote file, directory, or glob pattern; target_path is the local destination
ADLDownloader(adl, file_path, target_path, nthreads=None, chunksize=268435456, buffersize=4194304,
              blocksize=4194304, client=None, run=True, overwrite=False, verbose=False,
              progress_callback=None, timeout=0)
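
As far as I know, azure.datalake.store targets Data Lake Storage Gen1, while the DataLakeServiceClient in the question comes from the Gen2 SDK. If the account is Gen2, the same idea (many small files fetched by several threads at once) can be sketched with the standard library's thread pool around the question's own calls. This is only a rough sketch: download_directory, download_one, local_root and max_workers are names made up for illustration, and it assumes that one directory client can be shared across threads. If that assumption worries you, create a client per worker instead.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def download_one(directory_client, file_path, local_root):
    """Fetch one remote file and write it under local_root (hypothetical helper)."""
    target_path = os.path.join(local_root, file_path)
    # Create nested local folders so relative paths such as "sub/x.bin" work
    os.makedirs(os.path.dirname(target_path), exist_ok=True)
    file_client = directory_client.get_file_client(file_path)
    # Same calls as in the question: download, then read everything into memory
    data = file_client.download_file().readall()
    with open(target_path, "wb") as handle:
        handle.write(data)
    return target_path


def download_directory(directory_client, file_paths, local_root, max_workers=16):
    """Download many small files concurrently; returns the local paths written.

    Assumption: directory_client may be used from several threads at once.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order and re-raises the first worker error
        return list(pool.map(
            lambda path: download_one(directory_client, path, local_root),
            file_paths,
        ))
```

With the names from the question, the call would look like download_directory(parent_directory_client, azure_all_files, local_data_directory). Threads are a reasonable fit here because each 2 MB download spends most of its time waiting on the network. The value of max_workers is a guess and should be tuned against your own connection.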


Regarding "azure - How can I efficiently download an entire directory from Azure Data Lake via the Python API?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/65571888/
