
pandas - Saving a Pandas or PySpark dataframe from Databricks to Azure Blob Storage


Is there a way to save a PySpark or Pandas dataframe from Databricks to Blob Storage without mounting the storage or installing libraries?

I was able to achieve this after mounting the storage container in Databricks and using the com.crealytics.spark.excel library, but I'd like to know whether the same can be done without mounting or library installation, because I'll be working on a cluster that doesn't grant either of those two permissions.

Best Answer

Here is the code that first saves the dataframe to DBFS:

# export 
from os import path

folder = "export"
name = "export"
file_path_name_on_dbfs = path.join("/tmp", folder, name)

# Writing to DBFS
# .coalesce(1) is used to generate a single output file; if the dataframe is too big
# this won't work, you'll get multiple part files and need to copy them later one by one
sampleDF \
.coalesce(1) \
.write \
.mode("overwrite") \
.option("header", "true") \
.option("delimiter", ";") \
.option("encoding", "UTF-8") \
.csv(file_path_name_on_dbfs)

# path of destination, which will be sent to az storage
dest = file_path_name_on_dbfs + ".csv"

# Renaming part-000...csv to our file name
target_file = list(filter(lambda file: file.name.startswith("part-00000"), dbutils.fs.ls(file_path_name_on_dbfs)))
if len(target_file) > 0:
    dbutils.fs.mv(target_file[0].path, dest)
    dbutils.fs.cp(dest, f"file://{dest}")  # community edition only: /dbfs is not recognized there, so copy the file to the driver's local filesystem
    dbutils.fs.rm(file_path_name_on_dbfs, True)
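If you start from a pandas dataframe instead (or the Spark dataframe is small enough to collect to the driver), a minimal sketch along these lines skips the part-file renaming entirely; dest then points at a plain local file and the upload code below works unchanged. The toPandas() call and the /tmp path here are assumptions: the data has to fit in driver memory.

import os

# Sketch, assuming the data fits in driver memory:
# toPandas() collects the Spark dataframe to the driver, and pandas
# writes a single CSV directly, so there is no part-00000 file to rename.
pdf = sampleDF.toPandas()
dest = "/tmp/export/export.csv"
os.makedirs(os.path.dirname(dest), exist_ok=True)
pdf.to_csv(dest, sep=";", index=False, encoding="utf-8")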

Code to send the file to Azure Storage:

import requests

sas="YOUR_SAS_TOKEN_PREVIOUSLY_CREATED" # follow the link below to create SAS token (using sas is slightly more secure than raw key storage)
blob_account_name = "YOUR_BLOB_ACCOUNT_NAME"
container = "YOUR_CONTAINER_NAME"
destination_path_w_name = "export/export.csv"
url = f"https://{blob_account_name}.blob.core.windows.net/{container}/{destination_path_w_name}?{sas}"

# read the content of the CSV we exported above
# if you are not on community edition, read from "/dbfs" + dest instead
payload = open(dest).read()

headers = {
    'x-ms-blob-type': 'BlockBlob',
    'Content-Type': 'text/csv'  # you can change the content type according to your needs
}

response = requests.request("PUT", url, headers=headers, data=payload)

# if response.status_code is 201 it means your file was created successfully
print(response.status_code)
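As an optional sanity check (a sketch; it assumes the SAS token also grants read permission), a HEAD request against the same URL returns the blob's properties once it exists:

# Sketch: verify the upload by fetching the blob's properties.
# Assumes the SAS token includes read (r) permission.
check = requests.head(url)
print(check.status_code, check.headers.get("Content-Length"))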

Follow this link to set up a SAS token

Keep in mind that anyone who gets hold of the SAS token can access your storage, to the extent of the permissions you set when creating it.

Excel export version of the code (using com.crealytics:spark-excel_2.12:0.14.0)

Saving the dataframe:

data = [
    ('a', 25, 'ast'),
    ('b', 15, 'phone'),
    ('c', 32, 'dlp'),
    ('d', 45, 'rare'),
    ('e', 60, 'phq')
]
columns = ["column1", "column2", "column3"]
sampleDF = spark.createDataFrame(data=data, schema=columns)
sampleDF.show()

# export
from os import path
folder = "export"
name = "export"
file_path_name_on_dbfs = path.join("/tmp", folder, name)

# Writing to DBFS
sampleDF.write.format("com.crealytics.spark.excel")\
.option("header", "true")\
.mode("overwrite")\
.save(file_path_name_on_dbfs + ".xlsx")

# excel
dest = file_path_name_on_dbfs + ".xlsx"
dbutils.fs.cp(dest, f"file://{dest}")  # community edition only: /dbfs is not recognized there, so copy the file to the driver's local filesystem
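As a quick verification (a sketch, using the same com.crealytics:spark-excel library that is already attached to the cluster), the file can be read back before uploading:

# Sketch: read the exported workbook back with the same library.
check_df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .load(file_path_name_on_dbfs + ".xlsx")
check_df.show()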

Code to upload the file to Azure Storage:

import requests

sas="YOUR_SAS_TOKEN_PREVIOUSLY_CREATED" # follow the link below to create SAS token (using sas is slightly more secure than raw key storage)
blob_account_name = "YOUR_BLOB_ACCOUNT_NAME"
container = "YOUR_CONTAINER_NAME"
destination_path_w_name = "export/export.xlsx"
url = f"https://{blob_account_name}.blob.core.windows.net/{container}/{destination_path_w_name}?{sas}"

# read the binary content of the xlsx we exported above (note the 'rb' mode, unlike the CSV version)
# if you are not on community edition, read from "/dbfs" + dest instead
payload = open(dest, 'rb').read()

headers = {
    'x-ms-blob-type': 'BlockBlob',
    'Content-Type': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'  # xlsx instead of text/csv
}

response = requests.request("PUT", url, headers=headers, data=payload)

# if response.status_code is 201 it means your file was created successfully
print(response.status_code)
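Since the CSV and Excel versions differ only in read mode and the Content-Type header, a small helper sketch like the one below (using Python's standard mimetypes module; the explicit add_type call is an assumption to make the xlsx mapping deterministic across platforms) lets a single upload function serve both paths:

import mimetypes

# Register the xlsx type explicitly; some systems lack this mapping by default.
mimetypes.add_type(
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xlsx")

def content_type_for(file_path):
    # Fall back to a generic binary type for unknown extensions.
    return mimetypes.guess_type(file_path)[0] or "application/octet-stream"

print(content_type_for("export/export.csv"))   # text/csv
print(content_type_for("export/export.xlsx"))  # the xlsx MIME type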

Regarding "pandas - Saving a Pandas or PySpark dataframe from Databricks to Azure Blob Storage", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/74912841/
