
python - Saving Python data objects to a file in Google Storage from a pyspark job running in Dataproc

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:33:00

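I am collecting metrics while running a pyspark job on Dataproc, but I cannot save them to Google Storage (using only Python functions, not Spark).

The point is that I can save them, and during the execution I can read and modify them successfully, but when the job ends there is nothing in my Google Storage folder.

Is it possible to persist Python objects this way, or is this only possible using the pyspark libraries?

Edit: I am adding a code snippet to clarify the question.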

# Python
import pandas as pd

# Pyspark
from pyspark.sql import SparkSession

# Google storage filepath
filepath = 'gs://[PATH]/'

spark_session = SparkSession.builder.getOrCreate()

sdf = spark_session.createDataFrame([[1],[2],[3],[4],[5]], ['col'])
pdf = pd.DataFrame([1,2,3,4,5], columns=['col'])

# Save the pandas dataframe (THIS IS NOT PERFORMED IN MY BUCKET)
pdf.to_pickle(filepath + 'pickle.pkl' )

# Save the spark dataframe (THIS IS PERFORMED IN MY BUCKET)
sdf.write.csv(filepath + 'spark_dataframe.csv')

# Read the pickle (THIS WORKS, BUT ONLY DURING THIS JOB EXECUTION;
# I CANNOT ACCESS IT AFTERWARDS, maybe it is only in some temporary folder)
df_read = pd.read_pickle(filepath + 'pickle.pkl' )

Best answer

Following my previous comment, I modified your example to copy the pickle object to GCS:

# Python
import pandas as pd
from subprocess import call
from os.path import join

# Pyspark
from pyspark.sql import SparkSession

# Google storage filepath
filepath = 'gs://BUCKET_NAME/pickle/'
filename = 'pickle.pkl'

spark_session = SparkSession.builder.getOrCreate()

sdf = spark_session.createDataFrame([[1],[2],[3],[4],[5]], ['col'])
pdf = pd.DataFrame([1,2,3,4,5], columns=['col'])

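# Save the pandas dataframe locally (both folders must already exist)
pdf.to_pickle('./gsutil/' + filename)
pdf.to_pickle('./distcp/' + filename)

# Option 1: copy the local file (not the whole folder) to the bucket with gsutil
call(["gsutil", "-m", "cp", './gsutil/' + filename, join(filepath, filename)])

# Option 2: put the local folder into HDFS, then copy the file to the bucket with distcp
call(["hadoop", "fs", "-put", "./distcp/", "/user/test/"])
call(["hadoop", "distcp", "/user/test/distcp/" + filename, join(filepath, "distcp/" + filename)])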

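Also, make sure to create the necessary folders (local and HDFS) and to replace BUCKET_NAME with the correct value beforehand, so that the example works.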
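The local folder creation and the gsutil copy could be wrapped in one small helper. This is only a sketch, not part of the original answer: `stage_and_copy`, `local_dir`, and `runner` are hypothetical names introduced here, the standard-library `pickle` module stands in for `pdf.to_pickle`, and it assumes `gsutil` is on the PATH of the machine running the driver. It covers only the local side; the HDFS folder for the distcp variant would still have to be created separately.

```python
import os
import pickle
import subprocess
import tempfile


def stage_and_copy(obj, gcs_uri, local_dir=None, runner=subprocess.check_call):
    """Pickle `obj` to a local file, then copy that file to `gcs_uri` with gsutil.

    All names here are hypothetical and exist only for this sketch.
    """
    # Create the local staging folder up front so the write cannot
    # fail because of a missing directory.
    local_dir = local_dir or tempfile.mkdtemp(prefix="staging_")
    os.makedirs(local_dir, exist_ok=True)

    # Reuse the object name from the destination URI for the local file.
    local_path = os.path.join(local_dir, os.path.basename(gcs_uri))
    with open(local_path, "wb") as fh:
        pickle.dump(obj, fh)

    # With the default runner (subprocess.check_call), a non-zero exit
    # code from gsutil raises CalledProcessError instead of passing silently.
    command = ["gsutil", "cp", local_path, gcs_uri]
    runner(command)
    return local_path, command
```

Inside the job this might be called as `stage_and_copy(pdf, 'gs://BUCKET_NAME/pickle/pickle.pkl')`. Unlike `call`, which only returns the exit code, the default `check_call` runner makes a failed upload visible as an exception.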

Regarding python - Saving Python data objects to a file in Google Storage from a pyspark job running in Dataproc, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48684048/
