
How to write a dataframe to an S3 bucket with PySpark, but without using Hadoop




I want to write a dataframe directly to an S3 bucket with PySpark, but I do not want to use Hadoop in any way. Not a single mention of Hadoop should be required in the Python or PySpark code.


from pyspark.sql import SparkSession

aws_access_key_id = 'ABC'
aws_secret_access_key = 'XYZ'
region_name = 'ap-south-1'
bucket_name = 'integration'
folder_name = 'NGETL-POC'

# Initialize the Spark session.
# Note: hadoop-aws and a matching AWS SDK jar must be on the classpath for
# the s3a:// scheme to work. SimpleAWSCredentialsProvider is used here
# because explicit access/secret keys are supplied; IAMInstanceCredentialsProvider
# would ignore them.
spark = SparkSession.builder.appName('temp1') \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.access.key", aws_access_key_id) \
    .config("spark.hadoop.fs.s3a.secret.key", aws_secret_access_key) \
    .config("spark.hadoop.fs.s3a.endpoint", f"s3.{region_name}.amazonaws.com") \
    .getOrCreate()

data1 = [(1, 'abc', 'A', 1), (2, 'pqr', 'B', 2), (3, 'efg', 'C', 4), (5, 'xyz', 'D', 6)]
fileHeadersColumns = ['student_id', 'st_name', 'st_class', 'st_roll_no']
df = spark.createDataFrame(data1, fileHeadersColumns)
df.show()

data2 = [(1, 'Maths', 50), (2, 'English', 60), (1, 'English', 70), (3, 'English', 80), (4, 'English', 40), (2, 'Maths', 60), (3, 'Maths', 70), (4, 'Maths', 80)]
redisColumns = ['student_id', 'subject', 'Marks']
df1 = spark.createDataFrame(data2, redisColumns)
df1.show()

joinedDf = df.join(df1, on="student_id", how="inner")
joinedDf.show()
print("file uploaded on S3")

# Write the DataFrame to S3 as a CSV
output_path = f"s3a://{bucket_name}/{folder_name}/data_1.csv"
joinedDf.write \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("quoteAll", "true") \
    .csv(output_path)
print("file uploaded on S3 post")

"""here I am using hadoop in config section{config("spark.hadoop.fs.s3a.aws.credentials.provider},I just only want to write this Dataframe (joinedDf) on s3 bucket without using hadoop.kindly, provide the solution as soon as possible."""

“这里我在配置section{config(”spark.hadoop.fs.s3a.aws.credentials.provider},中使用hadoop,我只想在S3存储桶上写入此数据帧(JoinedDf),而不使用hadoop。请尽快提供解决方案。“


More comments

idownvotedbecau.se/noresearch


What are you trying, or what have you tried? See this SO Q&A.

@cruzlorite: sometimes I wonder whether it is the language barrier that is holding them back.

Kindly write the Python code using PySpark. We need to write the DataFrame to an S3 bucket without using Hadoop.


The language barrier has been resolved, @cruzlorite. Kindly resolve the issue.

Recommended answer

Straightforward.



  1. Write a complete replacement for the s3a connector in your language of choice. 1 week, ignoring tests.



  2. Spark's file output code does use the Hadoop filesystem APIs, so you will need hadoop-common on the classpath unless you replace that too. The full specification is online, as are the compliance tests. 2-3 weeks to get the tests to pass, unless you try to replace the Spark writer, which will take longer.



  3. You also need code to commit the output in the presence of worker failure, knowing that directory renames are non-atomic and file renames are slow. The EMR and S3A committers both use multipart uploads: workers write to the final destination and propagate the upload information to the Spark driver, which then completes the upload at job commit. See "a zero rename committer" for details, and the config sketch below. 4+ weeks.
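
For reference, switching to those S3A committers is a configuration exercise rather than new code. Below is a minimal sketch, not from the original answer: the hadoop-aws version is illustrative, the config keys are version-dependent, and Spark must ship with the hadoop-cloud integration module for the commit-protocol classes to exist.

from pyspark.sql import SparkSession

# Illustrative only: pull hadoop-aws onto the classpath and switch the job
# commit path to the S3A "magic" committer, which commits via S3 multipart
# uploads instead of directory renames.
spark = SparkSession.builder.appName("s3a-magic-committer") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.committer.name", "magic") \
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true") \
    .config("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol") \
    .config("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter") \
    .getOrCreate()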




Let us know how you get on.



Your other option is: write to a shared filesystem and then have the Spark driver use the S3 command-line tools to upload afterwards. Again, your homework.
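
To make that second option concrete, here is a minimal sketch of the driver-side upload, with boto3 standing in for the S3 command-line tools the answer mentions. The local path is illustrative and the credential, bucket, and folder names are reused from the question; Spark's local write still goes through hadoop-common, but the upload itself never touches the s3a connector.

import glob
import boto3

# Write the joined DataFrame to a local/shared filesystem first; coalesce(1)
# keeps the small result in a single part file.
local_dir = "/tmp/joined_csv"  # illustrative path
joinedDf.coalesce(1).write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv(local_dir)

# Then have the driver upload the part file with boto3.
part_file = glob.glob(f"{local_dir}/part-*.csv")[0]
s3 = boto3.client(
    "s3",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name,
)
s3.upload_file(part_file, bucket_name, f"{folder_name}/data_1.csv")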


More comments

I didn't understand how to do it. Here is my code. I have created a CSV file on an S3 bucket using 's3a', but I can't see the details. Can you help me with using S3 so that I can read the data inside the file? Getting the error "Py4JJavaError: An error occurred while calling o5082.csv." when trying to save to the CSV file.

Sorry, learning how to debug someone else's open-source code is a core part of modern open-source development. All the s3a code is published in JAR files; any of the IDEs will download it. Oh, and the s3a code is Hadoop code.

Thanks, Steve, for your invaluable assistance. However, I'm still grappling with the same question: the inability to upload a CSV to an S3 bucket (using S3). Can we conclude that there is no scheme in S3 for uploading a CSV? I prefer not to use s3a (Hadoop). Is there any method for uploading a CSV to S3?

"I prefer not to use s3a (Hadoop)". well, you either use our or reimplement from scratch. your call

“我不喜欢使用S3A(Hadoop)”。那么,您要么使用我们的,要么从头开始重新实现。您的电话
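
For a small result that fits on the driver, one possible way to answer that last question without going through the s3a connector is to collect the DataFrame with toPandas and push the bytes with boto3. This is a sketch only, not from the thread; it assumes pandas and boto3 are installed and reuses the credential, bucket, and folder names from the question.

import boto3

# Collect the small joined result onto the driver and serialize it with pandas.
csv_body = joinedDf.toPandas().to_csv(index=False).encode("utf-8")

# Upload the CSV body directly with boto3; the s3a connector is not involved here.
boto3.client(
    "s3",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name,
).put_object(
    Bucket=bucket_name,
    Key=f"{folder_name}/data_1.csv",
    Body=csv_body,
)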
