
How to write a dataframe to an S3 bucket with PySpark, but without using Hadoop




I want to write a dataframe directly to an S3 bucket with PySpark, but I do not want to use Hadoop in any way. Not a single mention of Hadoop should be required in the Python or PySpark code.


from pyspark.sql import SparkSession

aws_access_key_id = 'ABC'
aws_secret_access_key = 'XYZ'
region_name = 'ap-south-1'
bucket_name = 'integration'
folder_name = 'NGETL-POC'

# Initialize the Spark session.
# Note: hadoop-aws and a matching AWS SDK jar must be on the classpath for
# the s3a:// scheme to work. SimpleAWSCredentialsProvider is used here
# because explicit access/secret keys are supplied; IAMInstanceCredentialsProvider
# would ignore them.
spark = SparkSession.builder.appName('temp1') \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.access.key", aws_access_key_id) \
    .config("spark.hadoop.fs.s3a.secret.key", aws_secret_access_key) \
    .config("spark.hadoop.fs.s3a.endpoint", f"s3.{region_name}.amazonaws.com") \
    .getOrCreate()

data1 = [(1, 'abc', 'A', 1), (2, 'pqr', 'B', 2), (3, 'efg', 'C', 4), (5, 'xyz', 'D', 6)]
fileHeadersColumns = ['student_id', 'st_name', 'st_class', 'st_roll_no']
df = spark.createDataFrame(data1, fileHeadersColumns)
df.show()

data2 = [(1, 'Maths', 50), (2, 'English', 60), (1, 'English', 70), (3, 'English', 80), (4, 'English', 40), (2, 'Maths', 60), (3, 'Maths', 70), (4, 'Maths', 80)]
redisColumns = ['student_id', 'subject', 'Marks']
df1 = spark.createDataFrame(data2, redisColumns)
df1.show()

joinedDf = df.join(df1, on="student_id", how="inner")
joinedDf.show()
print("file uploaded on S3")

# Write the DataFrame to S3 as a CSV
output_path = f"s3a://{bucket_name}/{folder_name}/data_1.csv"
joinedDf.write \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("quoteAll", "true") \
    .csv(output_path)
print("file uploaded on S3 post")

"""here I am using hadoop in config section{config("spark.hadoop.fs.s3a.aws.credentials.provider},I just only want to write this Dataframe (joinedDf) on s3 bucket without using hadoop.kindly, provide the solution as soon as possible."""

“这里我在配置section{config(”spark.hadoop.fs.s3a.aws.credentials.provider},中使用hadoop,我只想在S3存储桶上写入此数据帧(JoinedDf),而不使用hadoop。请尽快提供解决方案。“


More comments

idownvotedbecau.se/noresearch


What are you trying, or what have you tried? See this SO Q&A.

@cruzlorite: sometimes I wonder whether it is the language barrier that is holding them back.

Kindly write the Python code using PySpark. We need to write the DataFrame to an S3 bucket without using Hadoop.


The language barrier has been resolved, @cruzlorite. Kindly resolve the issue.

Recommended answer

Straightforward.



  1. Write a complete replacement for the s3a connector in your language of choice. 1 week, ignoring tests.



  2. Spark's file output code does use the Hadoop filesystem APIs, so you will need hadoop-common on the classpath unless you replace that too. The full specification is online, as are the compliance tests. 2-3 weeks to get the tests to pass, unless you try to replace the Spark writer, which will take longer.



  3. You also need code to commit the output in the presence of worker failure, knowing that directory renames are non-atomic and file renames are slow. The EMR and S3A committers both use multipart uploads: workers write to the final destination and propagate the upload information to the Spark driver, which then completes the upload at job commit. See "a zero rename committer" for details, and the config sketch below. 4+ weeks.
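
For reference, switching to those S3A committers is a configuration exercise rather than new code. Below is a minimal sketch, not from the original answer: the hadoop-aws version is illustrative, the config keys are version-dependent, and Spark must ship with the hadoop-cloud integration module for the commit-protocol classes to exist.

from pyspark.sql import SparkSession

# Illustrative only: pull hadoop-aws onto the classpath and switch the job
# commit path to the S3A "magic" committer, which commits via S3 multipart
# uploads instead of directory renames.
spark = SparkSession.builder.appName("s3a-magic-committer") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.committer.name", "magic") \
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true") \
    .config("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol") \
    .config("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter") \
    .getOrCreate()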




Let us know how you get on.



Your other option is: write to a shared filesystem and then have the Spark driver use the S3 command-line tools to upload afterwards. Again, your homework.
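
To make that second option concrete, here is a minimal sketch of the driver-side upload, with boto3 standing in for the S3 command-line tools the answer mentions. The local path is illustrative and the credential, bucket, and folder names are reused from the question; Spark's local write still goes through hadoop-common, but the upload itself never touches the s3a connector.

import glob
import boto3

# Write the joined DataFrame to a local/shared filesystem first; coalesce(1)
# keeps the small result in a single part file.
local_dir = "/tmp/joined_csv"  # illustrative path
joinedDf.coalesce(1).write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv(local_dir)

# Then have the driver upload the part file with boto3.
part_file = glob.glob(f"{local_dir}/part-*.csv")[0]
s3 = boto3.client(
    "s3",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name,
)
s3.upload_file(part_file, bucket_name, f"{folder_name}/data_1.csv")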


More comments

I didn't understand how to do it. Here is my code. I have created a CSV file on an S3 bucket using 's3a', but I can't see the details. Can you help me with using S3 so that I can read the data inside the file? Getting the error "Py4JJavaError: An error occurred while calling o5082.csv." when trying to save to the CSV file.

Sorry, learning how to debug someone else's open-source code is a core part of modern open-source development. All the s3a code is published in JAR files; any of the IDEs will download it. Oh, and the s3a code is Hadoop code.

Thanks, Steve, for your invaluable assistance. However, I'm still grappling with the same question: the inability to upload a CSV to an S3 bucket (using S3). Can we conclude that there is no scheme in S3 for uploading a CSV? I prefer not to use s3a (Hadoop). Is there any method for uploading a CSV to S3?

"I prefer not to use s3a (Hadoop)". well, you either use our or reimplement from scratch. your call

“我不喜欢使用S3A(Hadoop)”。那么,您要么使用我们的,要么从头开始重新实现。您的电话
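
For a small result that fits on the driver, one possible way to answer that last question without going through the s3a connector is to collect the DataFrame with toPandas and push the bytes with boto3. This is a sketch only, not from the thread; it assumes pandas and boto3 are installed and reuses the credential, bucket, and folder names from the question.

import boto3

# Collect the small joined result onto the driver and serialize it with pandas.
csv_body = joinedDf.toPandas().to_csv(index=False).encode("utf-8")

# Upload the CSV body directly with boto3; the s3a connector is not involved here.
boto3.client(
    "s3",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name,
).put_object(
    Bucket=bucket_name,
    Key=f"{folder_name}/data_1.csv",
    Body=csv_body,
)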
