amazon-web-services - AWS Glue: How to add a column with the source filename in the output?


Reposted by: 行者123 Updated: 2023-12-04 18:56:59

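Does anyone know of a way to add the source filename as a column in a Glue job?

We created a flow in which we crawl some files in S3 to create a schema. We then wrote a job that transforms the files into a new format and writes them back to another S3 bucket as CSV, to be used by the rest of our pipeline. What we would like to do is access some kind of job meta-property so that we can add a new column, containing the original filename, to the output file.

I looked through the AWS documentation and the aws-glue-libs source code, but nothing jumped out. Ideally there would be some way to get this metadata from the awsglue.job package (we are using the Python flavor).

I am still learning Glue, so apologies if I am using the wrong terminology. I have also tagged this with the spark tag, because I believe that is what Glue uses under the hood.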

Best answer

Starting from the Python script that AWS Glue auto-generates, I added the following lines:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import input_file_name

## Add a column holding the full path of the file each row was read from
datasource1 = datasource0.toDF().withColumn("input_file_name", input_file_name())

## Convert the DataFrame back to a DynamicFrame (fromDF is a class method)
datasource2 = DynamicFrame.fromDF(datasource1, glueContext, "datasource2")

Then, in the ApplyMapping or datasink section of the code, reference datasource2 instead of datasource0.
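Note that input_file_name() reports the full path of each source file (for S3 input, something like s3://bucket/prefix/file.csv), while the question asks for the original filename. The following is a minimal sketch, not part of the original answer, of how that path could be reduced to its last component. The helper name source_file_name is hypothetical, and the percent-decoding step assumes Spark may URL-encode special characters such as spaces in the reported path.

```python
from posixpath import basename
from urllib.parse import unquote, urlparse


def source_file_name(uri: str) -> str:
    """Return the last path component of a file URI such as s3://bucket/dir/file.csv.

    Hypothetical helper for illustration; not part of awsglue or pyspark.
    """
    # urlparse separates the scheme and bucket (netloc) from the object key (path).
    path = urlparse(uri).path
    # Assumption: the path may be percent-encoded (e.g. %20 for a space), so decode it.
    return basename(unquote(path))


if __name__ == "__main__":
    print(source_file_name("s3://my-bucket/raw/2023/orders.csv"))
```

Inside the job, a helper like this could be wrapped in a Spark UDF and applied to the input_file_name column. If no decoding is needed, substring_index(input_file_name(), "/", -1) from pyspark.sql.functions should give the same trimming without the overhead of a Python UDF.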

A similar question to "AWS Glue: How to add a column with the source filename in the output?" can be found on Stack Overflow: https://stackoverflow.com/questions/50277563/
