amazon-redshift - How can an AWS Glue job load multiple tables into Redshift?

Reposted by: 行者123  Updated: 2023-12-04 07:16:14

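Is it possible to load multiple tables into Redshift with a single AWS Glue job?

These are the steps I followed.

I crawled the JSON files in S3, which turned the data into a Data Catalog table.

I created a job that loads the Data Catalog table into Redshift, but it limits me to one table per job. In the job properties (under Add job), the option I chose for "This job runs" was: A proposed script generated by AWS Glue.

I am not familiar with Python and I am new to AWS Glue, but I have several tables to load.

Here is a sample script: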

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "sampledb", table_name = "abs", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("value", "int", "value", "int"), ("sex", "string", "sex", "string"), ("age", "string", "age", "string"), ("highest year of school completed", "string", "highest year of school completed", "string"), ("state", "string", "state", "string"), ("region type", "string", "region type", "string"), ("lga 2011", "string", "lga 2011", "string"), ("frequency", "string", "frequency", "string"), ("time", "string", "time", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("value", "int", "value", "int"), ("sex", "string", "sex", "string"), ("age", "string", "age", "string"), ("highest year of school completed", "string", "highest year of school completed", "string"), ("state", "string", "state", "string"), ("region type", "string", "region type", "string"), ("lga 2011", "string", "lga 2011", "string"), ("frequency", "string", "frequency", "string"), ("time", "string", "time", "string")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_cols", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_cols", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [catalog_connection = "redshift", connection_options = {"dbtable": "abs", "database": "dbmla"}, redshift_tmp_dir = TempDir, transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "redshift", connection_options = {"dbtable": "abs", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
job.commit()

AWS Glue database: sampledb

Table name in AWS Glue: abs

Redshift database: dbmla

Please provide an example of how to load them. Thanks!

Best answer

According to the AWS Glue FAQ, you can modify the generated code and run the job.

Q: How can I customize the ETL code generated by AWS Glue?

AWS Glue’s ETL script recommendation system generates Scala or Python code. It leverages Glue’s custom ETL library to simplify access to data sources as well as manage job execution. You can find more details about the library in our documentation. You can write ETL code using AWS Glue’s custom library or write arbitrary code in Scala or Python by using inline editing via the AWS Glue Console script editor, downloading the auto-generated code, and editing it in your own IDE. You can also start with one of the many samples hosted in our Github repository and customize that code.

So try adding the code snippets for the other tables to the same script, before the final job.commit(), as shown below:

# "abs2", "abs3" and "abs4" are placeholder table names; substitute your own catalog tables.
# Each block uses its own variable names and transformation_ctx values, and each step reads the frame produced by the step above it.
# mappings = [...] is left elided: fill in the column mappings for that table.
datasource_abs2 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs2", transformation_ctx = "datasource_abs2")
applymapping_abs2 = ApplyMapping.apply(frame = datasource_abs2, mappings = [...], transformation_ctx = "applymapping_abs2")
resolvechoice_abs2 = ResolveChoice.apply(frame = applymapping_abs2, choice = "make_cols", transformation_ctx = "resolvechoice_abs2")
dropnullfields_abs2 = DropNullFields.apply(frame = resolvechoice_abs2, transformation_ctx = "dropnullfields_abs2")
datasink_abs2 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields_abs2, catalog_connection = "redshift", connection_options = {"dbtable": "abs2", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink_abs2")

datasource_abs3 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs3", transformation_ctx = "datasource_abs3")
applymapping_abs3 = ApplyMapping.apply(frame = datasource_abs3, mappings = [...], transformation_ctx = "applymapping_abs3")
resolvechoice_abs3 = ResolveChoice.apply(frame = applymapping_abs3, choice = "make_cols", transformation_ctx = "resolvechoice_abs3")
dropnullfields_abs3 = DropNullFields.apply(frame = resolvechoice_abs3, transformation_ctx = "dropnullfields_abs3")
datasink_abs3 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields_abs3, catalog_connection = "redshift", connection_options = {"dbtable": "abs3", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink_abs3")

datasource_abs4 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "abs4", transformation_ctx = "datasource_abs4")
applymapping_abs4 = ApplyMapping.apply(frame = datasource_abs4, mappings = [...], transformation_ctx = "applymapping_abs4")
resolvechoice_abs4 = ResolveChoice.apply(frame = applymapping_abs4, choice = "make_cols", transformation_ctx = "resolvechoice_abs4")
dropnullfields_abs4 = DropNullFields.apply(frame = resolvechoice_abs4, transformation_ctx = "dropnullfields_abs4")
datasink_abs4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields_abs4, catalog_connection = "redshift", connection_options = {"dbtable": "abs4", "database": "dbmla"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink_abs4")

job.commit()

Change the variable names accordingly so that they are unique. Thanks.
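If copying the block once per table gets unwieldy, the same idea can probably be written as a loop. This is only a sketch under some assumptions: `load_tables` and its parameter names are made up for illustration, each Redshift table is assumed to take the same name as its catalog table, and the ApplyMapping and DropNullFields steps from the generated script are left out, so columns are written with the names and types the crawler inferred. It relies on the `resolveChoice` method of the DynamicFrame instead of the `ResolveChoice` transform class.

```python
def load_tables(glue_context, table_names, catalog_db, redshift_db,
                connection_name, temp_dir):
    """Copy each Data Catalog table into a same-named Redshift table.

    Hypothetical helper, not part of the Glue library. glue_context is the
    GlueContext created at the top of the generated script.
    """
    loaded = []
    for name in table_names:
        # A distinct transformation_ctx per table keeps the names unique,
        # which is what the answer asks for.
        frame = glue_context.create_dynamic_frame.from_catalog(
            database=catalog_db,
            table_name=name,
            transformation_ctx="datasource_" + name,
        )
        # Resolve ambiguous column types the same way the generated script does.
        frame = frame.resolveChoice(
            choice="make_cols",
            transformation_ctx="resolvechoice_" + name,
        )
        glue_context.write_dynamic_frame.from_jdbc_conf(
            frame=frame,
            catalog_connection=connection_name,
            connection_options={"dbtable": name, "database": redshift_db},
            redshift_tmp_dir=temp_dir,
            transformation_ctx="datasink_" + name,
        )
        loaded.append(name)
    return loaded
```

In the generated script this would be called between job.init(...) and job.commit(), for example load_tables(glueContext, ["abs", "abs2"], "sampledb", "dbmla", "redshift", args["TempDir"]), where "abs2" stands in for whatever other tables you have. If a table needs column renames or type casts, keep a per-table ApplyMapping step as in the blocks above.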

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/50459840/
