amazon-web-services - AWS Glue Bookmarks

How can I verify that my bookmarks are working? I find that when I run a job right after the previous run finishes, it still seems to take a long time. Why is that? I thought it would not re-read files it has already processed? The script looks like this:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://xxx-glue/testing-csv"], "recurse": True},
    format = "csv",
    format_options = {"withHeader": True},
    transformation_ctx = "inputGDF")

if bool(inputGDF.toDF().head(1)):
    print("Writing ...")
    inputGDF.toDF() \
        .drop("createdat") \
        .drop("updatedat") \
        .write \
        .mode("append") \
        .partitionBy(["querydestinationplace", "querydatetime"]) \
        .parquet("s3://xxx-glue/testing-parquet")
else:
    print("Nothing to write ...")

job.commit()

import boto3
glue_client = boto3.client('glue', region_name='ap-southeast-1')
glue_client.start_crawler(Name='xxx-testing-partitioned')

The logs look like this:
18/12/11 14:49:03 INFO Client: Application report for application_1544537674695_0001 (state: RUNNING)
18/12/11 14:49:03 DEBUG Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.2.72
ApplicationMaster RPC port: 0
queue: default
start time: 1544539297014
final status: UNDEFINED
tracking URL: http://ip-172-31-0-204.ap-southeast-1.compute.internal:20888/proxy/application_1544537674695_0001/
user: root
18/12/11 14:49:04 INFO Client: Application report for application_1544537674695_0001 (state: RUNNING)
18/12/11 14:49:04 DEBUG Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.2.72
ApplicationMaster RPC port: 0
queue: default
start time: 1544539297014
final status: UNDEFINED
tracking URL: http://ip-172-31-0-204.ap-southeast-1.compute.internal:20888/proxy/application_1544537674695_0001/
user: root
18/12/11 14:49:05 INFO Client: Application report for application_1544537674695_0001 (state: RUNNING)
18/12/11 14:49:05 DEBUG Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.2.72
ApplicationMaster RPC port: 0
queue: default
start time: 1544539297014
final status: UNDEFINED
tracking URL: http://ip-172-31-0-204.ap-southeast-1.compute.internal:20888/proxy/application_1544537674695_0001/
user: root
...

18/12/11 14:42:00 INFO NewHadoopRDD: Input split: s3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-15_2018-11-19.csv:0+1194081
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-14_2018-11-18.csv' for reading
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-15_2018-11-19.csv' for reading
18/12/11 14:42:00 INFO Executor: Finished task 89.0 in stage 0.0 (TID 89). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 92
18/12/11 14:42:00 INFO Executor: Running task 92.0 in stage 0.0 (TID 92)
18/12/11 14:42:00 INFO NewHadoopRDD: Input split: s3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-16_2018-11-20.csv:0+1137753
18/12/11 14:42:00 INFO Executor: Finished task 88.0 in stage 0.0 (TID 88). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 93
18/12/11 14:42:00 INFO Executor: Running task 93.0 in stage 0.0 (TID 93)
18/12/11 14:42:00 INFO NewHadoopRDD: Input split: s3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-17_2018-11-21.csv:0+1346626
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-16_2018-11-20.csv' for reading
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-17_2018-11-21.csv' for reading
18/12/11 14:42:00 INFO Executor: Finished task 90.0 in stage 0.0 (TID 90). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO Executor: Finished task 91.0 in stage 0.0 (TID 91). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 94
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 95
18/12/11 14:42:00 INFO Executor: Running task 95.0 in stage 0.0 (TID 95)
18/12/11 14:42:00 INFO Executor: Running task 94.0 in stage 0.0 (TID 94)

... and I notice that the parquet output contains a lot of duplicated data ... Is the bookmark not working? It is already enabled.

Best Answer

Bookmark requirements
From the docs:

  • The job must be created with --job-bookmark-option job-bookmark-enable (or, if using the console, in the console options). The job must also have a job name; this is passed in automatically.
  • The job must start with job.init(jobname), e.g.:

    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

  • The job must have job.commit() to save the bookmark state and finish successfully.
  • The data source must be S3 or JDBC (the JDBC support is limited and not your use case, so I will ignore it).
  • The example in the docs shows creating a dynamic frame from the (Glue/Lake Formation) catalog using a table name rather than an explicit S3 path. This implies that reading from the catalog still counts as an S3 source; the underlying files will be on S3.
  • The files on S3 must be one of JSON, CSV, Apache Avro, or XML for versions 0.9 and above, or Parquet or ORC for versions 1.0 and above.
  • The data source in the script must have a transformation_ctx parameter. The docs say:

    pass the transformation_ctx parameter only to those methods that you want to enable bookmarks

    You could add this to every transform for saving state, but the critical one(s) are the datasource(s) you want to bookmark. Putting these together gives the sketch below.
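A minimal sketch of a bookmark-enabled job, assuming the job was created (or started) with --job-bookmark-option job-bookmark-enable; the paths are the ones from your question:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# JOB_NAME is passed in automatically; the bookmark state is keyed on it
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # restores any existing bookmark state

# transformation_ctx is what ties this datasource to the bookmark
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://xxx-glue/testing-csv"], "recurse": True},
    format = "csv",
    format_options = {"withHeader": True},
    transformation_ctx = "inputGDF")

# ... transforms and the write go here ...

job.commit()  # persists the new bookmark state; without this, nothing is saved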


Troubleshooting
From the docs:

  • Max concurrency must be 1. Higher values break the bookmarks.
  • It also mentions job.commit() and using the transformation_ctx, as above.

  • For Amazon S3 input sources, job bookmarks check the last modified time of the objects, rather than the file names, to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.
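One way to check whether that last point applies is to compare the LastModified timestamps of your input objects against the start time of your previous run; a quick sketch (the bucket and prefix are taken from your question):

import boto3

s3 = boto3.client('s3', region_name='ap-southeast-1')

# List the input objects with their LastModified timestamps; anything
# modified after your previous run will be picked up again by the bookmark.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='xxx-glue', Prefix='testing-csv/'):
    for obj in page.get('Contents', []):
        print(obj['LastModified'], obj['Key'])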


Other things to check

  • Have you verified that the CSV files in the path "s3://xxx-glue/testing-csv" don't already contain duplicates? You could create a table over them with a Glue crawler, or by writing DDL in Athena, and query them directly. Alternatively, create a dev endpoint and run a Zeppelin or SageMaker notebook and step through your code (see the first sketch after this list).
  • Nowhere is it mentioned that editing the script resets your state; however, if you have modified the transformation_ctx of the datasource or of other stages, that may well affect the state, though I have not verified this. The job has a Jobname that keys the state, along with a run number, attempt number, and version number that are used to manage retries and the latest state. This implies that minor changes to the script should not affect the state as long as the Jobname is consistent, but again I have not verified that.
  • As an aside, in your code you test inputGDF.toDF().head(1) and then run inputGDF.toDF()... to write the data. Spark is lazily evaluated, but as written you convert the dynamic frame to an equivalent dataframe twice, and Spark cannot cache or reuse it. Better to do df = inputGDF.toDF() once, before the if, and then reuse df in both places (second sketch below).
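On the first point, a sketch of a duplicate check you could run from a notebook; key_cols is a hypothetical natural key guessed from your partition columns, so adjust it to your schema:

import pyspark.sql.functions as F
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the source CSVs the same way the job does
src = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://xxx-glue/testing-csv"], "recurse": True},
    format = "csv",
    format_options = {"withHeader": True}).toDF()

# key_cols is an assumed natural key; replace with whatever uniquely identifies a row
key_cols = ["querydestinationplace", "querydatetime"]
dupes = src.groupBy(key_cols).agg(F.count("*").alias("n")).filter("n > 1")
print("duplicate keys in the source CSVs:", dupes.count())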
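And on the last point, a minimal restructuring of your write, converting once and reusing the dataframe:

df = inputGDF.toDF()  # convert once
# df.cache() here would also avoid recomputing the source for the head() check

if bool(df.head(1)):
    print("Writing ...")
    df.drop("createdat") \
      .drop("updatedat") \
      .write \
      .mode("append") \
      .partitionBy(["querydestinationplace", "querydatetime"]) \
      .parquet("s3://xxx-glue/testing-parquet")
else:
    print("Nothing to write ...")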
Original question on Stack Overflow: https://stackoverflow.com/questions/53726787/
