
aws-glue - AWS Glue: Crawler does not recognize Timestamp columns in CSV format

Reposted. Author: 行者123. Updated: 2023-12-03 14:38:51

When running an AWS Glue crawler, it fails to recognize timestamp columns.

I have correctly formatted ISO8601 timestamps in my CSV file. First, I expected Glue to automatically classify these as timestamps, but it does not.

I also tried a custom timestamp classifier from this link: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html

Here is what my classifier looks like:

(image: grok classifier)

This did not classify my timestamps correctly either.

For example, I have put my data into the grok debugger ( https://grokdebug.herokuapp.com/ ):

id,iso_8601_now,iso_8601_yesterday
0,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
1,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056

and it matches on both of the following patterns:

%{TIMESTAMP_ISO8601:timestamp}

%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
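As a quick local check, a hand-expanded Python approximation of that grok pattern (my own translation of it, not the pattern Glue actually compiles) also matches both sample values:

```python
import re

# Hand-expanded approximation of grok's TIMESTAMP_ISO8601:
# YEAR-MONTHNUM-MONTHDAY[T ]HOUR:?MINUTE(:?SECOND)?TIMEZONE?
ISO8601 = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:?\d{2}(?::?\d{2}(?:\.\d+)?)?"
    r"(?:Z|[+-]\d{2}:?\d{2})?"
)

for sample in ("2019-05-16T22:47:33.409056", "2019-05-15T22:47:33.409056"):
    print(sample, "->", "match" if ISO8601.fullmatch(sample) else "no match")
```

So the pattern itself is not the problem; the values match, yet the crawler still reports strings.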
import csv
from datetime import datetime, timedelta

with open("timestamp_test.csv", 'w', newline='') as f:
    w = csv.writer(f, delimiter=',')

    w.writerow(["id", "iso_8601_now", "iso_8601_yesterday"])

    for i in range(1000):
        w.writerow([i, datetime.utcnow().isoformat(), (datetime.utcnow() - timedelta(days=1)).isoformat()])

I expect AWS Glue to automatically classify the iso_8601 columns as timestamps. Even when adding the custom grok classifier, it still does not classify either column as a timestamp.

Both columns are classified as strings.

The classifier is active on the crawler:

(image: classifier active)

Crawler output for the timestamp_test table:
{
  "StorageDescriptor": {
    "cols": {
      "FieldSchema": [
        {
          "name": "id",
          "type": "bigint",
          "comment": ""
        },
        {
          "name": "iso_8601_now",
          "type": "string",
          "comment": ""
        },
        {
          "name": "iso_8601_yesterday",
          "type": "string",
          "comment": ""
        }
      ]
    },
    "location": "s3://REDACTED/_csv_timestamp_test/",
    "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "compressed": "false",
    "numBuckets": "-1",
    "SerDeInfo": {
      "name": "",
      "serializationLib": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
      "parameters": {
        "field.delim": ","
      }
    },
    "bucketCols": [],
    "sortCols": [],
    "parameters": {
      "skip.header.line.count": "1",
      "sizeKey": "58926",
      "objectCount": "1",
      "UPDATED_BY_CRAWLER": "REDACTED",
      "CrawlerSchemaSerializerVersion": "1.0",
      "recordCount": "1227",
      "averageRecordSize": "48",
      "CrawlerSchemaDeserializerVersion": "1.0",
      "compressionType": "none",
      "classification": "csv",
      "columnsOrdered": "true",
      "areColumnsQuoted": "false",
      "delimiter": ",",
      "typeOfData": "file"
    },
    "SkewedInfo": {},
    "storedAsSubDirectories": "false"
  },
  "parameters": {
    "skip.header.line.count": "1",
    "sizeKey": "58926",
    "objectCount": "1",
    "UPDATED_BY_CRAWLER": "REDACTED",
    "CrawlerSchemaSerializerVersion": "1.0",
    "recordCount": "1227",
    "averageRecordSize": "48",
    "CrawlerSchemaDeserializerVersion": "1.0",
    "compressionType": "none",
    "classification": "csv",
    "columnsOrdered": "true",
    "areColumnsQuoted": "false",
    "delimiter": ",",
    "typeOfData": "file"
  }
}

Best answer

According to the CREATE TABLE doc, the expected timestamp format is yyyy-mm-dd hh:mm:ss[.f...].
If you must use the ISO8601 format, add this SerDe parameter: 'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS'
You can either alter the table from Glue (1) or recreate it from Athena (2):

  • Glue console > Tables > Edit table > add the parameter above to the SerDe parameters. You will also need to click "Edit schema" and change the data types from string to timestamp.
  • Delete the table from Athena and run:

    CREATE EXTERNAL TABLE `table1`(
      `id` bigint,
      `iso_8601_now` timestamp,
      `iso_8601_yesterday` timestamp)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'field.delim' = ',',
      'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS')
    LOCATION
      's3://REDACTED/_csv_timestamp_test/'
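As a rough local sanity check (the mapping from the Java-style pattern in timestamp.formats to Python's strptime format is my own assumption, not something Hive guarantees), the sample values from the question do parse under an equivalent format string:

```python
from datetime import datetime

# Assumed Python equivalent of the Hive/Java pattern
# yyyy-MM-dd'T'HH:mm:ss.SSSSSS used in timestamp.formats above.
PY_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

for value in ("2019-05-16T22:47:33.409056", "2019-05-15T22:47:33.409056"):
    parsed = datetime.strptime(value, PY_FORMAT)
    print(value, "->", parsed.isoformat())
```

This is a quick way to confirm your data actually conforms to the format string before editing the table, since a mismatched pattern would leave the columns unparseable in Athena.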

Regarding aws-glue - AWS Glue: Crawler does not recognize Timestamp columns in CSV format, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56177686/
