
avro - Parquet data timestamp column INT96 not yet implemented in Druid Overlord Hadoop task

Reposted. Author: 行者123. Updated: 2023-12-05 07:37:44

Context:

I am able to submit MapReduce jobs from the Druid overlord to EMR. My data source is Parquet on S3. The Parquet data has a timestamp column (INT96), which the Avro schema does not support.

Error while parsing the timestamp

The stack trace for the problem is:

Error: java.lang.IllegalArgumentException: INT96 not yet implemented.
at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264)
at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223)
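The failure comes from converting the Parquet schema to an Avro schema: Parquet's legacy INT96 timestamp is a raw 12-byte value with no logical-type annotation, so `AvroSchemaConverter` has no Avro type to map it to. For background, here is a minimal sketch (not part of Druid or Parquet) of how an INT96 timestamp is laid out, assuming the conventional encoding of 8 little-endian bytes of nanoseconds since midnight followed by 4 little-endian bytes of Julian day number:

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_EPOCH_DAY = 2440588  # Julian day number of 1970-01-01 (Unix epoch)

def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte Parquet INT96 timestamp.

    Layout: int64 little-endian nanoseconds since midnight,
    followed by int32 little-endian Julian day number.
    """
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_EPOCH_DAY
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days_since_epoch,
                        microseconds=nanos_of_day // 1000))

# Hypothetical sample value: midnight on 1970-01-02 UTC.
raw = struct.pack("<qi", 0, JULIAN_EPOCH_DAY + 1)
print(decode_int96_timestamp(raw))  # 1970-01-02 00:00:00+00:00
```

Because the meaning of those 12 bytes is purely a convention (there is no annotation in the file), a generic schema converter cannot safely map them, which is why the Avro path simply throws.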

Environment:

Druid version: 0.11
EMR version : emr-5.11.0
Hadoop version: Amazon 2.7.3

Druid ingestion JSON:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://s3_path"
      }
    },
    "dataSchema": {
      "dataSource": "parquet_test1",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2017-08-01T00:00:00/2017-08-02T00:00:00"]
      },
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "t",
            "format": "yyyy-MM-dd HH:mm:ss:SSS zzz"
          },
          "dimensionsSpec": {
            "dimensions": ["dim1", "dim2", "dim3"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }, {
        "type": "count",
        "name": "pid",
        "fieldName": "pid"
      }]
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 5000000
      },
      "jobProperties": {
        "mapreduce.job.user.classpath.first": "true",
        "fs.s3.awsAccessKeyId": "KEYID",
        "fs.s3.awsSecretAccessKey": "AccessKey",
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": "KEYID",
        "fs.s3n.awsSecretAccessKey": "AccessKey",
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec"
      },
      "leaveIntermediate": true
    }
  },
  "hadoopDependencyCoordinates": ["org.apache.hadoop:hadoop-client:2.7.3", "org.apache.hadoop:hadoop-aws:2.7.3", "com.hadoop.gplcompression:hadoop-lzo:0.4.20"]
}

Possible solutions

1. Write the Parquet data with a timestamp type other than INT96, so the data does not have to pass through the Avro conversion at all.

2. Fix the Avro schema converter to support Parquet's INT96 timestamp format.
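If the Parquet files are produced by Spark, option 1 can be as simple as changing the output timestamp type. This is a sketch, assuming Spark 2.3 or later (where `spark.sql.parquet.outputTimestampType` exists; older Spark versions always write INT96):

```shell
spark-submit \
  --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS \
  your_job.py
```

With `TIMESTAMP_MILLIS` (or `TIMESTAMP_MICROS`) the timestamps are written as annotated INT64 values, which the Avro schema converter can handle. `your_job.py` here is a placeholder for whatever job writes the data.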

Best answer

Druid 0.17.0 and later supports the Parquet INT96 type via the Parquet Hadoop Parser.

The Parquet Hadoop Parser supports int96 Parquet values, while the Parquet Avro Hadoop Parser does not. There may also be some subtle differences in the behavior of JSON path expression evaluation of flattenSpec.

https://druid.apache.org/docs/0.17.0/ingestion/data-formats.html#parquet-hadoop-parser
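On 0.17.0+, the key change is to use the Parquet Hadoop Parser (parser type "parquet") rather than the Parquet Avro Hadoop Parser (parser type "parquet-avro"). A minimal sketch of the relevant pieces, assuming the druid-parquet-extensions extension is loaded; the rest of the dataSchema is omitted for brevity:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://s3_path"
      }
    },
    "dataSchema": {
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "t",
            "format": "auto"
          }
        }
      }
    }
  }
}
```

Note that the class package changed from `io.druid` to `org.apache.druid` in newer Druid releases, so the `inputFormat` above differs from the one in the original spec.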

Regarding "avro - Parquet data timestamp column INT96 not yet implemented in Druid Overlord Hadoop task", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48366196/
