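I have an ETL job in which I want to append data from .csv files to an Impala table. Currently I do this by creating a temporary external .csv table over the new data (in .csv.lzo format) and then inserting from it into the main table.
The query I use looks like this: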
INSERT INTO TABLE main_table
PARTITION(yr, mth)
SELECT
*,
CAST(extract(ts, "year") AS SMALLINT) AS yr,
CAST(extract(ts, "month") AS TINYINT) AS mth
FROM csv_table
main_table is defined as follows (a few columns truncated):
CREATE TABLE IF NOT EXISTS main_table (
tid INT,
s1 VARCHAR,
s2 VARCHAR,
status TINYINT,
ts TIMESTAMP,
n1 DOUBLE,
n2 DOUBLE,
p DECIMAL(3,2),
mins SMALLINT,
temp DOUBLE
)
PARTITIONED BY (yr SMALLINT, mth TINYINT)
STORED AS PARQUET
Relevant part of the query plan:
F01:PLAN FRAGMENT [HASH(CAST(extract(ts, 'year') AS SMALLINT),CAST(extract(ts, 'month') AS TINYINT))] hosts=2 instances=2
| Per-Host Resources: mem-estimate=1.01GB mem-reservation=12.00MB thread-reservation=1
WRITE TO HDFS [default.main_table, OVERWRITE=false, PARTITION-KEYS=(CAST(extract(ts, 'year') AS SMALLINT),CAST(extract(ts, 'month') AS TINYINT))]
| partitions=unavailable
| mem-estimate=1.00GB mem-reservation=0B thread-reservation=0
|
02:SORT
| order by: CAST(extract(ts, 'year') AS SMALLINT) ASC NULLS LAST, CAST(extract(ts, 'month') AS TINYINT) ASC NULLS LAST
| materialized: CAST(extract(ts, 'year') AS SMALLINT), CAST(extract(ts, 'month') AS TINYINT)
| mem-estimate=12.00MB mem-reservation=12.00MB spill-buffer=2.00MB thread-reservation=0
| tuple-ids=1 row-size=1.29KB cardinality=unavailable
| in pipelines: 02(GETNEXT), 00(OPEN)
|
01:EXCHANGE [HASH(CAST(extract(ts, 'year') AS SMALLINT),CAST(extract(ts, 'month') AS TINYINT))]
| mem-estimate=2.57MB mem-reservation=0B thread-reservation=0
| tuple-ids=0 row-size=1.28KB cardinality=unavailable
| in pipelines: 00(GETNEXT)
|
Operator      #Hosts  Avg Time  Max Time  #Rows   Est. #Rows  Peak Mem  Est. Peak Mem  Detail
-----------------------------------------------------------------------------------------------
02:SORT       2       17m16s    30m50s    55.05M  -1          25.60 GB  12.00 MB
01:EXCHANGE   2       9s493ms   12s822ms  55.05M  -1          26.98 MB  2.90 MB        HASH(CAST(extract(ts, 'year') AS SMALLINT),CAST(extract(ts, 'month') AS TINYINT))
00:SCAN HDFS  2       51s958ms  1m10s     55.05M  -1          76.06 MB  704.00 MB      default.csv_table
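A query such as
SELECT COUNT(*) FROM csv_table WHERE extract(ts, "year") = 2018 AND extract(ts, "month") = 1
takes about 2-3 minutes, whereas the ORDER BY (performed as part of the insert) takes more than an hour. This example contains only the keys (2018,1) and (2018,2).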
Best answer
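Impala performs the sort because you are using dynamic partitioning. Impala does not handle dynamic partitioning well, especially on tables whose statistics have not been computed. I suggest using Hive for dynamically partitioned inserts. If you are not going to use Hive, my suggestion is: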
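-- Static partition insert for a known key: the partition values are fixed
-- in the PARTITION clause, so the SELECT list carries no partition columns.
INSERT INTO TABLE main_table
PARTITION(yr=2019, mth=2)
SELECT
*
FROM csv_table
WHERE CAST(extract(ts, "year") AS SMALLINT) = 2019
AND CAST(extract(ts, "month") AS TINYINT) = 2;

-- Dynamic partition insert for all remaining rows.
-- NOT (... AND ...) is the complement of the filter above, so no row is
-- inserted twice and rows from other months or years are not skipped.
INSERT INTO TABLE main_table
PARTITION(yr, mth)
SELECT
*,
CAST(extract(ts, "year") AS SMALLINT),
CAST(extract(ts, "month") AS TINYINT)
FROM csv_table
WHERE NOT (CAST(extract(ts, "year") AS SMALLINT) = 2019
AND CAST(extract(ts, "month") AS TINYINT) = 2);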
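The plan in the question reports cardinality=unavailable and an estimated row count of -1, which is how Impala shows a table that has no statistics. The following is a minimal sketch, not a verified fix: it gathers statistics on the staging table before the insert and lists the partition keys a batch actually contains, so that static PARTITION clauses can be chosen for them. The column aliases yr, mth and row_count are illustrative, and whether this shortens the sort on a given cluster is an assumption that should be checked against a fresh query profile.

```sql
-- Gather table and column statistics for the staging table.
COMPUTE STATS csv_table;

-- Check that the #Rows column no longer shows -1.
SHOW TABLE STATS csv_table;

-- List the (yr, mth) keys present in the current batch and their row counts,
-- to decide which partitions to load with a static PARTITION(yr=..., mth=...) clause.
SELECT
CAST(extract(ts, "year") AS SMALLINT) AS yr,
CAST(extract(ts, "month") AS TINYINT) AS mth,
COUNT(*) AS row_count
FROM csv_table
GROUP BY 1, 2
ORDER BY 1, 2;
```

For the batch described in the question, the last query would be expected to return only the keys (2018,1) and (2018,2), which suggests two static inserts could cover the whole load without a dynamic one.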
Regarding sql - Sorting on partition keys during INSERT INTO (Parquet) TABLE with Impala, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54783262/