gpt4 book ai didi

hadoop - 如何减少基于 View 的查询处理时间?

转载 作者:可可西里 更新时间:2023-11-01 15:20:01 24 4
gpt4 key购买 nike

-我们有一个建立在许多 View 之上的主视图-几乎所有相关的集市数据都在这个 View 中,并且大多数作业都从这个 View 中提取数据。

-对该 View 的简单选择转换为 43 个映射和 37 个缩减器大约需要 1 小时,这会给集群带来很大的负载

以下是我尝试过的:

set hive.vectorized.execution.reduce.groupby.enabled=true;
set hive.exec.orc.split.strategy=BI;
set hive.merge.tezfiles=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
set hive.exec.compress.intermediate = true;

这里是查询:

CREATE VIEW `mstr_pub.vw_customer_info` AS 
SELECT
field 1, field2, field 3 .....
SUM(`PTS`.`current_balance`) OVER (PARTITION BY `voucher_cust`.`rel_child`) AS `HPTS`,
FLOOR(SUM(`PTS`.`c_balance`) OVER (PARTITION BY `voucher_cust`.`relationships_child`) / 10000)*1000 AS `HPTS_VAL`,
CAST(IF(UPPER(TRIM(`table1`.`pcplus`)) = 'ACKN', TRUE, FALSE) AS BOOLEAN) AS `NO_PC_PLUS`,

GET_JSON_OBJECT(`VIRTUAL`.`json`, '$.FIRST_EARN_DATE') AS `FIRST_EARN_DATE`,
... many such GET_JSON_OBJECT calculation fields...

FROM
`mstr_work`.`vw_voucher_cust_LATEST` `table1`
LEFT OUTER JOIN
`mstr_work`.`table2` `table2`
ON `table1`.`voucher_id` = `table2`.`voucher_id`
LEFT OUTER JOIN
`mstr_work`.`table3` `table3`
ON `table2`.`cust_id` = `table3`.`cust_id`
LEFT OUTER JOIN
(
SELECT
`table4`.`voucher_id`,
SUM(`table4`.`current_balance`) AS `CURRENT_BALANCE`
FROM
`mstr_work`.`table4`
WHERE
`table4`.`account_status` = 'ACTIVE'
GROUP BY
`table4`.`voucher_id`
)
`PTS`
ON `table1`.`voucher_ID` = `PTS`.`voucher_id`
LEFT OUTER JOIN
(
SELECT
`t`.`voucher_id`,
`t`.`capture_source`
FROM
(
SELECT
`table5`.`voucher_id`,
`table5`.`status_capture_source`,
ROW_NUMBER() OVER (PARTITION BY `table5`.`cust_id`
ORDER BY
`table5`.`status_capture_datetime` DESC) AS `RANK_`
FROM
`MSTR_CORE`.`table5`
WHERE
`table5`.`marketing_status` = 'COMMUNICATE'
AND `table5`.`capture_datetime` IS NOT NULL
)
`T`
WHERE
`t`.`rank_` = 1
)
`table6`
ON `table1`.`voucher_ID` = `table6`.`voucher_id`
LEFT OUTER JOIN
`mstr_work`.`table7` `table7`
ON `table2`.`cust_id` = `table7`.`cust_id`
AND `table7`.`notification_type` = 'SURVEYS'
AND `table2`.`business_effective_ts` = `table7`.`business_effective_ts`
LEFT OUTER JOIN
`mstr_work`.`table7` `table7`
ON `table2`.`cust_ID` = `table7`.`cust_id`
AND `table7`.`notification_type` = 'OFFERALERTS'
AND `table7`.`communication_type` = 'EMAIL'
AND `table2`.`BUSINESS_EFFECTIVE_TS` = `table7`.`business_effective_ts`
LEFT OUTER JOIN
`mstr_work`.`table8` `table8`
ON `table1`.`voucher_ID` = `table8`.`voucher_id`
AND `table8`.`rbc_customer` IS NOT NULL
LEFT OUTER JOIN
`mstr_work`.`table10` `table10`
ON `table2`.`cust_ID` = `table10`.`cust_id`
AND `table10`.`home_banner_id` = 'RETAIL'
LEFT OUTER JOIN
(
SELECT DISTINCT
`table9`.`zrks`,
`table9`.`zorg`,
`table9`.`zext`,
`table9`.`zweg`,
`table9`.`zrext`
FROM
`mstr_work`.`table9`
)
`table11`
ON `table10`.`store_id` = `table11`.`zrks`
LEFT OUTER JOIN
`mstr_work`.`table10` `table10`
ON `table2`.`cust_ID` = `table10`.`cust_id`
AND `table10`.`home_id` = 'SHOPPERS'
LEFT OUTER JOIN
`mstr_work`.`table11` `table12`
ON `table10`.`home_store_id` = `table12`.`org`
AND `table12`.`parent_type` = 'MKR'
AND `table12`.`relationship` = 'CONTAINS'
LEFT OUTER JOIN
`mstr_work`.`table13` as `table13`
ON `table1`.`relationships_child` = `table13`.`household_voucher_id`
AND UPPER(`table13`.`seg_attr`) = "SEG_ENTERPRISE"
LEFT OUTER JOIN
`mstr_work`.`table13` as `table13`
ON `table1`.`RELATIONSHIPS_CHILD` = `table13`.`household_voucher_id`
AND UPPER(`table13`.`seg_attr`) = "NATIONAL_PURCHASE"
LEFT OUTER JOIN
`mstr_work`.`table13` as `table13`
ON `table1`.`RELATIONSHIPS_CHILD` = `table13`.`household_voucher_id`
AND UPPER(`table13`.`seg_attr`) = "VALUE_SEG_SDM"
LEFT OUTER JOIN
(
SELECT
`table14`.`voucher_id`,
concat('{', concat_ws(',', collect_set(concat('"', `table14`.`attribute_name`, '":"', `table14`.`attribute_value`, '"'))), '}') `JSON`
FROM
`MSTR_CORE`.`table14`
GROUP BY
`table14`.`voucher_id`
)
`VIRTUAL`
ON `table1`.`voucher_ID` = `VIRTUAL`.`voucher_id`
WHERE
`table2`.`email_address` LIKE '%@%'
OR `table1`.`voucher_status` = "DELETED"

要求:

关于这个查询,我们能做些什么吗?我唯一能想到的是 - 简化在“where 子句”之前的最后一个表中设置的 json,并避免在“select”解析中进行 json 解析,以帮助减少几个 reducer。我看到相同的 table1,即 table1 在连接中多次使用。我怎样才能减少和合并连接或简化。有没有这样的选择

任何指针都会有很大的帮助

最佳答案

使用这样的方式对同一个表进行多个连接:

  LEFT OUTER JOIN
`mstr_work`.`table13` as `table13`
ON `table1`.`relationships_child` = `table13`.`household_voucher_id`
AND UPPER(`table13`.`seg_attr`) = "SEG_ENTERPRISE"
LEFT OUTER JOIN
`mstr_work`.`table13` as `table13`
ON `table1`.`RELATIONSHIPS_CHILD` = `table13`.`household_voucher_id`
AND UPPER(`table13`.`seg_attr`) = "NATIONAL_PURCHASE"
LEFT OUTER JOIN
`mstr_work`.`table13` as `table13`
ON `table1`.`RELATIONSHIPS_CHILD` = `table13`.`household_voucher_id`
AND UPPER(`table13`.`seg_attr`) = "VALUE_SEG_SDM"

您可以使用单连接轻松摆脱对同一个表的多个连接:

LEFT OUTER JOIN
`mstr_work`.`table13` as `table13`
ON `table1`.`RELATIONSHIPS_CHILD` = `table13`.`household_voucher_id`
AND UPPER(`table13`.`seg_attr`) in ("SEG_ENTERPRISE","NATIONAL_PURCHASE","VALUE_SEG_SDM")

当从 table13 中选择列时,使用 max()sum() 等带有 case 语句的聚合:

max(case when UPPER(`table13`.`seg_attr`) = SEG_ENTERPRISE then <some column> end) end as SEG_ENTERPRISE_data

table7table10也是如此

关于hadoop - 如何减少基于 View 的查询处理时间?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/55475322/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com