
sql - Postgres: group by and extract the groups with more than one element

Reposted. Author: 行者123. Updated: 2023-11-29 13:30:31

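I want to "group by" two unicode fields (keyword_text and keyword_match_type) and extract all columns and all rows of the groups that have more than one element, that is, at least two rows.

For example, one row is: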

keyword_text | keyword_norm | keyword_GAD_id| keyword_account  | keyword_MCC_id | keyword_campaign | keyword_campaign_GAD_id | keyword_ad_group | keyword_ad_group_GAD_id| keyword_destination_url | keyword_max_cpc | keyword_status | keyword_match_type | keyword_campaign_status | keyword_ad_group_status | db_id | created_at |
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
"lebanese home delivery jai", "lebanese home delivery jai", 61557127036, "IN [S_02] Cuisine", 7795189055, "IN-JAI[S[Cui_30_EN]: Lebanese", 301573516, "IN-JAI[S[Cui_30_EN|del_02|geo_01]_ex: (Lebanese) Lebanese home delivery Jaipur", 11043049036, http://www.bla.in/restaurants/index/cuisines/lebanese/city/jaipur, 480000, ENABLED, EXACT, PAUSED, PAUSED, 1, "2014-07-18 18:42:43"

The table is created with:

CREATE TABLE adword_keywords
(
keyword_text character varying(1000) NOT NULL,
keyword_norm character varying(1000) NOT NULL,
"keyword_GAD_id" bigint NOT NULL,
keyword_account character varying NOT NULL,
"keyword_MCC_id" bigint NOT NULL,
keyword_campaign character varying NOT NULL,
"keyword_campaign_GAD_id" bigint NOT NULL,
keyword_ad_group character varying NOT NULL,
"keyword_ad_group_GAD_id" bigint NOT NULL,
keyword_destination_url character varying NOT NULL,
keyword_max_cpc double precision,
keyword_status keyword_status,
keyword_match_type match_type,
keyword_campaign_status keyword_c_status,
keyword_ad_group_status keyword_ag_status,
db_id bigserial NOT NULL,
created_at timestamp without time zone,
CONSTRAINT adword_keywords_pkey PRIMARY KEY (db_id)
)
WITH (
OIDS=FALSE
);

CREATE INDEX ix_adword_keywords_keyword_norm
ON adword_keywords
USING btree
(keyword_norm COLLATE pg_catalog."default");

I tried the following query:

SELECT adword_keywords.*
FROM adword_keywords
JOIN (
SELECT adword_keywords.keyword_text AS keyword_text,adword_keywords.keyword_match_type AS keyword_match_type
FROM adword_keywords GROUP BY adword_keywords.keyword_text, adword_keywords.keyword_match_type
HAVING count(adword_keywords.db_id) > 1) AS anon_1
ON adword_keywords.keyword_text = anon_1.keyword_text AND adword_keywords.keyword_match_type = anon_1.keyword_match_type
WHERE adword_keywords.keyword_campaign_status = 'ENABLED' AND adword_keywords.keyword_ad_group_status = 'ENABLED' AND adword_keywords.keyword_status = 'ENABLED'

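Unfortunately, this returns wrong results: the output also contains groups that consist of a single element (when grouping by ['keyword_text', 'keyword_match_type'])!

Does anybody know what is wrong with this query?

Note that if I pull all the data from the database into a pandas data structure using the following query: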

SELECT * FROM adword_keywords
WHERE adword_keywords.keyword_campaign_status = 'ENABLED'
AND adword_keywords.keyword_ad_group_status = 'ENABLED'
AND adword_keywords.keyword_status = 'ENABLED'

I can filter the groups I want like this:

df.groupby(['keyword_text', 'keyword_match_type']).filter(lambda x: x.shape[0] > 1)

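The latter procedure returns the correct results.

However, for performance and memory reasons, I would like to do the same thing in an SQL query (the data set is large and does not fit entirely in RAM).

EDIT

Based on ypercube's answer, there are three alternative queries that return the correct results. I have collected them here for reference, together with their run times: the first version is the fastest.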

Using EXISTS (1 loop, best of 3: 2.22 s per loop):

WITH cte AS
( SELECT *
FROM adword_keywords
WHERE keyword_campaign_status = 'ENABLED'
AND keyword_ad_group_status = 'ENABLED'
AND keyword_status = 'ENABLED'
)
SELECT a.*
FROM cte AS a
WHERE EXISTS
( SELECT *
FROM cte AS b
WHERE (b.keyword_text, b.keyword_match_type)
= (a.keyword_text, a.keyword_match_type)
AND b.db_id <> a.db_id
) ;

Using PARTITION (1 loop, best of 3: 5.7 s per loop):

WITH cte AS
( SELECT *,
COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt
FROM adword_keywords
WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED')
)
SELECT *
FROM cte
WHERE cnt >= 2 ;

Using GROUP BY (1 loop, best of 3: 5.11 s per loop):

select ak.*
from
adword_keywords ak
inner join (
select keyword_text, keyword_match_type
from adword_keywords
where
keyword_campaign_status = 'ENABLED' AND
keyword_ad_group_status = 'ENABLED' AND
keyword_status = 'ENABLED'
group by keyword_text, keyword_match_type
having count(db_id) > 1
) an1 using (keyword_text, keyword_match_type)
where
keyword_campaign_status = 'ENABLED' AND
keyword_ad_group_status = 'ENABLED' AND
keyword_status = 'ENABLED'

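Best answer

You can use EXISTS for this kind of query, so there is no COUNT(!) at all, only a check that at least one other row exists with the same keyword_text and keyword_match_type. The check on the primary key is there to make sure it is a different row: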

WITH cte AS
( SELECT *
FROM adword_keywords
WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED')
)
SELECT a.*
FROM cte AS a
WHERE EXISTS
( SELECT *
FROM cte AS b
WHERE (b.keyword_text, b.keyword_match_type)
= (a.keyword_text, a.keyword_match_type)
AND b.db_id <> a.db_id
) ;
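
To make the EXISTS logic concrete, here is a minimal sketch that runs the same pattern against a tiny in-memory table. It rests on several assumptions that are not part of the question: SQLite (through Python's sqlite3 module) stands in for Postgres, the hypothetical table kw keeps only four of the original columns, a single status column replaces the three ENABLED checks, and the row-value comparison is written out as two AND conditions.

```python
import sqlite3

# Hypothetical, cut-down stand-in for adword_keywords (SQLite, not Postgres).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kw (db_id INTEGER PRIMARY KEY, keyword_text TEXT, "
    "keyword_match_type TEXT, keyword_status TEXT)"
)
conn.executemany(
    "INSERT INTO kw VALUES (?, ?, ?, ?)",
    [
        (1, "pizza", "EXACT", "ENABLED"),
        (2, "pizza", "EXACT", "ENABLED"),  # same group as row 1
        (3, "pizza", "BROAD", "ENABLED"),  # alone in its group
        (4, "sushi", "EXACT", "ENABLED"),
        (5, "sushi", "EXACT", "PAUSED"),   # filtered out, so row 4 is alone
    ],
)

EXISTS_QUERY = """
WITH cte AS (
    SELECT * FROM kw WHERE keyword_status = 'ENABLED'
)
SELECT a.db_id
FROM cte AS a
WHERE EXISTS (
    SELECT 1
    FROM cte AS b
    WHERE b.keyword_text = a.keyword_text
      AND b.keyword_match_type = a.keyword_match_type
      AND b.db_id <> a.db_id  -- must be a different row
)
ORDER BY a.db_id
"""

exists_ids = [row[0] for row in conn.execute(EXISTS_QUERY)]
print(exists_ids)
```

Only rows 1 and 2 come back: row 3 has no sibling with the same match type, and row 4 loses its only sibling to the status filter inside the CTE.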

Or with a window function:

WITH cte AS
( SELECT *,
COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt
FROM adword_keywords
WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED')
)
SELECT *
FROM cte
WHERE cnt > 1 ;
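
What COUNT(*) OVER (PARTITION BY ...) computes can be sketched in plain Python: every row receives the size of its own partition, and the outer WHERE keeps the rows whose partition has more than one member. The sample rows and the helper partition_key below are hypothetical and only illustrate the idea; this is not how Postgres executes the query.

```python
from collections import Counter

# Hypothetical rows that already passed the ENABLED filter.
rows = [
    {"db_id": 1, "keyword_text": "pizza", "keyword_match_type": "EXACT"},
    {"db_id": 2, "keyword_text": "pizza", "keyword_match_type": "EXACT"},
    {"db_id": 3, "keyword_text": "pizza", "keyword_match_type": "BROAD"},
]


def partition_key(row):
    """Columns listed in PARTITION BY."""
    return (row["keyword_text"], row["keyword_match_type"])


# First pass: size of each partition.
partition_sizes = Counter(partition_key(row) for row in rows)

# Second pass: attach the count to every row (the cnt column of the CTE).
with_cnt = [dict(row, cnt=partition_sizes[partition_key(row)]) for row in rows]

# Outer query: WHERE cnt > 1.
kept_ids = [row["db_id"] for row in with_cnt if row["cnt"] > 1]
print(kept_ids)
```

Unlike GROUP BY, the window function keeps every input row, so no join back to the table is needed to recover the remaining columns.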

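Your query did not work because you had the ENABLED conditions only at the outer level. Adding them to the inner query (the derived table) as well should give the same results: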

SELECT ak.*
FROM
adword_keywords ak
JOIN
( SELECT keyword_text, keyword_match_type
FROM adword_keywords
WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED')
GROUP BY keyword_text, keyword_match_type
HAVING COUNT(*) > 1
) AS d
USING (keyword_text, keyword_match_type)
WHERE (ak.keyword_campaign_status, ak.keyword_ad_group_status, ak.keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED');
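
The effect of filtering only at the outer level can be reproduced on a toy table. The sketch below makes assumptions that are not in the question: SQLite via Python's sqlite3 instead of Postgres, a hypothetical table kw with one status column in place of the three ENABLED checks, and a helper named duplicated_ids that builds both variants of the GROUP BY query.

```python
import sqlite3

# Hypothetical, cut-down stand-in for adword_keywords (SQLite, not Postgres).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kw (db_id INTEGER PRIMARY KEY, keyword_text TEXT, "
    "keyword_match_type TEXT, keyword_status TEXT)"
)
conn.executemany(
    "INSERT INTO kw VALUES (?, ?, ?, ?)",
    [
        (1, "sushi", "EXACT", "ENABLED"),
        (2, "sushi", "EXACT", "PAUSED"),   # inflates the group count of row 1
        (3, "pizza", "EXACT", "ENABLED"),
        (4, "pizza", "EXACT", "ENABLED"),
    ],
)


def duplicated_ids(filter_inside):
    """Run the GROUP BY variant, with or without the inner status filter."""
    inner_where = "WHERE keyword_status = 'ENABLED'" if filter_inside else ""
    sql = f"""
        SELECT ak.db_id
        FROM kw AS ak
        JOIN (
            SELECT keyword_text, keyword_match_type
            FROM kw
            {inner_where}
            GROUP BY keyword_text, keyword_match_type
            HAVING COUNT(*) > 1
        ) AS d
          ON ak.keyword_text = d.keyword_text
         AND ak.keyword_match_type = d.keyword_match_type
        WHERE ak.keyword_status = 'ENABLED'
        ORDER BY ak.db_id
    """
    return [row[0] for row in conn.execute(sql)]


outer_only = duplicated_ids(filter_inside=False)   # the question's approach
both_levels = duplicated_ids(filter_inside=True)   # the corrected approach
print(outer_only, both_levels)
```

With the filter only outside, the derived table counts the PAUSED row 2, so the sushi group passes HAVING and row 1 is returned even though it is the only ENABLED row of its group. With the filter at both levels only rows 3 and 4 remain.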

Regarding sql - Postgres: group by and extract the groups with more than one element, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24831075/
