PostgreSQL poor performance in LEFT JOIN + IS NOT DISTINCT FROM(PostgreSQL在LEFT JOIN+中的低性能与)-6ren

PostgreSQL poor performance in LEFT JOIN + IS NOT DISTINCT FROM(PostgreSQL在LEFT JOIN+中的低性能与)

翻译作者：bug小助手更新时间：2023-10-26 21:48:12

I'm dealing with JOIN two tables as follows:

我正在处理连接两个表，如下所示：

 SELECT count(*) FROM (
    SELECT * 
    FROM (SELECT * FROM lines_stats_process limit 800) lsp
    LEFT JOIN lines l ON 

    (lsp.id_data_source_connection = l.id_data_source_connection) AND 
    (lsp.id_sys_country IS NOT DISTINCT FROM l.id_sys_country) AND
    (lsp.advertiser_name IS NOT DISTINCT FROM l.advertiser_name) AND 
    (lsp.size_name IS NOT DISTINCT FROM l.size_name) AND 
    (lsp.dsp IS NOT DISTINCT FROM l.dsp) AND 

    (lsp.id_advertiser IS NOT DISTINCT FROM l.id_advertiser) AND 
    (lsp.id_placement IS NOT DISTINCT FROM l.id_placement) AND
    (lsp.id_site IS NOT DISTINCT FROM l.id_site) AND
    (lsp.id_campaign IS NOT DISTINCT FROM l.id_campaign) 
    ) test;

NOTE: I'm using count and limit 800 just for testing purposes. Without limit, the query never ends to execute.

注意：我使用Count and Limit 800只是出于测试目的。如果没有限制，查询将永远不会结束执行。

My DDL:

我的朋友：

    CREATE TABLE public.lines
    (
      id integer NOT NULL DEFAULT nextval('lines_seq'::regclass),
      id_account integer NOT NULL,
      id_data_source integer NOT NULL,
      id_data_source_connection integer NOT NULL,
      id_client integer,
      id_advertiser bigint,
      advertiser_name character varying(200) DEFAULT NULL::character varying,
      id_campaign bigint,
      campaign_name character varying(200) DEFAULT NULL::character varying,
      id_group bigint,
      group_name character varying(200) DEFAULT NULL::character varying,
      id_ad bigint,
      ad_name character varying(200) DEFAULT NULL::character varying,
      id_site bigint,
      site_name character varying(200) DEFAULT NULL::character varying,
      id_placement bigint,
      placement_name character varying(200) DEFAULT NULL::character varying,
      id_format bigint,
      format_name character varying(200) DEFAULT NULL::character varying,
      id_size bigint,
      size_name character varying(200) DEFAULT NULL::character varying,
      created_at timestamp(0) without time zone DEFAULT ('now'::text)::date,
      updated_at timestamp(0) without time zone DEFAULT ('now'::text)::date,
      id_ds_advertiser integer,
      id_ds_campaign integer,
      id_ds_group integer,
      id_ds_ad integer,
      id_ds_site integer,
      id_ds_placement integer,
      id_ds_format integer,
      id_publisher integer,
      id_country integer,
      dsp character varying(45) DEFAULT NULL::character varying,
      country_name character varying(200) DEFAULT NULL::character varying,
      id_ds_country integer,
      id_sys_country integer,
      CONSTRAINT lines_pkey PRIMARY KEY (id),
      CONSTRAINT lines_id_check CHECK (id > 0)
    )
    WITH (
      OIDS=FALSE
    );
    ALTER TABLE public.lines
      OWNER TO postgres;
    GRANT ALL ON TABLE public.lines TO postgres;

    -- Index: public.data_source_connection_lines

    -- DROP INDEX public.data_source_connection_lines;

    CREATE INDEX data_source_connection_lines
      ON public.lines
      USING btree
      (id_data_source_connection);

    -- Index: public.id_data_source_connection211

    -- DROP INDEX public.id_data_source_connection211;

    CREATE INDEX id_data_source_connection211
      ON public.lines
      USING btree
      (id_data_source_connection);

    -- Index: public.lines_advertiser_name

    -- DROP INDEX public.lines_advertiser_name;

    CREATE INDEX lines_advertiser_name
      ON public.lines
      USING btree
      (advertiser_name COLLATE pg_catalog."default");

    -- Index: public.lines_countries

    -- DROP INDEX public.lines_countries;

    CREATE INDEX lines_countries
      ON public.lines
      USING btree
      (id_country);

    -- Index: public.lines_dsp

    -- DROP INDEX public.lines_dsp;

    CREATE INDEX lines_dsp
      ON public.lines
      USING btree
      (dsp COLLATE pg_catalog."default");

    -- Index: public.lines_id_advertiser

    -- DROP INDEX public.lines_id_advertiser;

    CREATE INDEX lines_id_advertiser
      ON public.lines
      USING btree
      (id_advertiser);

    -- Index: public.lines_id_campaign

    -- DROP INDEX public.lines_id_campaign;

    CREATE INDEX lines_id_campaign
      ON public.lines
      USING btree
      (id_campaign);

    -- Index: public.lines_id_data_source_connection

    -- DROP INDEX public.lines_id_data_source_connection;

    CREATE INDEX lines_id_data_source_connection
      ON public.lines
      USING btree
      (id_data_source_connection);

    -- Index: public.lines_id_placement

    -- DROP INDEX public.lines_id_placement;

    CREATE INDEX lines_id_placement
      ON public.lines
      USING btree
      (id_placement);

    -- Index: public.lines_id_site

    -- DROP INDEX public.lines_id_site;

    CREATE INDEX lines_id_site
      ON public.lines
      USING btree
      (id_site);

    -- Index: public.lines_lsp

    -- DROP INDEX public.lines_lsp;

    CREATE INDEX lines_lsp
      ON public.lines
      USING btree
      (id_data_source_connection, advertiser_name COLLATE pg_catalog."default", size_name COLLATE pg_catalog."default", dsp COLLATE pg_catalog."default", id_advertiser, id_placement, id_site, id_campaign, id_sys_country);

    -- Index: public.lines_lsp2

    -- DROP INDEX public.lines_lsp2;

    CREATE INDEX lines_lsp2
      ON public.lines
      USING btree
      (id_data_source_connection, id_country, advertiser_name COLLATE pg_catalog."default", size_name COLLATE pg_catalog."default", dsp COLLATE pg_catalog."default", id_advertiser, id_placement, id_site, id_campaign);

    -- Index: public.lines_size_name

    -- DROP INDEX public.lines_size_name;

    CREATE INDEX lines_size_name
      ON public.lines
      USING btree
      (size_name COLLATE pg_catalog."default");











-- Table: public.lines_stats_process

-- DROP TABLE public.lines_stats_process;

CREATE TABLE public.lines_stats_process
(
  id integer NOT NULL DEFAULT nextval('lines_stats_process_seq'::regclass),
  id_line integer,
  id_line_stat integer,
  id_account integer NOT NULL,
  id_data_source integer NOT NULL,
  id_data_source_connection integer NOT NULL,
  id_client integer,
  id_advertiser bigint,
  advertiser_name character varying(200) DEFAULT NULL::character varying,
  id_campaign bigint,
  campaign_name character varying(200) DEFAULT NULL::character varying,
  id_group bigint,
  group_name character varying(200) DEFAULT NULL::character varying,
  id_ad bigint,
  ad_name character varying(200) DEFAULT NULL::character varying,
  id_site bigint,
  site_name character varying(200) DEFAULT NULL::character varying,
  id_placement bigint,
  placement_name character varying(200) DEFAULT NULL::character varying,
  id_format bigint,
  format_name character varying(200) DEFAULT NULL::character varying,
  id_size bigint,
  size_name character varying(200) DEFAULT NULL::character varying,
  created_at timestamp(0) without time zone DEFAULT ('now'::text)::date,
  updated_at timestamp(0) without time zone DEFAULT ('now'::text)::date,
  id_ds_advertiser integer,
  id_ds_campaign integer,
  id_ds_group integer,
  id_ds_ad integer,
  id_ds_site integer,
  id_ds_placement integer,
  id_ds_format integer,
  id_publisher integer,
  id_country integer,
  dsp character varying(45) DEFAULT NULL::character varying,
  date date NOT NULL,
  country_name character varying(50) DEFAULT NULL::character varying,
  id_sys_country integer,
  id_ds_country integer,
  budget numeric(10,2) DEFAULT NULL::numeric,
  bid numeric(10,2) DEFAULT NULL::numeric,
  CONSTRAINT lines_stats_process_pkey PRIMARY KEY (id),
  CONSTRAINT lines_stats_process_id_check CHECK (id > 0)
)
WITH (
  FILLFACTOR=70, 
  OIDS=FALSE
);
ALTER TABLE public.lines_stats_process
  OWNER TO postgres;
GRANT ALL ON TABLE public.lines_stats_process TO postgres;

-- Index: public.data_source_connection_lsp

-- DROP INDEX public.data_source_connection_lsp;

CREATE INDEX data_source_connection_lsp
  ON public.lines_stats_process
  USING btree
  (id_data_source_connection);

-- Index: public.id_data_source_connection21

-- DROP INDEX public.id_data_source_connection21;

CREATE INDEX id_data_source_connection21
  ON public.lines_stats_process
  USING btree
  (id_data_source_connection);

-- Index: public.lines_stats_process_advertiser_name

-- DROP INDEX public.lines_stats_process_advertiser_name;

CREATE INDEX lines_stats_process_advertiser_name
  ON public.lines_stats_process
  USING btree
  (advertiser_name COLLATE pg_catalog."default");

-- Index: public.lines_stats_process_countries

-- DROP INDEX public.lines_stats_process_countries;

CREATE INDEX lines_stats_process_countries
  ON public.lines_stats_process
  USING btree
  (id_sys_country);

-- Index: public.lines_stats_process_dsp

-- DROP INDEX public.lines_stats_process_dsp;

CREATE INDEX lines_stats_process_dsp
  ON public.lines_stats_process
  USING btree
  (dsp COLLATE pg_catalog."default");

-- Index: public.lines_stats_process_id_advertiser

-- DROP INDEX public.lines_stats_process_id_advertiser;

CREATE INDEX lines_stats_process_id_advertiser
  ON public.lines_stats_process
  USING btree
  (id_advertiser);

-- Index: public.lines_stats_process_id_campaign

-- DROP INDEX public.lines_stats_process_id_campaign;

CREATE INDEX lines_stats_process_id_campaign
  ON public.lines_stats_process
  USING btree
  (id_campaign);

-- Index: public.lines_stats_process_id_data_source_connection

-- DROP INDEX public.lines_stats_process_id_data_source_connection;

CREATE INDEX lines_stats_process_id_data_source_connection
  ON public.lines_stats_process
  USING btree
  (id_data_source_connection);

-- Index: public.lines_stats_process_id_placement

-- DROP INDEX public.lines_stats_process_id_placement;

CREATE INDEX lines_stats_process_id_placement
  ON public.lines_stats_process
  USING btree
  (id_placement);

-- Index: public.lines_stats_process_id_site

-- DROP INDEX public.lines_stats_process_id_site;

CREATE INDEX lines_stats_process_id_site
  ON public.lines_stats_process
  USING btree
  (id_site);

-- Index: public.lines_stats_process_lsp

-- DROP INDEX public.lines_stats_process_lsp;

CREATE INDEX lines_stats_process_lsp
  ON public.lines_stats_process
  USING btree
  (id_data_source_connection, advertiser_name COLLATE pg_catalog."default", size_name COLLATE pg_catalog."default", dsp COLLATE pg_catalog."default", id_advertiser, id_placement, id_site, id_campaign, id_sys_country);

-- Index: public.lines_stats_process_lsp2

-- DROP INDEX public.lines_stats_process_lsp2;

CREATE INDEX lines_stats_process_lsp2
  ON public.lines_stats_process
  USING btree
  (id_data_source_connection, id_sys_country, advertiser_name COLLATE pg_catalog."default", size_name COLLATE pg_catalog."default", dsp COLLATE pg_catalog."default", id_advertiser, id_placement, id_site, id_campaign);

-- Index: public.lines_stats_process_size_name

-- DROP INDEX public.lines_stats_process_size_name;

CREATE INDEX lines_stats_process_size_name
  ON public.lines_stats_process
  USING btree
  (size_name COLLATE pg_catalog."default");

This is the EXPLAIN ANALYZE

这是解释分析

The lines_stats_process has 60K rows. I'm limiting it to 800 in order to have a reasonable execution time.

LINES_STATS_PROCESS有60K行。为了有合理的执行时间，我将其限制在800。

I've made several test, I've disabled the following options:

我已经做了几次测试，我禁用了以下选项：

SET enable_nestloop =off;
SET enable_material = off;
SET enable_seqscan = off;

But never gets a better performance.

但从来没有得到更好的表现。

The only change that has some effect was to replace IS NOT DISTINCT FROM for =. After change that, the execution takes less than a second.
Of course, that operator isn't correct for my logic, so I need to use IS NOT DISTINCT FROM because I need null-safe comparison.
After made all the replaces, I started to put back IS NOT DISTINCT FROM one by one. When I put back the last IS NOT DISTINCT FROM the query takes too long to execute.

唯一有一些影响的更改是替换，这与for=没有区别。更改后，执行所需时间不到一秒。当然，该操作符对于我的逻辑是不正确的，所以我需要使用IS NOT DISTINCT FOR，因为我需要空安全比较。做完所有的更换后，我开始把没有区别的一件件放回去。当我放回最后一个不同于查询的时候，执行时间太长了。

I put the = in any other condition with the same effect. With 8 IS NOT DISTINCT FROM takes too long. With 7 IS NOT DISTINCT FROM it executes with less than a second.

我把=放在任何其他条件下都有同样的效果。用8个不明显要花的时间太长。与7没有区别，它的执行时间不到一秒。

Sorry for my english, is not my native language.

对不起，我的英语，不是我的母语。

更多回答

The link to the plan is not working (at least for me)

到计划的链接不起作用(至少对我来说)

IS NOT DISTINCT FROM can not use an index, see e.g. the discussion on the mailing list: postgresql.nabble.com/… the recommendation there is to rewrite that to ((a IS NULL AND b IS NULL) OR (a = b)) instead of a is not distinct from b - for that Postgres could use an index

与不能使用索引没有区别，参见邮件列表上的讨论：postgresql.nabble.com/…建议将其重写为((a为空，b为空)OR(a=b))，而不是a与b不同-因为postgres可以使用索引

Note: all your fields look like foreign keys into separate domains ("dimensions") Are they? what is their cardinality? Are your two tables fact tables ?

注意：您的所有字段看起来都像是进入单独的域(“维度”)的外键，是吗？它们的基数是多少？您的两个表是事实数据表吗？

Why do you have 5 indexes on id_data_source_connection? That is a total waste of space and time.

为什么您在id_data_source_Connection上有5个索引？这完全是在浪费空间和时间。

They are not duplicated; they appear twice in the lines table (probably), and you asked for them.

它们不是重复的；它们在LINES表中出现了两次(可能)，并且是您要求的。

优秀答案推荐

The problem is that IS NOT DISTINCT FROM is not indexable*, which leads to sequential scans, which becomes terribly slow horribly fast.

问题是没有区别于不是不可索引的*，这会导致顺序扫描，这会变得非常慢，非常快。

The short answer to your problem is: don't use NULL the way you're using it now. Joining on equality of NULL values is A Bad Idea™;

对您的问题的简短回答是：不要以您现在使用它的方式使用它。关于空值相等的连接是一个坏主意™

The somewhat longer answer is: You should think of NULL, semantically, as value "non-existent", so the implication is that it does not carry meaning other than the absence of meaning.

稍微长一点的答案是：在语义上，你应该认为空值是“不存在的”，所以它的含义是，它除了缺乏意义外，没有其他意义。

However, if you replace all IS NOT DISTINCT FROM with equality plus null check, you should be fine as well, meaning a IS DISTINCT FROM b should be written as (a=b OR (a IS NULL AND b IS NULL)).

然而，如果您用相等加NULL检查替换ALL IS NOT DISTINCT FOR，您也应该没有问题，这意味着a IS DISTINCT FOR应该写成(a=b OR(A IS NULL And B IS NULL))。

*) As a final aside, it appears an implementation detail that IS NOT DISTINCT FROM is not indexable. I'm not sure about the details, but someone might be able to elaborate.

*)作为最后的补充，与没有区别的实现细节似乎是不可索引的。我不确定细节，但也许有人能详细说明。

更多回答

文章推荐： How to use NodePort with kind?(如何将NodePort与Kind一起使用？)

bug小助手

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

PostgreSQL poor performance in LEFT JOIN + IS NOT DISTINCT FROM(PostgreSQL在LEFT JOIN+中的低性能与)