sql - Get paginated rows and total count in a single query

Core requirement:
Find the latest entry per person_id by submission_date, with the specified filter criteria on type, plan, and status. There could be more such filters, but the logic of returning the latest row by submission date is the same regardless. There are two main uses: one for paginated viewing in the UI, the other for generating reports.

WITH cte AS (
   SELECT * FROM (
      SELECT my_table.*,
             rank() OVER (PARTITION BY person_id
                          ORDER BY submission_date DESC, last_updated DESC, id DESC)
      FROM   my_table
      ) rank_filter
   WHERE  rank = 1
   AND    status IN ('ACCEPTED', 'CORRECTED')
   AND    type <> 'CR'
   AND    h_plan_id IN (10000, 20000)
   )
-- attempted combination (not valid SQL as written):
SELECT
   SELECT count(id) FROM cte GROUP BY id,
   SELECT * FROM cte LIMIT 10 OFFSET 0;
GROUP BY on the CTE doesn't work either. A UNION of the count query, with all the other columns set to NULL, might work for combining them, but I'm not sure.
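To make that UNION idea concrete, a minimal sketch might look like this (the count branch pads the data columns with NULLs so both branches line up; only id and person_id are kept for brevity). It is fragile, because the count row only survives on the first page and branch order is not guaranteed without an ORDER BY:

WITH cte AS (
   SELECT * FROM (
      SELECT my_table.*,
             row_number() OVER (PARTITION BY person_id
                                ORDER BY submission_date DESC, last_updated DESC, id DESC) AS rn
      FROM   my_table
      ) sub
   WHERE  rn = 1
   AND    status IN ('ACCEPTED', 'CORRECTED')
   AND    type <> 'CR'
   AND    h_plan_id IN (10000, 20000)
   )
SELECT NULL::bigint AS id, NULL::text AS person_id, count(*) AS total_rows
FROM   cte
UNION ALL
SELECT id, person_id, NULL
FROM   cte
LIMIT  11 OFFSET 0;  -- one extra row for the leading count row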

The main reason I want to merge these two into one query is that the table is large and the window function is expensive. Currently I run them as separate queries, which essentially executes the same expensive query twice.

Postgres version 12.
\d my_table;
Table "public.my_table"
Column | Type | Collation | Nullable
--------------------------+-----------------------------+-----------+----------
id | bigint | | not null
h_plan_id | bigint | | not null
h_plan_submitter_id | bigint | |
last_updated | timestamp without time zone | |
date_created | timestamp without time zone | |
modified_by | character varying(255) | |
segment_number | integer | |

-- <bunch of other text columns>

submission_date | character varying(255) | |
person_id | character varying(255) | |
status | character varying(255) | |
file_id | bigint | | not null
Indexes:
"my_table_pkey" PRIMARY KEY, btree (id)
"my_table_file_idx" btree (file_id)
"my_table_hplansubmitter_idx" btree (h_plan_submitter_id)
"my_table_key_hash_idx" btree (key_hash)
"my_table_person_id_idx" btree (person_id)
"my_table_segment_number_idx" btree (segment_number)
Foreign-key constraints:
"fk38njesaryvhj7e3p4thqkq7pb" FOREIGN KEY (h_plan_id) REFERENCES health_plan(id) ON UPDATE CASCADE ON DELETE CASCADE
"fk6by9668sowmdob7433mi3rpsu" FOREIGN KEY (h_plan_submitter_id) REFERENCES h_plan_submitter(id) ON UPDATE CASCADE ON DELETE CASCADE
"fkb06gpo9ng6eujkhnes0eco7bj" FOREIGN KEY (file_id) REFERENCES x12file(id) ON UPDATE CASCADE ON DELETE CASCADE

Additional information: the possible values of type are EN and CR, with EN making up roughly 70% of the data.
Table column widths: select avg_width from pg_stats where tablename='mytable'; sums to 374 across 41 columns, so roughly 9 per column.
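For reference, that per-column sum can be computed directly from the planner statistics (assuming the table has been analyzed recently):

-- total estimated row width and number of columns with statistics
SELECT sum(avg_width) AS total_avg_width, count(*) AS n_columns
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'my_table';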

The idea is to show the user a few pages, which they can then filter by additional parameters, for example file_name (each file typically has about 5k entries), type (very low cardinality), member_add_id (very high cardinality), and plan_id (very small cardinality; every 500k to one million entries are associated with one plan id). In all cases the business requirement is to show only the latest record per submission_date for a set of plan ids (for reports done each year). Ordering by id is just defensive coding: there can be multiple entries on the same day, and even if someone edits the second-to-last entry and thereby touches the last_updated timestamp, we still want to show only the last entry for the same data. This may never happen and could be removed.

Users can generate CSV reports from this data.
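One way to produce such a CSV straight from psql is the \copy meta-command; a sketch, assuming the output file name report.csv and a latest-per-person query like the one in the answer below:

\copy (SELECT DISTINCT ON (person_id) * FROM my_table WHERE status IN ('ACCEPTED', 'CORRECTED') ORDER BY person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC) TO 'report.csv' WITH (FORMAT csv, HEADER)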

Below is the EXPLAIN result for the query with the right join:
 Nested Loop Left Join  (cost=554076.32..554076.56 rows=10 width=17092) (actual time=4530.914..4530.922 rows=10 loops=1)
   CTE cte
     ->  Unique  (cost=519813.11..522319.10 rows=495358 width=1922) (actual time=2719.093..3523.029 rows=422638 loops=1)
           ->  Sort  (cost=519813.11..521066.10 rows=501198 width=1922) (actual time=2719.091..3301.622 rows=423211 loops=1)
                 Sort Key: mytable.person_id, mytable.submission_date DESC NULLS LAST, mytable.last_updated DESC NULLS LAST, mytable.id DESC
                 Sort Method: external merge  Disk: 152384kB
                 ->  Seq Scan on mytable  (cost=0.00..54367.63 rows=501198 width=1922) (actual time=293.953..468.554 rows=423211 loops=1)
                       Filter: (((status)::text = ANY ('{ACCEPTED,CORRECTED}'::text[])) AND (h_plan_id = ANY ('{1,2}'::bigint[])) AND ((type)::text <> 'CR'::text))
                       Rows Removed by Filter: 10158
   ->  Aggregate  (cost=11145.56..11145.57 rows=1 width=8) (actual time=4142.116..4142.116 rows=1 loops=1)
         ->  CTE Scan on cte  (cost=0.00..9907.16 rows=495358 width=0) (actual time=2719.095..4071.481 rows=422638 loops=1)
   ->  Limit  (cost=20611.67..20611.69 rows=10 width=17084) (actual time=388.777..388.781 rows=10 loops=1)
         ->  Sort  (cost=20611.67..21850.06 rows=495358 width=17084) (actual time=388.776..388.777 rows=10 loops=1)
               Sort Key: cte_1.person_id
               Sort Method: top-N heapsort  Memory: 30kB
               ->  CTE Scan on cte cte_1  (cost=0.00..9907.16 rows=495358 width=17084) (actual time=0.013..128.314 rows=422638 loops=1)
 Planning Time: 0.369 ms
 JIT:
   Functions: 9
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.947 ms, Inlining 4.983 ms, Optimization 178.469 ms, Emission 110.261 ms, Total 295.660 ms
 Execution Time: 4587.711 ms

Best answer

First things first: you can use the result of a CTE multiple times in the same query; that's a main feature of CTEs. Yours would work like this (still using the CTE only once):

WITH cte AS (
   SELECT * FROM (
      SELECT *, row_number()  -- see below
                OVER (PARTITION BY person_id
                      ORDER BY submission_date DESC NULLS LAST  -- see below
                             , last_updated DESC NULLS LAST     -- see below
                             , id DESC) AS rn
      FROM   tbl
      ) sub
   WHERE  rn = 1
   AND    status IN ('ACCEPTED', 'CORRECTED')
   )
SELECT *, count(*) OVER () AS total_rows_in_cte
FROM   cte
LIMIT  10
OFFSET 0;  -- see below

Caveat 1: rank()
rank() can return multiple rows per person_id with rank = 1. DISTINCT ON (person_id) (as Gordon provided) is an applicable substitute for row_number() and works for you, as the added information clarified. See (a minimal illustration also follows after the link):
  • Select first row in each GROUP BY group?
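A minimal illustration of the difference, using made-up values:

-- rank() ties both rows at 1; row_number() breaks the tie; DISTINCT ON would keep exactly one row
WITH demo(person_id, submission_date) AS (
   VALUES ('p1', '2020-01-01'), ('p1', '2020-01-01')
   )
SELECT person_id,
       rank()       OVER (PARTITION BY person_id ORDER BY submission_date DESC) AS rnk,
       row_number() OVER (PARTITION BY person_id ORDER BY submission_date DESC) AS rn
FROM   demo;
-- rnk is 1 for BOTH rows; rn is 1 and 2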

Caveat 2: ORDER BY submission_date DESC
Neither submission_date nor last_updated is defined NOT NULL, so ORDER BY submission_date DESC, last_updated DESC ... can be a problem. See:
  • PostgreSQL sort by datetime asc, null first?

    Should those columns really be NOT NULL?

    You answered:

    Yes, all those columns should be non-null. I can add that constraint. I put it as nullable since we get data in files which are not always perfect. But this is very rare condition and I can put in empty string instead.


    Empty strings are not allowed for type date. Keep the columns nullable; NULL is the proper value for those cases. Use NULLS LAST as demonstrated to avoid NULL being sorted on top (see the small demonstration below).
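    A small demonstration of the default NULL placement (with DESC, Postgres sorts NULLs first unless told otherwise):

    -- NULLS LAST pushes the NULL to the end; without it, DESC would put it first
    SELECT d
    FROM  (VALUES ('2020-01-01'::date), (NULL), ('2019-06-15')) v(d)
    ORDER  BY d DESC NULLS LAST;  -- 2020-01-01, 2019-06-15, NULL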

    Caveat 3: OFFSET
    If OFFSET is equal to or greater than the number of rows returned by the CTE, you get no rows at all, and hence no total count either. See (a short demonstration also follows after the link):
  • Run a query with a LIMIT/OFFSET and also get the total number of rows
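    A short demonstration of that effect on a throwaway row set:

    -- when OFFSET skips past all rows, nothing comes back, so the window-function total is lost too
    SELECT *, count(*) OVER () AS total
    FROM  (VALUES (1), (2), (3)) v(x)
    LIMIT  10 OFFSET 5;  -- returns zero rows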

    Interim solution

    Addressing all the caveats so far, and based on the added information, we might arrive at this query:
    WITH cte AS (
       SELECT DISTINCT ON (person_id) *
       FROM   tbl
       WHERE  status IN ('ACCEPTED', 'CORRECTED')
       ORDER  BY person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC
       )
    SELECT *
    FROM  (
       TABLE  cte              -- TABLE cte is shorthand for SELECT * FROM cte
       ORDER  BY person_id     -- ?? see below
       LIMIT  10
       OFFSET 0
       ) sub
    RIGHT  JOIN (SELECT count(*) FROM cte) c(total_rows_in_cte) ON true;

    Now the CTE is actually used twice. The RIGHT JOIN guarantees we get the total count no matter the OFFSET. DISTINCT ON should perform OK-ish, with only a few rows per person_id in the base query.

    But your rows are wide. What is the average width? The query will likely result in a sequential scan over the whole table, and indexes won't help. All of this remains hugely inefficient for pagination. See:
  • Optimize query with OFFSET on large table

    You cannot involve an index for pagination, because it is based on the derived table from the CTE. And your actual sort criteria for pagination are still unclear (ORDER BY id?). If pagination is the goal, you desperately need a different query style (see the keyset-pagination sketch below). If you are only interested in the first few pages, you need yet another query style. The best solution depends on information still missing from the question ...
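    One common "different query style" for pagination is keyset pagination; a hedged sketch, assuming pages are ordered by person_id and the client passes the last person_id of the previous page (the prepared-statement name next_page and the example value are hypothetical):

    -- seek past the previous page instead of using OFFSET
    PREPARE next_page(text) AS
    SELECT DISTINCT ON (person_id) *
    FROM   tbl
    WHERE  status IN ('ACCEPTED', 'CORRECTED')
    AND    person_id > $1                      -- last person_id seen on the previous page
    ORDER  BY person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC
    LIMIT  10;

    EXECUTE next_page('P123456');              -- hypothetical value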

    Faster

    For your updated objective:

    Find latest entries for a person_id by submission_date



    (Ignoring "the specified filter criteria of type, plan, status" for simplicity.)

    and:

    Find the latest row per person_id only if that has status IN ('ACCEPTED','CORRECTED')



    based on these two specialized indexes:
    CREATE INDEX ON tbl (submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST)
    WHERE status IN ('ACCEPTED', 'CORRECTED'); -- optional

    CREATE INDEX ON tbl (person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST);

    run this query:
    WITH RECURSIVE cte AS (
       (
       SELECT t  -- whole row
       FROM   tbl t
       WHERE  status IN ('ACCEPTED', 'CORRECTED')
       AND    NOT EXISTS (SELECT FROM tbl
                          WHERE  person_id = t.person_id
                          AND   (  submission_date,   last_updated,   id)
                              > (t.submission_date, t.last_updated, t.id)  -- row-wise comparison
                          )
       ORDER  BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
       LIMIT  1
       )

       UNION ALL
       SELECT (SELECT t1  -- whole row
               FROM   tbl t1
               WHERE ( t1.submission_date,  t1.last_updated,  t1.id)
                   < ((t).submission_date, (t).last_updated, (t).id)      -- row-wise comparison
               AND    t1.status IN ('ACCEPTED', 'CORRECTED')
               AND    NOT EXISTS (SELECT FROM tbl
                                  WHERE  person_id = t1.person_id
                                  AND   (   submission_date,    last_updated,    id)
                                      > (t1.submission_date, t1.last_updated, t1.id)  -- row-wise comparison
                                  )
               ORDER  BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
               LIMIT  1)
       FROM   cte c
       WHERE  (t).id IS NOT NULL
       )
    SELECT (t).*
    FROM   cte
    LIMIT  10
    OFFSET 0;

    Each set of parentheses is required here.
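    As an aside, the (optional) partial index above can only be chosen when a query's WHERE clause covers its predicate, which the recursive query does via status IN ('ACCEPTED', 'CORRECTED'). A quick way to check whether it is actually used is an EXPLAIN run against the ordered lookup it is meant to serve, for example:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM   tbl
    WHERE  status IN ('ACCEPTED', 'CORRECTED')
    ORDER  BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
    LIMIT  1;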

    This level of sophistication should retrieve a relatively small set of top rows radically faster, by using the given indexes and avoiding a sequential scan. See:
  • Optimize GROUP BY query to retrieve latest row per user
    submission_date should most likely be of type timestamptz or date, not character varying(255), which is an odd type definition in Postgres in any case (a conversion sketch follows after the link). See:
  • Refactor foreign key to fields
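    A hedged conversion sketch, assuming the existing strings are ISO-formatted dates and that empty strings should become NULL:

    -- rewrite the column in place; fails if any value is not a valid date
    ALTER TABLE my_table
       ALTER COLUMN submission_date TYPE date
       USING NULLIF(submission_date, '')::date;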

    Many other details might be optimized, but this is getting out of hand already. You might consider professional consulting.

    Regarding sql - Get paginated rows and total count in a single query, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61160156/
