gpt4 book ai didi

postgresql - 使用 postgres_fdw 加速包含多个自连接的 View

转载 作者:行者123 更新时间:2023-11-29 13:02:10 26 4
gpt4 key购买 nike

(警告道歉和骇客入侵...)

背景:

我有一个遗留应用程序,我想避免重写它的大量 SQL 代码。我正在尝试加快它执行的特定类型的非常昂贵的查询(即:容易实现的成果)。

它有一个由 transactions 表表示的财务交易分类帐。当插入新行时,触发函数(此处未显示)为给定实体结转新余额。

某些类型的交易模型外部性(如机上支付)通过使用“相关”交易标记新交易,以便应用程序可以将相关交易组合在一起。

\d transactions

Table "public.transactions"
Column | Type | Modifiers
---------------------+-----------+-----------
entityid | bigint | not null
transactionid | bigint | not null default nextval('tid_seq')
type | smallint | not null
status | smallint | not null
related | bigint |
amount | bigint | not null
abs_amount | bigint | not null
is_credit | boolean | not null
inserted | timestamp | not null default now()
description | text | not null
balance | bigint | not null

Indexes:
"transactions_pkey" PRIMARY KEY, btree (transactionid)
"transactions by entityid" btree (entityid)
"transactions by initial trans" btree ((COALESCE(related, transactionid)))

Foreign-key constraints:
"invalid related transaction!" FOREIGN KEY (related)
REFERENCES transactions(transactionid)

在我的测试数据集中,我有:

  • 总共大约 550 万行
  • 大约 370 万行没有“相关”交易
  • 大约 180 万行与“相关”交易
  • 大约 55k 个不同的实体 ID(客户)。

因此,大约 1/3 的交易行是与某个早期交易“相关”的更新。生产数据大约是 transactionid 的 25 倍,distinct entityid 的大约 8 倍,1/3 的比例用于交易更新。

该代码查询一个特别低效的 VIEW,它被定义为:

CREATE VIEW collapsed_transactions AS
SELECT t.entityid,
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM ( SELECT
COALESCE(x.related, x.transactionid) AS initial,
max(x.transactionid) AS latest
FROM transactions x
GROUP BY COALESCE(x.related, x.transactionid)
) g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;

典型的查询采用以下形式:

SELECT * FROM collapsed_transactions WHERE entityid = 204425;

如您所见,where entityid = 204425 子句不会用于约束 GROUP BY 子查询,因此所有 entitids ' 事务将被分组,导致 55,000 个更大的子查询结果集和愚蠢地更长的查询时间......在撰写本文时,所有这些都平均达到 40 行(本例中为 71 行)。

我无法进一步规范化 transactions 表(比如让 initial_transactionsupdated_transactions 表由 related) 无需重写数百个代码库的 SQL 查询,其中许多查询以不同的方式使用自连接语义。

洞察力:

我最初尝试使用 WINDOW 函数重写查询,但是当我看到 www_fdw 时遇到了各种各样的问题(另一个 SO 问题)。将其 WHERE 子句作为 GET/POST 参数传递给 HTTP,我对无需太多重组即可优化非常幼稚的查询的可能性感到非常感兴趣。

Postgresql 9.3 manual说:

F.31.4. Remote Query Optimization

postgres_fdw attempts to optimize remote queries to reduce the amount of data transferred from foreign servers. This is done by sending query WHERE clauses to the remote server for execution, and by not retrieving table columns that are not needed for the current query. To reduce the risk of misexecution of queries, WHERE clauses are not sent to the remote server unless they use only built-in data types, operators, and functions. Operators and functions in the clauses must be IMMUTABLE as well.

The query that is actually sent to the remote server for execution can be examined using EXPLAIN VERBOSE.

尝试:

所以我认为也许我可以将 GROUP-BY 放入一个 View 中,将该 View 视为一个外部表,并且优化器将通过 WHERE 子句传递到该外部表,从而产生更高效的查询...... .

CREATE VIEW foreign_transactions_grouped_by_initial_transaction AS 
SELECT
entityid,
COALESCE(t.related, t.transactionid) AS initial,
MAX(t.transactionid) AS latest
FROM transactions t
GROUP BY
t.entityid,
COALESCE(t.related, t.transactionid);

CREATE FOREIGN TABLE transactions_grouped_by_initial_transaction
(entityid bigint, initial bigint, latest bigint)
SERVER local_pg_server
OPTIONS (table_name 'foreign_transactions_grouped_by_initial_transaction');

EXPLAIN ANALYSE VERBOSE
SELECT
t.entityid,
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM transactions_grouped_by_initial_transaction g
INNER JOIN transactions t on t.transactionid = g.latest
INNER JOIN transactions i on i.transactionid = g.initial
WHERE g.entityid = 204425;

效果非常好!

 Nested Loop  (cost=100.87..305.05 rows=10 width=116) 
(actual time=4.113..16.646 rows=71 loops=1)
Output: t.entityid, g.initial, g.latest, i.inserted,
t.inserted, t.type, t.status, t.amount, t.abs_amount,
t.balance, t.description
-> Nested Loop (cost=100.43..220.42 rows=10 width=108)
(actual time=4.017..10.725 rows=71 loops=1)
Output: g.initial, g.latest, t.entityid, t.inserted,
t.type, t.status, t.amount, t.abs_amount, t.is_credit,
t.balance, t.description
-> Foreign Scan on public.transactions_grouped_by_initial_transaction g
(cost=100.00..135.80 rows=10 width=16)
(actual time=3.914..4.694 rows=71 loops=1)
Output: g.entityid, g.initial, g.latest
Remote SQL:
SELECT initial, latest
FROM public.foreign_transactions_grouped_by_initial_transaction
WHERE ((entityid = 204425))
-> Index Scan using transactions_pkey on public.transactions t
(cost=0.43..8.45 rows=1 width=100)
(actual time=0.023..0.035 rows=1 loops=71)
Output: t.entityid, t.transactionid, t.type, t.status,
t.related, t.amount, t.abs_amount, t.is_credit,
t.inserted, t.description, t.balance
Index Cond: (t.transactionid = g.latest)
-> Index Scan using transactions_pkey on public.transactions i
(cost=0.43..8.45 rows=1 width=16)
(actual time=0.021..0.033 rows=1 loops=71)
Output: i.entityid, i.transactionid, i.type, i.status,
i.related, i.amount, i.abs_amount, i.is_credit,
i.inserted, i.description, i.balance
Index Cond: (i.transactionid = g.initial)
Total runtime: 20.363 ms

问题:

但是,当我尝试将其烘焙到 VIEW 中(有或没有另一层 postgres_fdw)时,查询优化器似乎没有通过 WHERE 子句:-(

CREATE view collapsed_transactions_fast AS
SELECT
t.entityid,
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM transactions_grouped_by_initial_transaction g
INNER JOIN transactions t on t.transactionid = g.latest
INNER JOIN transactions i on i.transactionid = g.initial;

EXPLAIN ANALYSE VERBOSE
SELECT * FROM collapsed_transactions_fast WHERE entityid = 204425;

结果:

Nested Loop  (cost=534.97..621.88 rows=1 width=117) 
(actual time=104720.383..139307.940 rows=71 loops=1)
Output: t.entityid, g.initial, g.latest, i.inserted, t.inserted, t.type,
t.status, t.amount, t.abs_amount, t.is_credit, t.balance,
t.description
-> Hash Join (cost=534.53..613.66 rows=1 width=109)
(actual time=104720.308..139305.522 rows=71 loops=1)
Output: g.initial, g.latest, t.entityid, t.inserted, t.type,
t.status, t.amount, t.abs_amount, t.is_credit, t.balance,
t.description
Hash Cond: (g.latest = t.transactionid)
-> Foreign Scan on public.transactions_grouped_by_initial_transaction g
(cost=100.00..171.44 rows=2048 width=16)
(actual time=23288.569..108916.051 rows=3705600 loops=1)
Output: g.entityid, g.initial, g.latest
Remote SQL:
SELECT initial, latest
FROM public.foreign_transactions_grouped_by_initial_transaction
-> Hash (cost=432.76..432.76 rows=142 width=101)
(actual time=2.103..2.103 rows=106 loops=1)
Output:
t.entityid, t.inserted, t.type, t.status, t.amount,
t.abs_amount, t.is_credit, t.balance, t.description,
t.transactionid
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Index Scan using "transactions by entityid"
on public.transactions t
(cost=0.43..432.76 rows=142 width=101)
(actual time=0.049..1.241 rows=106 loops=1)
Output: t.entityid, t.inserted, t.type, t.status,
t.amount, t.abs_amount, t.is_credit,
t.balance, t.description, t.transactionid
Index Cond: (t.entityid = 204425)
-> Index Scan using transactions_pkey on public.transactions i
(cost=0.43..8.20 rows=1 width=16)
(actual time=0.013..0.018 rows=1 loops=71)
Output: i.entityid, i.transactionid, i.type, i.status, i.related,
i.amount, i.abs_amount, i.is_credit, i.inserted, i.description,
i.balance
Index Cond: (i.transactionid = g.initial)
Total runtime: 139575.140 ms

如果我可以将该行为嵌入到 VIEW 或 FDW 中,那么我只需在极少数查询中替换 VIEW 的 name 即可使其更加高效。我不在乎它对于其他一些用例(更复杂的 WHERE 子句)是否超慢,我将命名 VIEW 以反射(reflect)其预期用途。

use_remote_estimate 的默认值为 FALSE,但这两种方式都没有区别。

问题:

我可以使用一些技巧来使这个公认的 hack 工作吗?

最佳答案

如果我没有正确理解您的问题,答案是“否”。没有任何“技巧”可以让额外的 where 子句通过 fdw 包装器传递。

但是,我认为您优化的可能是错误的。

我将替换整个 collapsed_transactions View 。除非我遗漏了什么,否则它只取决于交易表。创建一个表,使用触发器对其进行更新,并且只向普通用户授予 SELECT 权限。从 pgtap 获取一些测试工具如果您还没有,那么您可以开始了。


编辑: View 优化。

如果您只想针对 View 优化该查询,并且可以调整 View 的定义,请尝试以下操作:

CREATE VIEW collapsed_transactions AS
SELECT
g.entityid, -- THIS HERE
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM (
SELECT
entityid, -- THIS HERE
COALESCE(x.related, x.transactionid) AS initial,
max(x.transactionid) AS latest
FROM transactions x
GROUP BY entityid, COALESCE(x.related, x.transactionid)
) g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;

请注意,子查询公开了 entityid 并允许我们对其进行过滤。我假设 entityid 对于主要项目和相关项目是不变的,否则我看不到查询如何工作。这应该让计划者对问题有足够的把握,以便首先在 entityid 上使用索引并将查询时间缩短到毫秒。

关于postgresql - 使用 postgres_fdw 加速包含多个自连接的 View ,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/26501327/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com