postgresql - What is the best query to find the nearest neighbor in a large dataset in PostgreSQL?

Reposted. Author: 行者123. Updated: 2023-11-29 12:23:31

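I have a huge table (about 40 million rows) called nearest_spot, which represents lines (in linestring format) and their closest spots (there are about 1,500 distinct spots, stored in another table). The nearest_spot table looks like this: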

 data_id || spot_id || spot_name || link_geom 

Here data_id is the primary key, spot_id is a foreign key to the primary key of the spot table, spot_name is the spot name (I know redundancy isn't good, but I'm not allowed to modify the database), and link_geom holds the line coordinates.


The database runs PostgreSQL 10.6 with PostGIS 2.5. There is a GiST index on the link_geom column, and VACUUM ANALYZE has already been run on the nearest_spot table.

My goal is to find the nearest neighbor (in this table) to a point in a data record, as fast as possible.

I already know how to find the nearest neighbor; my problem is the time it takes. I'm pretty new to PostgreSQL and PostGIS. I've been reading their documentation and going through a lot of topics about KNN optimization, looking for the most effective approach, and yet I can't get a result in under 5 minutes (sometimes it goes up to 30 minutes), even when searching for only one row. The queries I've tried are as follows:

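-- Attempt 1: prefilter with ST_DWithin, rank by <->, then refine with ST_Distance
SELECT sub.position, sub.spot_id
FROM (
    SELECT A.position, B.spot_id, B.link_geom
    FROM data A, nearest_spot B
    WHERE A.id = 1
      AND ST_DWithin(A.position, B.link_geom, 20)
    ORDER BY A.position <-> B.link_geom
    LIMIT 10
) AS sub
ORDER BY ST_Distance(sub.position, sub.link_geom)
LIMIT 1;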

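-- Attempt 2: prefilter with a bounding-box overlap (&&) against a buffer
SELECT sub.position, sub.spot_id
FROM (
    SELECT A.position, B.spot_id, B.link_geom
    FROM data A, nearest_spot B
    WHERE A.id = 1
      AND ST_Buffer(A.position, 20) && B.link_geom
    ORDER BY A.position <-> B.link_geom
    LIMIT 10
) AS sub
ORDER BY ST_Distance(sub.position, sub.link_geom)
LIMIT 1;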

-- Attempt 3: prefilter with ST_Intersects against a buffer
SELECT sub.position, sub.spot_id
FROM (
    SELECT A.position, B.spot_id, B.link_geom
    FROM data A, nearest_spot B
    WHERE A.id = 1
      AND ST_Intersects(ST_Buffer(A.position, 20), B.link_geom)
    ORDER BY A.position <-> B.link_geom
    LIMIT 10
) AS sub
ORDER BY ST_Distance(sub.position, sub.link_geom)
LIMIT 1;

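The reason I ORDER BY <-> first and then by ST_Distance is that, according to this documentation from PostGIS, <-> is faster but less precise (it works on bounding boxes), while ST_Distance is more precise but slower.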

I also used this documentation on spatial indexes, and this one on the <-> operator, both also from PostGIS.
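For reference, a common way to write an index-assisted nearest-neighbor lookup is a LATERAL subquery ordered by <->, wrapped in EXPLAIN to see whether the GiST index is actually used. This is only a sketch against the table and column names from the question, not something taken from the answer. Note that with SRID 4326 geometry the distance that <-> returns is in degrees, not meters.

```sql
-- Sketch: KNN lookup for one data row, using the names from the question.
-- EXPLAIN (ANALYZE, BUFFERS) runs the query and prints the actual plan,
-- so it shows whether the GiST index on link_geom drives the ORDER BY.
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.id, n.spot_id, n.dist
FROM data AS a
CROSS JOIN LATERAL (
    SELECT b.spot_id,
           a.position <-> b.link_geom AS dist   -- degrees for SRID 4326 geometry
    FROM nearest_spot AS b
    ORDER BY a.position <-> b.link_geom
    LIMIT 1
) AS n
WHERE a.id = 1;
```

If the plan shows a sequential scan and a sort over nearest_spot instead of an index scan, the ordering is not being served by the index, and that is worth investigating before tuning anything else.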

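Edit: I realized that all my coordinates are stored as geometry (SRID 4326). So the ST_DWithin call, although syntactically valid, was not returning the rows within 20 meters as I thought it would, but the rows within 20 degrees, and all of my rows are within 20 degrees of each other. My ST_DWithin therefore didn't shrink the result set at all, which is probably one of the biggest reasons it took so long; the same goes for ST_Buffer. I'm going to try casting all coordinates to geography (using ::geography) before using them with meters, and hopefully I'll see an improvement.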
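To make the units issue concrete: one degree of latitude is roughly 111 km, so a radius of 20 on SRID 4326 geometry covers thousands of kilometers rather than 20 meters. A minimal sketch of the geography variant described in the edit could look like the following. The index name is hypothetical, and whether the expression index gets used should be checked with EXPLAIN rather than assumed.

```sql
-- Sketch only: assumes position and link_geom are geometry with SRID 4326.
-- Expression index so the geography predicate has an index to work with
-- (index name is hypothetical).
CREATE INDEX nearest_spot_link_geog_idx
    ON nearest_spot USING gist ((link_geom::geography));

-- On geography, ST_DWithin and ST_Distance work in meters.
SELECT b.spot_id,
       ST_Distance(a.position::geography, b.link_geom::geography) AS dist_m
FROM data AS a
JOIN nearest_spot AS b
  ON ST_DWithin(a.position::geography, b.link_geom::geography, 20)  -- 20 meters
WHERE a.id = 1
ORDER BY dist_m
LIMIT 1;
```

This returns no row when nothing lies within 20 meters, so the radius is a trade-off between filter selectivity and the chance of missing the true nearest line.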

Best answer

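It turned out that the table contained a huge number of duplicates (each row was repeated about 1,800 times), and the person who gave it to me had no idea about it. After removing the duplicates, query time was no longer a problem.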
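The answer does not say how the duplicates were found or removed. One way to check for them and build a deduplicated copy is sketched below. It assumes that duplicates share the same spot_id and an identical geometry, and the names nearest_spot_dedup and nearest_spot_dedup_geom_idx are hypothetical.

```sql
-- How many copies does each line have? Comparing on the WKB bytes
-- treats only byte-identical geometries as duplicates.
SELECT spot_id, ST_AsBinary(link_geom) AS geom_wkb, count(*) AS copies
FROM nearest_spot
GROUP BY spot_id, ST_AsBinary(link_geom)
HAVING count(*) > 1
ORDER BY copies DESC
LIMIT 20;

-- Deduplicated copy, keeping the row with the lowest data_id in each group.
CREATE TABLE nearest_spot_dedup AS
SELECT DISTINCT ON (spot_id, ST_AsBinary(link_geom))
       data_id, spot_id, spot_name, link_geom
FROM nearest_spot
ORDER BY spot_id, ST_AsBinary(link_geom), data_id;

-- Rebuild the spatial index and refresh planner statistics on the copy.
CREATE INDEX nearest_spot_dedup_geom_idx
    ON nearest_spot_dedup USING gist (link_geom);
ANALYZE nearest_spot_dedup;
```

With each row repeated about 1,800 times, 40 million rows would shrink to roughly 22,000 distinct lines, which is consistent with the query time no longer being an issue.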

Regarding "postgresql - What is the best query to find the nearest neighbor in a large dataset in PostgreSQL?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55865585/
