gpt4 book ai didi

sql - 分析大图 - 检索簇并计算最强路径

转载 作者:行者123 更新时间:2023-12-02 05:18:44 25 4
gpt4 key购买 nike

我尝试使用广度优先算法来检索所选用户所属的整个连接集群,方法如下:

CREATE PROCEDURE Breadth_First (@StartNode varchar(50), @LinkStrength decimal(10,7) = 0.1, @EndNode varchar(50) = NULL)
AS
BEGIN
-- Automatically rollback the transaction if something goes wrong.
SET XACT_ABORT ON
BEGIN TRAN

-- Increase performance and do not intefere with the results.
SET NOCOUNT ON;

-- Create a temporary table for storing the discovered nodes as the algorithm runs
CREATE TABLE #Discovered
(
DiscoveredUser varchar(50) NOT NULL, -- The Node Id
Predecessor varchar(50) NULL, -- The node we came from to get to this node.
LinkStrength decimal(10,7) NULL, -- positive - from predecessor to DiscoveredUser, negative - from DiscoveredUser to predecessor
OrderDiscovered int -- The order in which the nodes were discovered.
)

-- Initially, only the start node is discovered.
INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
VALUES (@StartNode, NULL, NULL, 0)

-- Add all nodes that we can get to from the current set of nodes,
-- that are not already discovered. Run until no more nodes are discovered.
WHILE @@ROWCOUNT > 0
BEGIN
-- If we have found the node we were looking for, abort now.
IF @EndNode IS NOT NULL
IF EXISTS (SELECT TOP 1 1 FROM #Discovered WHERE DiscoveredUser = @EndNode)
BREAK

-- We need to group by ToNode and select one FromNode since multiple
-- edges could lead us to new same node, and we only want to insert it once.
INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
(SELECT mc.called_party, mc.calling_party, mc.link_strength, d.OrderDiscovered + 1
FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.calling_party
WHERE mc.called_party NOT IN (SELECT DiscoveredUser From #Discovered) AND mc.link_strength > @LinkStrength
UNION
SELECT mc.calling_party, mc.called_party, mc.link_strength * (-1), d.OrderDiscovered + 1
FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.called_party
WHERE mc.calling_party NOT IN (SELECT DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength
)
END;

-- Select the results. We use a recursive common table expression to
-- get the full path from the start node to the current node.
WITH BacktraceCTE(Predecessor, DiscoveredUser, LinkStrength, OrderDiscovered, Path)
AS
(
-- Anchor/base member of the recursion, this selects the start node.
SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
CAST(n. DiscoveredUser AS varchar(MAX))
FROM #Discovered d JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
WHERE d. DiscoveredUser = @StartNode

UNION ALL

-- Recursive member, select all the nodes which have the previous
-- one as their predecessor. Concat the paths together.
SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
CAST(cte.Path + ',' + CAST(n. DiscoveredUser as varchar(30)) as varchar(MAX))
FROM #Discovered d JOIN BacktraceCTE cte ON d.Predecessor = cte. DiscoveredUser
JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
)

SELECT Predecessor, DiscoveredUser, LinkStrength, OrderDiscovered, Path FROM BacktraceCTE
WHERE DiscoveredUser = @EndNode OR @EndNode IS NULL -- This kind of where clause can potentially produce
ORDER BY OrderDiscovered -- a bad execution plan, but I use it for simplicity here.

DROP TABLE #Discovered
COMMIT TRAN
RETURN 0
END

我当前正在分析的图表(社交网络)有 28M 连接,并且在不忽略弱连接(使用 @LinkStrength 设置阈值)的情况下,执行运行了很长时间(到目前为止,我还没有得到任何结果,并将尝试让它运行过夜)。

无论如何,下一步是计算两个用户(大约有 300 万用户)之间的最短(最强)链接。我正在考虑使用 Djikstra 算法,但不确定是否可以在我当前使用的 PC(Quad CPU 2.66 GHz,4GB RAM)上分析此类网络,并且数据存储在 MS SQL Server 2008 数据库中。

总而言之,我希望获得以下问题的一些答案/建议:

  1. 是否可以执行与上面的查询一样复杂的查询描述图(28M 连接,3M用户)在所描述的机器上(2.66 GHz,4GB RAM)?
  2. 如果不可能的话,有什么办法吗?其他可能的方法执行时间可以减少(例如,创建带有部分内容的表结果)?
  3. 您有其他推荐吗检测簇的算法并计算最短路径所描述的图表?

谢谢!

最佳答案

首先,使用索引

其次,您需要减少内存需求。这意味着首先为 VARCHAR(50) 提供一个短别名,例如 4 字节而不是 50 字节的 int。这将使您的速度提高 10 倍。

declare @tmpPeople table(
ixPerson int identity primary key,
UserNodeID varchar(50),
unique(UserNodeID, ix) -- this creates an index
)
Insert @tmpPeople(UserNodeID) select UserNodeID from NormalPeopleTable
declare @relationships table(
ixParent int,
ixChild int,
unique(ixParent, ixChild),
unique(ixChild, ixParent)
)
insert @relationships(ixParent, ixChild)
select distinct p.ixPerson, c.ixPerson
from NormalRelationshipsTable nr
inner join @tmpPeople p on p.UserNodeID = nr.ParentUserNodeID
inner join @tmpPeople c on c.UserNodeID = nr.ChildUserNodeID

-- OK now got a copy of all relationships, but it is a fraction of the size
-- because we are using integers for the keys.
-- if we need to we can insert the reverse relationships too.

您需要编写一个执行您想要的操作的查询,而不是“通用”的查询。

如果你想找到两个节点之间的最短距离,可以通过从两端同时搜索来减少搜索时间。

declare @p1 table(
ix int identity primary key,
ixParent int,
ixChild int,
nDeep int,
-- Need indexes
unique(ixParent, ixChild, nDeep, ix),
unique(ixChild, ixParent, nDeep, ix)
)
-- That's now indexed both ways.
-- If you only need one, you can comment the other out.
-- define @p2 the same

insert @p1 (ixParent, ixChild, nDeep) select @ixUserFrom, @ixUserFrom, 0
insert @p2 ..... @ixUserTo, @ixUserTo, 0

-- Breadth first goes like this.
-- Just keep repeating it till you get as far as you want.
insert @p1 (ixParent, ixChild, nDeep)
select
p1.ixChild, r.ixChild, p1.nDeep+1
from @p1 p1 inner join @relationships r on r.ixParent = p1.ixChild
-- may want to exclude those already in the table
where not exists (
select 1 from @p1 p1
where p1.ixParent = p.ixChild and p1.ixChild = r.ixChild
)

对于“从 Alice 到 Bob 的距离”,您并行执行两次搜索,并在 Alice 的搜索包含 Bob 的搜索中也包含的任何人时停止。这还会将您的时间缩短 n^2 倍,其中 n 是平均连接数。

如果深度变得太多,不要忘记停止。

关于sql - 分析大图 - 检索簇并计算最强路径,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/4124441/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com