
python - Loading very large RDF triples into iGraph -> fast vertex lookup?

Reposted. Author: 行者123. Updated: 2023-11-28 16:37:47

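I need to load a subset of the DBPedia graph into iGraph in order to compute some graph statistics (e.g. node centrality, ...). I use the Redlands libRDF Python library to load the DBPedia triples. Each node is associated with a URI (a unique identifier).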

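I am having some trouble loading the graph into iGraph. This is what I do:

1) Read a triple (subject, predicate, object)

2) Get or create a vertex (with attributes) using the following algorithm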

    def add_or_find_vertex(self, g, uri):
        # Return the vertex named `uri`, creating it first if it does not exist.
        try:
            return g.vs.find(name=uri)
        except (KeyError, ValueError):
            g.add_vertex(name=uri)
            return g.vs.find(name=uri)

    # Executed for every triple (subject, predicate, object):
    subjVertex = self.add_or_find_vertex(self.g, subject)
    objVertex = self.add_or_find_vertex(self.g, object)
    self.g.add_edge(subjVertex, objVertex, uri=predicate)

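The problem is that my script is very slow, and I need to load 25M triples. Each node is unique but appears many times in the triples file, so I have to perform a lookup before creating each edge. Can you tell me whether the "find" method uses an index (hash table, ...) for the lookup? What is the complexity of a vertex lookup? How would you do it?

Thank you very much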

Best answer

Already answered here. For completeness, I am copying my answer here as well:

Vertex lookups are usually O(|V|) since vertex attributes are not indexed by default - except the name vertex attribute, which is indexed. However g.vs.find is using this index only if you do this: g.vs.find(url) but not if you do this: g.vs.find(name=url). This is sort of a bug as the index could be used in both cases. Also see yesterday's thread from the mailing list.

However, note that igraph's data structures are optimized for static graphs, so g.add_vertex (and I presume you also use g.add_edge) could also be a bottleneck. Internally, igraph uses an indexed edge list to store the graph and the index has to be re-built every time you mutate the graph, so it is much more efficient to do vertex and edge additions in batches where possible.
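The batching advice can be sketched as follows. This is a minimal illustration, not the answerer's code: the helper names `index_triples` and `build_graph` are hypothetical, and the edge attribute name `predicate` is an assumption. A plain Python dict does the URI lookups, so igraph is only touched once for all vertices and once for all edges.

```python
def index_triples(triples):
    """Map each URI to a dense integer ID and collect edges as ID pairs."""
    ids = {}          # URI -> vertex ID; dict lookups are O(1) on average
    edges = []        # (source_id, target_id) pairs, in input order
    predicates = []   # predicate of each edge, aligned with `edges`
    for subj, pred, obj in triples:
        # setdefault assigns the next free ID only the first time a URI is seen
        s = ids.setdefault(subj, len(ids))
        o = ids.setdefault(obj, len(ids))
        edges.append((s, o))
        predicates.append(pred)
    # dicts preserve insertion order, so this list is ordered by vertex ID
    names = list(ids)
    return names, edges, predicates


def build_graph(names, edges, predicates):
    """Create the igraph graph with one batch call each for vertices and edges."""
    import igraph  # imported lazily so the indexing step has no dependencies

    g = igraph.Graph(directed=True)
    g.add_vertices(len(names))        # all vertices at once
    g.vs["name"] = names              # URI stored in the `name` attribute
    g.add_edges(edges)                # all edges at once
    g.es["predicate"] = predicates    # one attribute assignment for all edges
    return g
```

With 25M triples the intermediate lists may be large, so it may be worth feeding the edges in a few large chunks rather than in a single call. The point is only to avoid one mutation per triple.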

Since you already seem to have an iterator that yields the edges of your graph in (subject, predicate, object) form, maybe it's easier to use Graph.DictList to construct the graph at once because it also takes care of storing the vertex IDs in the name attribute, adding edges in batches where it makes sense, and also adding the predicate attribute from your triplets:

    >>> g = Graph.DictList(vertices=None, edges=({"source": subject,
    ...     "target": object, "predicate": predicate}
    ...     for subject, predicate, object in your_iterator))

Graph.DictList processes 100000 pre-generated random triplets in 1.63 seconds on my machine so I guess that improves things a little bit.
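To get a comparable number on your own machine, a rough timing sketch could look like this. The generator `random_triples`, the wrapper `time_dictlist`, and the `http://example.org/...` URIs are made up for illustration. The answer does not say how its random triplets were generated, so results will not match the figure above exactly.

```python
import random
import time


def random_triples(n_triples, n_nodes, seed=0):
    """Yield (subject, predicate, object) tuples with synthetic URIs."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    for _ in range(n_triples):
        subj = "http://example.org/node/%d" % rng.randrange(n_nodes)
        obj = "http://example.org/node/%d" % rng.randrange(n_nodes)
        pred = "http://example.org/pred/%d" % rng.randrange(10)
        yield subj, pred, obj


def time_dictlist(triples):
    """Build a graph with Graph.DictList and return (graph, elapsed seconds)."""
    from igraph import Graph  # imported lazily; requires python-igraph

    start = time.perf_counter()
    g = Graph.DictList(
        vertices=None,
        edges=({"source": s, "target": o, "predicate": p}
               for s, p, o in triples),
    )
    return g, time.perf_counter() - start
```

For example, `time_dictlist(list(random_triples(100000, 20000)))` times the construction on pre-generated triples. Note that `Graph.DictList` builds an undirected graph unless `directed=True` is passed, which matters for direction-sensitive statistics on RDF data.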

Regarding "python - Loading very large RDF triples into iGraph -> fast vertex lookup?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23597658/
