
neo4j - Finding the most frequently used sets of distinct terms

Reposted. Author: 行者123. Updated: 2023-12-02 00:08:14

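Imagine a graph database made up of URLs and the tags used to describe them. From this, we want to find which groups of tags are most often used together, and determine which URLs belong to each identified group.

I tried to create a dataset that simplifies the problem, expressed in Cypher: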

CREATE (tech:Tag { name: "tech" }),
(comp:Tag { name: "computers" }),
(programming:Tag { name: "programming" }),
(cat:Tag { name: "cats" }),
(mice:Tag { name: "mice" }),
(u1:Url { name: "http://u1.com" })-[:IS_ABOUT]->(tech),
(u1)-[:IS_ABOUT]->(comp),
(u1)-[:IS_ABOUT]->(mice),
(u2:Url { name: "http://u2.com" })-[:IS_ABOUT]->(mice),
(u2)-[:IS_ABOUT]->(cat),
(u3:Url { name: "http://u3.com" })-[:IS_ABOUT]->(tech),
(u3)-[:IS_ABOUT]->(programming),
(u4:Url { name: "http://u4.com" })-[:IS_ABOUT]->(tech),
(u4)-[:IS_ABOUT]->(mice),
(u4)-[:IS_ABOUT]->(acc:Tag { name: "accessories" })

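Using this as a reference (neo4j console example), we can look at it and visually identify that the most commonly used tags are tech and mice (the query for this is trivial), each referenced by 3 URLs. The most commonly used tag pair is [tech, mice], because in this example it is the only pair shared by 2 URLs (u4 and u1). Note that this pair is a subset of the tags on each matching URL, not the entire tag set of either one. No combination of 3 tags is shared by more than one URL.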
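As a sanity check on the counts above, the sample data can be modeled in plain Python. This is only a sketch: the `url_tags` dictionary is a hand-copied stand-in for the graph (an assumption made for illustration), not something read from Neo4j.

```python
from collections import Counter
from itertools import combinations

# Hand-copied from the CREATE statement above: URL -> set of tag names.
url_tags = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}

# Number of URLs that use each single tag.
tag_freq = Counter(tag for tags in url_tags.values() for tag in tags)

# Number of URLs that share each unordered pair of tags.
# Sorting the tags first gives every pair one canonical order.
pair_freq = Counter(
    pair
    for tags in url_tags.values()
    for pair in combinations(sorted(tags), 2)
)

print(tag_freq["tech"], tag_freq["mice"])
print(pair_freq.most_common(1))
```

Both single tags come out at 3 URLs, and ("mice", "tech") is the only pair that reaches 2, which matches the visual reading of the graph.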

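How can I write a Cypher query that identifies which tag combinations are most often used together (either pairs or groups of size N)? Is there perhaps a better way to structure this data that would make the analysis easier? Or is this problem not well suited to a graph database? I have been struggling with this for a while, so any help or ideas would be greatly appreciated!

Best answer

This looks like a combinatorics problem.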

// The tags for each URL, sorted by ID
MATCH (U:Url)-[:IS_ABOUT]->(T:Tag)
WITH U, T ORDER BY id(T)
WITH U,
collect(distinct T) as TAGS

// Calculate the number of tag combinations for a node,
// independent of the order of tags.
// 2^n is computed here with exp and log; round() guards against
// floating-point error (Cypher's ^ operator could be used instead).
//
WITH U, TAGS,
toInt(round(exp(log(2) * size(TAGS)))) as numberOfCombinations

// Iterate through all non-empty combinations (bitmasks 1 .. 2^n - 1)
UNWIND RANGE(1, numberOfCombinations - 1) as combinationIndex
WITH U, TAGS, combinationIndex

// And check for each tag whether it is present in the combination.
// Cypher has no bitwise operators,
// so we use APOC
// https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_bitwise_operations
// Note: on newer Neo4j/APOC versions, toInt is toInteger and
// apoc.bitwise.op is a function rather than a procedure.
//
UNWIND RANGE(0, size(TAGS)-1) as tagIndex
WITH U, TAGS, combinationIndex, tagIndex,
toInt(round(exp(log(2) * tagIndex))) as pw2
call apoc.bitwise.op(combinationIndex, "&", pw2) YIELD value
WITH U, TAGS, combinationIndex, tagIndex,
value WHERE value > 0

// Get all combinations of tags for URL
WITH U, TAGS, combinationIndex,
collect(TAGS[tagIndex]) as combination

// Return all the possible combinations of tags, sorted by frequency of use
RETURN combination, count(combination) as freq, collect(U) as urls
ORDER BY freq DESC
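The bitmask idea behind the query above may be easier to follow outside Cypher. The following is a minimal Python sketch of the same enumeration, not a translation of the answer's query: `url_tags` is a hand-copied model of the sample data and `tag_combinations` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-copied model of the sample graph: URL -> set of tag names.
url_tags = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}

def tag_combinations(tags):
    """Yield every non-empty subset of tags as a sorted tuple.

    Each integer mask in 1 .. 2**n - 1 selects one subset:
    bit i of the mask decides whether the i-th tag is included.
    """
    ordered = sorted(tags)  # fixed order, so equal subsets compare equal
    n = len(ordered)
    for mask in range(1, 1 << n):
        yield tuple(ordered[i] for i in range(n) if mask & (1 << i))

# combination -> list of URLs whose tag set contains that combination
combo_urls = defaultdict(list)
for url, tags in url_tags.items():
    for combo in tag_combinations(tags):
        combo_urls[combo].append(url)

# Most frequent combinations first; ties broken by the combination itself.
ranked = sorted(combo_urls.items(), key=lambda kv: (-len(kv[1]), kv[0]))
for combo, urls in ranked[:3]:
    print(combo, len(urls), urls)
```

On the sample data the top entries are the single tags mice and tech with 3 URLs each, followed by the pair (mice, tech) with 2. A URL with n tags contributes 2^n - 1 combinations, so this approach only stays practical while the number of tags per URL is small.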

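I think it would be better to compute and store the tag combinations with this algorithm at tagging time. The query would then look like this: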

MATCH (Comb:TagsCombination)<-[:IS_ABOUT]-(U:Url)
WITH Comb, collect(U) as urls, count(U) as freq
MATCH (Comb)-[:CONTAIN]->(T:Tag)
RETURN Comb, collect(T) as Tags, urls, freq ORDER BY freq DESC
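To illustrate the "compute at tagging time" idea, here is a rough Python sketch under stated assumptions: `combination_index` plays the role of the stored TagsCombination nodes and `register_url` is a hypothetical hook that would run whenever a URL is tagged. Neither name comes from the original answer.

```python
from collections import defaultdict
from itertools import combinations

# combination (sorted tuple of tag names) -> set of URLs carrying all those tags
combination_index = defaultdict(set)

def register_url(url, tags):
    """Record url under every non-empty subset of its tags (2**n - 1 entries)."""
    ordered = sorted(tags)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            combination_index[combo].add(url)

# Tagging two of the sample URLs.
register_url("http://u1.com", ["tech", "computers", "mice"])
register_url("http://u4.com", ["tech", "mice", "accessories"])

# Reading the ranking is now a sort over stored entries,
# with no subset enumeration at query time.
ranking = sorted(combination_index.items(), key=lambda kv: (-len(kv[1]), kv[0]))
print(ranking[0][0], len(ranking[0][1]))
```

The trade-off is write cost and storage: every tagged URL adds up to 2^n - 1 entries, in exchange for a cheap read.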

Regarding "neo4j - Finding the most frequently used sets of distinct terms", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39518602/
