gpt4 book ai didi

database - 面对插入和删除更新 Aho-Corasick trie

转载 作者:搜寻专家 更新时间:2023-10-30 20:33:45 24 4
gpt4 key购买 nike

我发现的关于 Aho-Corasick 的所有文献和实现都是关于从一组短语预先构建整个 trie 的。但是,我对将其作为可变数据结构使用的方法很感兴趣,它可以处理偶尔的添加和删除,而无需重建整个 trie(假设其中有 100 万个条目)。如果最坏的情况很糟糕也没关系,只要平均情况接近对数。

根据我的理解,每个节点的失败状态是另一个使用相同符号的节点。因此,如果我们有一个从每个符号到使用该符号的节点列表的哈希多重映射,我们就有了失败状态需要更新的候选对象。

删除很简单。您找到使用已删除节点作为故障状态的所有其他节点并重新计算它们的故障状态。从字符串的末尾向后移动,树应该仍然处于良好状态。

添加有点棘手。任何未能达到该符号的节点都可以将新节点作为更好的候选节点。然而,再次似乎遍历具有该符号的所有其他节点并完全重新计算其失败状态。

换句话说,如果我们要添加或删除一个带有符号“A”的节点,我们需要访问 trie 中任何其他“A”节点,并重新计算失败状态(寻找其最近的祖先为“A"作为 child ,或根)。这将需要访问符号“A”的每个节点,但在我的例子中,这将比访问 trie 中的每个节点少几个数量级。

该算法是否有效,还是我遗漏了一些明显的东西?

最佳答案

我继续实现它并且它似乎正在工作。算法的更好描述,包括图片:https://burningmime.gitlab.io/setmatch/implementation-overview.html#supporting-add-remove-in-an-aho-corasick-trie

为了后代(以及根据 StackOverflow 政策),我已将其复制到此处:

It supports modification (add/remove) instead of being built all at once from a pre-set dictionary. This isn't mentioned in any literature about it, and all the open-source implementations I've seen of it have followed in that respect. It turns out that supporting add and remove operations to an AC automaton is fairly trivial. For deep tries over a small alphabet, it would not be very time-efficient, but luckily we have a shallow trie over a large alphabet.

We store a hash multimap of each token to every node that uses that token. When we remove a phrase, we start from the last (bottom-most) node. We remove the pointer to the phrase from this bottom-most node. Then, if it has any children, we can't delete any more. Otherwise, go through each other node that uses this node as a fail state and recompute its fail state. Finally, delete the node, then go up to the node's parent and do the same process. We'll eventually reach either a node that has another phrase output, a child, or the root node.

That's kind of difficult to picture, so consider this trie (stolen from the Wikipedia article). We're going to remove the string caa (light gray color), which will require that the fail states in yellow be updated:

trie-before

The result:

trie-after

Note there are cases where a node's fail state might be updated 2 or more times in the same remove operation, but will eventually be correct. This could be avoided by removing everything first, then going back through tokens, but the code is complicated enough as it is, and that would only provide a performance benefit in certain awkward circumstances, while being a performance hit in the average case. We just assume that strings containing the same token more than once are rare.

Adding a node is slightly more difficult, but can make use of the same hash multimap structure. Each time a node is added, every other node in the trie that uses the same token and is at a lower depth than the added node could need to have its fail state updated to the new node.

To help picture that, imagine going from the second diagram back to the first diagram. The same two fail states are the ones who would need to be updated if adding the string caa back into that trie, and since we recompute every c and a node, it's not a problem.

We could keep track of the node depths to skip over about half, but right now it's just recomputing the fail state for every node that shares the token to save memory. This means that tries where the tokens appear many times will be slow to add to, but again that's considered a rare enough situation. It would be a concern if you were trying to adapt this technique to something like a DNA sequence or antivirus database where there is a small alphabet.

The time complexity depends on how many nodes share symbols and how deep the trie is, meaning it's worst-case O(N), but in practice a fairly small number (average-case is roughly O(log^2 N) ).

难以置信的困惑和大部分未注释的代码,但如果你好奇,这里是 the header , the body , 和 some tests

关于database - 面对插入和删除更新 Aho-Corasick trie,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/53288664/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com