gpt4 book ai didi

string - 衡量两个字符串中出现的大小 >=2 的子序列数的指标

转载 作者:塔克拉玛干 更新时间:2023-11-03 03:41:10 24 4
gpt4 key购买 nike

假设我有 2 个长度相等的字符串。我正在寻找一个指标来衡量两个字符串在大小 >=2 的子序列数量方面有多少密切相关。

例如,假设 x="ABCDEF"y="DEFABC"。在这两个字符串中,子序列"ABC""DEF""AB""BC""DE""EF" 都是 size >=2 的内部子序列,出现在两个字符串中。

是否有任何指标或非朴素算法可以衡量这些模式在两个字符串中出现的次数?

最佳答案

您可以为此尝试使用后缀树/后缀数组

https://en.wikipedia.org/wiki/Suffix_array

https://en.wikipedia.org/wiki/Suffix_tree

对于在后缀数组实现的情况下问题中提供的示例,您将拥有

ABCDEF -> $  
ABCDEF$
BCDEF$
CDEF$
DEF$
EF$
F$

($ marks end of string)

后缀数组,它允许您使用二进制搜索 其他字符串的n-grams。让我们在后缀数组上测试 "DEFABC" 的 n-gram:

 "DE"   - found (DEF$), countinue to longer n-gram (3-gram)
"DEF" - found (DEF$), countinue to longer n-gram (4-gram)
"DEFA" - failed, continue to next 2-gram
"EF" - found (EF$), countinue to longer n-gram (3-gram)
"EFA" - failed, continue to next 2-gram
"FA" - failed, continue to next 2-gram
"AB" - found (ABCDEF$), countinue to longer n-gram (3-gram)
"ABC" - found (ABCDEF$), end of string reached, continue to next 2-gram
"BC" - found (BCDEF$), end of string reached
找到

6 个常见的 n-gram (n >= 2)。

到目前为止,对于长度为 N 的字符串,我们已经花费了 O(N*log(N)) 来创建后缀数组O(N*log(N)) 扫描 n-grams (O(N) 扫描,O (log(N)) 查看后缀数组); 后缀树 更难实现,只需要创建 O(N)

在一般情况下,当您检查大小为 MN 的字符串时,复杂度将为 O(M*log(M) + N*log( M)) 后缀数组O(M) + N*log(M) 后缀树 .

示例 C# 代码

  string left = "ABCDEF";
string right = "DEFABCD";
int n = 2;

// Very memory inefficient implementation, wants about N*N bytes per string
// just to demo
string[] array = Enumerable
.Range(0, left.Length) // all possible ...
.Select(index => left.Substring(index)) // ... suffixes which are ...
.OrderBy(suffix => suffix) // ... ordered as ...
.ToArray(); // ... array

int count = 0;

for (int i = 0; i <= right.Length - n; ++i) {
for (int length = n; length <= right.Length - i; ++length) {
string toFind = right.Substring(i, length);

int index = Array.BinarySearch(array, toFind);

if (index >= 0) { // exact match
if (toFind != array[index])
break;
}
else { // prefix match
int idx = ~index;

if (!(idx < left.Length && array[idx].StartsWith(toFind)) ||
(idx >= 1 && array[idx - 1].StartsWith(toFind)))
break;
}

count += 1;
}
}

Console.Write(count); // the output is "6"

关于string - 衡量两个字符串中出现的大小 >=2 的子序列数的指标,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/41765004/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com