gpt4 book ai didi

ruby - 查找所有重复的非重叠子串和循环

转载 作者:塔克拉玛干 更新时间:2023-11-03 04:30:29 26 4
gpt4 key购买 nike

我手头有一个复杂的字符串操作问题。我有一个字符串,其中我将有循环,以及我需要识别和列出的重复。

'abcabcabcabcabcdkkabclilabcoabcdieabcdowabcdppabzabx'

以下是可能的模式->

Actual indexes not used

abc -> 0,3,6,9,12,15,17, ..... (occurence index for recurring string), 0,3,6,9 (unique_occurence index for recurring string, 12, 15, 17 disqualified as there abc was a part of longer repeating substring)

abcd -> 12, 15, 17 (occurence index for recurring string), 12, 15, 17 (unique occurence index for recurring string)

bcda -> 13, 16, 18.. (occurence index for recurring string), (unique occurence index for recurring string) as it is an overlap for the string abcd Hence it is something not required ab -> 0,3,6,9,12,15,17, 25, 27 ...(occurence index for recurring string), 25, 27(unique occurence index for recurring string). .....

我想找到所有唯一的重复出现/重复,即重复字符串的所有唯一、非重叠值。正如刚才提到的。并且输入字符串可能包含,

ALL cyclic patterns(abcabcabcdefdefdeflkjlkjlkj => abc, def, lkj 是循环中的重复,但是 bcabbcab 不是预期的,因为它们是误报的结果)或者

单独重复出现的模式(abcxabcdabcm => abc 是重复而不是循环,即它们不是相邻的)或者

两者的混合(abcabcabcabcabclkabcdokabcdhuabcd => abc是循环递归,abcd是非循环递归,我们需要找到both -> 只有abcd, abc 是重复的,不是bc, ab, bcda 等)

有人可以为这个问题陈述提出一个解决算法吗?我正在尝试使用 suffix_arrays,它也没有找到重叠的结果。

最佳答案

哈希被构造,其键由给定字符串的所有唯一子字符串组成,这些子字符串在字符串中至少出现两次(不重叠),对于每个键,值是字符串中所有偏移量的数组,其中值键(一个子串)开始。

代码

def recurring_substrings(str)
arr = str.chars
(1..str.size/2).each_with_object({}) do |n,h|
arr.each_cons(n).map { |b| b.join }.uniq.each do |s|
str.scan(Regexp.new(s)) { (h[s] ||= []) << Regexp.last_match.begin(0) }
end
end.reject { |_,v| v.size == 1 }
end

示例

recurring_substrings 'abjkabrjkab'
#=> {"a"=>[0, 4, 9], "b"=>[1, 5, 10], "j"=>[2, 7], "k"=>[3, 8], "ab"=>[0, 4, 9],
# "jk"=>[2, 7], "ka"=>[3, 8], "jka"=>[2, 7], "kab"=>[3, 8], "jkab"=>[2, 7]}

recurring_substrings "abcabcabcabcabcdkkabclilabcoabcdieabcdowabcdppabzabx"
#=> {"a"=>[0, 3, 6, 9, 12, 18, 24, 28, 34, 40, 46, 49],
# "b"=>[1, 4, 7, 10, 13, 19, 25, 29, 35, 41, 47, 50],
# "c"=>[2, 5, 8, 11, 14, 20, 26, 30, 36, 42], "d"=>[15, 31, 37, 43],
# "k"=>[16, 17], "l"=>[21, 23], "i"=>[22, 32], "o"=>[27, 38], "p"=>[44, 45],
# "ab"=>[0, 3, 6, 9, 12, 18, 24, 28, 34, 40, 46, 49],
# "bc"=>[1, 4, 7, 10, 13, 19, 25, 29, 35, 41], "ca"=>[2, 5, 8, 11],
# "cd"=>[14, 30, 36, 42],
# "abc"=>[0, 3, 6, 9, 12, 18, 24, 28, 34, 40], "bca"=>[1, 4, 7, 10],
# "cab"=>[2, 5, 8, 11], "bcd"=>[13, 29, 35, 41],
# "abca"=>[0, 6], "bcab"=>[1, 7], "cabc"=>[2, 8], "abcd"=>[12, 28, 34, 40],
# "abcab"=>[0, 6], "bcabc"=>[1, 7], "cabca"=>[2, 8],
# "abcabc"=>[0, 6], "bcabca"=>[1, 7], "cabcab"=>[2, 8]}

解释

对于上面的第一个例子,步骤如下。

str = 'abjkabrjkab'
arr = str.chars
#=> ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]
q = str.size/2 # max size for string to repeat at least once
#=> 5
b = (1..q).each_with_object({})
#=> #<Enumerator: 1..5:each_with_object({})>

我们可以通过将其转换为数组来查看此枚举器将生成​​哪些元素。 (我将在下面多做几次。)

b.to_a
#=> [[1, {}], [2, {}], [3, {}], [4, {}], [5, {}]]

随着计算的进行,空的哈希将被建立起来。

接下来将第一个元素传递给 block ,并使用并行赋值(有时称为多重赋值)将 block 变量设置给它。

n,h = b.next
#=> [1, {}]
n #=> 1
h #=> {}

c = arr.each_cons(n)
#=> #<Enumerator: ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]:each_cons(1)>

c是所有长度为 1 的子串的数组。在下一次迭代中,它将是所有长度为 2 的子串的数组,依此类推。参见 Emumerable#each_cons .

c.to_a # Let's see which elements will be generated.
#=> [["a"], ["b"], ["j"], ["k"], ["a"], ["b"], ["r"], ["j"], ["k"], ["a"], ["b"]]
d = c.map { |b| b.join }
#=> ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]
e = d.uniq
#=> ["a", "b", "j", "k", "r"]

在下一次迭代中这将是

r = arr.each_cons(2)
#=> #<Enumerator: ["a", "b", "j", "k", "a", "b", "r", "j", "k", "a", "b"]:
# each_cons(2)>
r.to_a
#=> [["a", "b"], ["b", "j"], ["j", "k"], ["k", "a"], ["a", "b"],
# ["b", "r"], ["r", "j"], ["j", "k"], ["k", "a"], ["a", "b"]]
s = r.map { |b| b.join }
#=> ["ab", "bj", "jk", "ka", "ab", "br", "rj", "jk", "ka", "ab"]
s.uniq
#=> ["ab", "bj", "jk", "ka", "br", "rj"]

继续,

f = e.each
#=> #<Enumerator: ["a", "b", "j", "k", "r"]:each>
f.to_a # Let's see which elements will be generated.
#=> ["a", "b", "j", "k", "r"]

s = f.next
#=> "a"
r = (Regexp.new(s))
#=> /a/
str.scan(r) { (h[s] ||= []) << Regexp.last_match.begin(0) }

如果h还没有 key s , h[s] #=> nil . h[s] ||= [] ,扩展为 h[s] = h[s] || [] , 转换 h[s]在执行 h[s] << Regexp.last_match.begin(0) 之前到一个空数组.即 h[s] = h[s] || [] #=> nil || [] #=> [] .

在 block 内 MatchData使用类方法检索对象 Regexp::last_match . (或者,可以用全局变量 $~ 代替 Regexp.last_match。有关详细信息,请在 Regexp 搜索“特殊全局变量”。) MatchData#begin返回 str 的索引当前匹配开始的位置。

现在

h #=> {"a"=>[0, 4, 9]} 

其余计算类似,将键值对加入h直到示例中给出的 has 被构建。

关于ruby - 查找所有重复的非重叠子串和循环,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/40301970/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com