python - Longest common substring without cutting a word


Reposted by: 太空宇宙 · Updated: 2023-11-03 12:24:16


Given the following, I can find the longest common substring:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  # m[x][y] is the length of the common run ending at s1[x - 1] and s2[y - 1]
  m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in range(1, 1 + len(s1)):
    for y in range(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          # remember the length of the best run and where it ends in s1
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print(longest_common_substring(s1, s2))

[out]:

foo bar

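But how can I make sure that the longest common substring respects English word boundaries and does not cut a word in half? Take the following sentences, for example: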

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_substring(s1, s2))

This outputs the following, which is not what I want, because it cuts into the word kappa in s2:

a foo bar

The desired output is still:

foo bar

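I have also tried an ngram approach to get the longest common substring that respects word boundaries, but is there another way to handle the strings without computing ngrams? (see answer)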
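The question does not show the ngram attempt, so the following is only one possible reading of it, written for Python 3. The helper names `word_ngrams` and `longest_common_ngram` are hypothetical and not taken from the original post. The sketch tries word n-grams from the longest possible length downwards and returns the first one found in both sentences.

```python
def word_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def longest_common_ngram(s1, s2):
    """Return the longest run of whole words shared by s1 and s2."""
    w1, w2 = s1.split(), s2.split()
    # Try the longest candidate length first, then shrink.
    for n in range(min(len(w1), len(w2)), 0, -1):
        seen = set(word_ngrams(w2, n))
        for gram in word_ngrams(w1, n):
            if gram in seen:
                return ' '.join(gram)
    return ''


s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_ngram(s1, s2))  # foo bar
```

Because it rebuilds the n-gram set for every length, this does more work than a single dynamic-programming pass over the word lists, which is presumably why the question asks for an alternative.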

Best answer

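This is almost too simple to need explaining. Your code already does 75% of the job. I first split each sentence into words and then pass the word lists to your function to get the longest common substring (in this case, the longest run of consecutive words). Your function then returns ['foo', 'bar'], and I join the elements of that list to produce the desired result.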

Here is a working copy online for you to test, verify, and fiddle with.

http://repl.it/RU0/1

def longest_common_substring(s1, s2):
  # Works on any sequence: characters of a string or items of a list of words.
  m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in range(1, 1 + len(s1)):
    for y in range(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    # Compare whole words instead of characters so no word is ever split.
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')
    return ' '.join(longest_common_substring(s1_words, s2_words))


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print(common_sentence)
# prints: foo bar
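As an aside that is not part of the original answer, the same word-level idea can be sketched with the standard library's `difflib.SequenceMatcher`, which accepts lists of words as well as strings. The function name `longest_common_words` is a hypothetical one chosen for this illustration, and the sketch assumes Python 3.

```python
from difflib import SequenceMatcher


def longest_common_words(s1, s2):
    """Longest run of consecutive whole words shared by s1 and s2."""
    a, b = s1.split(), s2.split()
    # autojunk=False turns off the heuristic that ignores very frequent items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return ' '.join(a[match.a:match.a + match.size])


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
print(longest_common_words(s1, s2))  # foo bar
```

This replaces the hand-written table with a library call, at the cost of hiding how the match is found.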

Edge cases

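'.' and '?' are also treated as valid words if there is a space between the last word and the punctuation mark, as in your case. If you do not leave a space, they are counted as part of the last word. In that case 'sheep' and 'sheep?' would no longer be the same word. It is up to you to decide what to do with such characters before calling the function. For example:
import re
s1 = re.sub('[.?]','', s1)
s2 = re.sub('[.?]','', s2)

and then continue as usual.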
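To make the edge case concrete, here is a minimal self-contained Python 3 sketch that strips '.' and '?' before splitting into words. The names `longest_common_run` and `common_phrase`, the `strip_punct` flag, and the two example sentences are assumptions made for this illustration and do not come from the original answer.

```python
import re


def longest_common_run(a, b):
    """Longest run of consecutive items shared by sequences a and b."""
    prev = [0] * (len(b) + 1)
    best_len, best_end = 0, 0
    for x in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for y in range(1, len(b) + 1):
            if a[x - 1] == b[y - 1]:
                # Extend the run that ended at the previous pair of items.
                curr[y] = prev[y - 1] + 1
                if curr[y] > best_len:
                    best_len, best_end = curr[y], x
        prev = curr
    return a[best_end - best_len:best_end]


def common_phrase(s1, s2, strip_punct=True):
    """Word-level longest common run, optionally ignoring '.' and '?'."""
    if strip_punct:
        s1 = re.sub(r'[.?]', '', s1)
        s2 = re.sub(r'[.?]', '', s2)
    return ' '.join(longest_common_run(s1.split(), s2.split()))


s1 = 'this is a foo bar sentence.'
s2 = 'what a kappa foo bar sentence?'
print(common_phrase(s1, s2, strip_punct=False))  # foo bar
print(common_phrase(s1, s2))                     # foo bar sentence
```

With the punctuation left attached, 'sentence.' and 'sentence?' count as different words, so the match stops at 'foo bar'. Stripping the marks first lets the run extend through 'sentence'.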

Regarding "python - Longest common substring without cutting a word", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22726177/
