
python - How do I recursively generate multi-word terms?

Reposted. Author: 太空狗. Updated: 2023-10-30 00:23:52

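Say I have a string of words: 'a b c d e f'. I want to generate a list of multi-word terms from this string.

Word order matters. The term 'f e d' should not be generated from the example above.

Edit: Also, words should not be skipped. 'a c' or 'b d f' should not be generated.

What I have now: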

doc = 'a b c d e f'
terms = []
one_before = None
two_before = None
for word in doc.split(None):
    terms.append(word)
    if one_before:
        terms.append(' '.join([one_before, word]))
    if two_before:
        terms.append(' '.join([two_before, one_before, word]))
    two_before = one_before
    one_before = word

for term in terms:
    print(term)

This prints:

a
b
a b
c
b c
a b c
d
c d
b c d
e
d e
c d e
f
e f
d e f

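How would I make this a recursive function, so that I can pass it a variable maximum number of words per term?

Application:

I'll be using this to generate multi-word terms from the readable text in HTML documents. The overall goal is a latent semantic analysis of a large corpus (about two million documents). This is why keeping word order matters (natural language processing and so on).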

Best answer

This isn't recursive, but I think it does what you want.

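doc = 'a b c d e f'
words = doc.split(None)
max_words = 3  # named max_words rather than max, to avoid shadowing the built-in

for index in range(len(words)):
    for n in range(max_words):
        # Only emit a term if it does not run past the last word.
        if index + n < len(words):
            print(' '.join(words[index:index+n+1]))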
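If the terms need to be collected rather than printed, the same two nested loops can be wrapped in a generator. This is only a minimal sketch: the name iter_terms and its parameter names are hypothetical and do not come from the original answer.

```python
def iter_terms(words, max_words_per_term):
    """Yield every run of 1..max_words_per_term consecutive words, in order.

    iter_terms is a hypothetical name used for illustration only.
    """
    for start in range(len(words)):
        # A term may not be longer than the limit or run past the last word.
        stop = min(start + max_words_per_term, len(words))
        for end in range(start + 1, stop + 1):
            yield ' '.join(words[start:end])


terms = list(iter_terms('a b c d e f'.split(), 3))
print(terms[:6])  # ['a', 'a b', 'a b c', 'b', 'b c', 'b c d']
```

A generator avoids building the full list up front, which may help when the terms from each document are consumed one at a time.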

Here's a recursive solution:

def find_terms(words, max_words_per_term):
    if len(words) == 0:
        return []
    return [" ".join(words[:i+1]) for i in range(min(len(words), max_words_per_term))] + find_terms(words[1:], max_words_per_term)


doc = 'a b c d e f'
words = doc.split(None)
for term in find_terms(words, 3):
    print(term)

Here's the recursive function again, with some explaining variables and comments.

def find_terms(words, max_words_per_term):

    # If there are no words, you've reached the end. Stop.
    if len(words) == 0:
        return []

    # What's the max term length you could generate from the remaining
    # words? It's the lesser of max_words_per_term and how many words
    # you have left.
    max_term_len = min(len(words), max_words_per_term)

    # Find all the terms that start with the first word.
    initial_terms = [" ".join(words[:i+1]) for i in range(max_term_len)]

    # Here's the recursion. Find all of the terms in the list
    # of all but the first word.
    other_terms = find_terms(words[1:], max_words_per_term)

    # Now put the two lists of terms together to get the answer.
    return initial_terms + other_terms
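
One thing to keep in mind: find_terms recurses once per word and slices the list on every call, so a document with more words than Python's recursion limit (around 1000 frames by default in CPython) would fail. As a rough alternative, the one_before / two_before bookkeeping from the question can be generalized with a bounded window. The sketch below assumes that approach; window_terms is a hypothetical name, and unlike find_terms it groups terms by their last word, which is the order the question's loop printed.

```python
from collections import deque


def window_terms(words, max_words_per_term):
    """Return terms grouped by their last word, shortest first.

    window_terms is a hypothetical name used for illustration only.
    """
    terms = []
    # The deque keeps only the most recent max_words_per_term words.
    window = deque(maxlen=max_words_per_term)
    for word in words:
        window.append(word)
        recent = list(window)
        # Emit the terms that end at the current word: 1 word, 2 words, ...
        for size in range(1, len(recent) + 1):
            terms.append(' '.join(recent[-size:]))
    return terms


print(window_terms('a b c d e f'.split(), 3)[:6])
# ['a', 'b', 'a b', 'c', 'b c', 'a b c']
```

Both versions produce the same set of terms; only the order differs.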

This post covers python - How do I recursively generate multi-word terms? A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/702760/
