
text - How can I find a set of keywords in a document where all or some of the keywords lie within a given distance cutoff?

Reposted. Author: 行者123. Updated: 2023-12-05 00:58:15

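I have a set of about 10 keywords. I want to search a very long document and check whether this set of keywords occurs in it. I care not only about whether the keywords are present or absent, but also about whether all of them, or some subset of them, lie within a distance cutoff of, for example, 3 sentences or 30 words, or any other proximity measure. How can this be done? My only idea so far is to write some Python code that finds one of the keywords and then checks whether the other keywords appear within about 3 lines of text around it. But that would take a lot of computing power and would be inefficient.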

Best answer

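To determine whether a set of keywords occurs within a given distance inside a larger document, you can use a sliding window whose length equals that distance and move it across the whole document. As you move the window, keep track of each word that falls into and out of it. If the window ever contains all of the keywords, the condition is satisfied. This approach has a time complexity of O(len(document)) and a memory complexity of O(len(window)).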

Here is a sample implementation of the approach above in Python:

from collections import defaultdict

def isInProximityWindow(doc, keywords, windowLen):
    words = doc.split()
    wordsLen = len(words)
    # The window cannot be longer than the document itself.
    if windowLen > wordsLen:
        windowLen = wordsLen

    keywordsLen = len(keywords)
    allKeywordLocs = defaultdict(set)  # keyword -> positions currently inside the window
    numKeywordsInWindow = 0            # number of distinct keywords inside the window
    locKeyword = {}                    # position -> keyword found at that position
    for i in range(wordsLen):
        windowContents = sorted([k for k in allKeywordLocs.keys() if allKeywordLocs[k]])
        print("On beginning of iteration #%i, window contains '%s'" % (i, ','.join(windowContents)))

        # Drop the word that slides out of the left edge of the window.
        oldKeyword = locKeyword.pop(i - windowLen, None)
        if oldKeyword:
            keywordLocs = allKeywordLocs[oldKeyword]
            keywordLocs.remove(i - windowLen)
            if not keywordLocs:
                print("'%s' fell out of window" % oldKeyword)
                numKeywordsInWindow -= 1

        # Take in the word that enters at the right edge of the window.
        word = words[i]
        print("Next word is '%s'" % word)
        if word in keywords:
            locKeyword[i] = word
            keywordLocs = allKeywordLocs[word]
            if not keywordLocs:
                print("'%s' fell in window" % word)
                numKeywordsInWindow += 1
                if numKeywordsInWindow == keywordsLen:
                    return True
            keywordLocs.add(i)
    return False

Sample output:
>>> isInProximityWindow("the brown cow jumped over the moon and the red fox jumped over the black dog", {"fox", "over", "the"}, 4)
On beginning of iteration #0, window contains ''
Next word is 'the'
'the' fell in window
On beginning of iteration #1, window contains 'the'
Next word is 'brown'
On beginning of iteration #2, window contains 'the'
Next word is 'cow'
On beginning of iteration #3, window contains 'the'
Next word is 'jumped'
On beginning of iteration #4, window contains 'the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #5, window contains 'over'
Next word is 'the'
'the' fell in window
On beginning of iteration #6, window contains 'over,the'
Next word is 'moon'
On beginning of iteration #7, window contains 'over,the'
Next word is 'and'
On beginning of iteration #8, window contains 'over,the'
'over' fell out of window
Next word is 'the'
On beginning of iteration #9, window contains 'the'
Next word is 'red'
On beginning of iteration #10, window contains 'the'
Next word is 'fox'
'fox' fell in window
On beginning of iteration #11, window contains 'fox,the'
Next word is 'jumped'
On beginning of iteration #12, window contains 'fox,the'
'the' fell out of window
Next word is 'over'
'over' fell in window
On beginning of iteration #13, window contains 'fox,over'
Next word is 'the'
'the' fell in window
True
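
The question also asks about the case where only some of the keywords need to be close together, which the implementation above does not cover. The following is a minimal sketch of the same sliding-window idea with that extension, not part of the original answer. The function name `keywords_within_window` and the `min_hits` parameter are hypothetical names chosen for illustration, and the input is assumed to be an already tokenized list of words.

```python
from collections import Counter, deque


def keywords_within_window(words, keywords, window_len, min_hits=None):
    """Return True if some run of `window_len` consecutive words contains
    at least `min_hits` distinct keywords (default: all of them)."""
    keywords = set(keywords)
    if min_hits is None:
        min_hits = len(keywords)
    if min_hits <= 0:
        return True
    if window_len <= 0:
        return False

    window = deque()    # one entry per word in the window: the keyword, or None
    counts = Counter()  # occurrences of each keyword inside the window
    distinct = 0        # number of distinct keywords inside the window
    for word in words:
        # Slide the left edge once the window is full.
        if len(window) == window_len:
            old = window.popleft()
            if old is not None:
                counts[old] -= 1
                if counts[old] == 0:
                    distinct -= 1
        # Take in the new word at the right edge.
        hit = word if word in keywords else None
        window.append(hit)
        if hit is not None:
            counts[hit] += 1
            if counts[hit] == 1:
                distinct += 1
                if distinct >= min_hits:
                    return True
    return False
```

Like the original, this sketch compares tokens exactly, so "Fox" and "fox," would not match "fox". Lower-casing and stripping punctuation before the call is left to the caller.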

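Regarding "text - How can I find a set of keywords in a document where all or some of the keywords lie within a given distance cutoff?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33142077/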
