
python - How to create a searchable tree on Persian text?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 19:40:56

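I want to remove stop words from Persian text. I already have the stop word data, which is available at the link below. It seems to me that a pre-built tree of stop words could save a lot of time: I would look up every word of the text in this pre-built tree, delete the word from the text if it is in the tree, and keep it otherwise.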

The aim is to go from O(n * l) to O(n*log(l)).

These are my stop words

If you have a better suggestion than searching a pre-built tree, I would appreciate it if you shared it with me.
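To make the complexity figures in the question concrete, here is a minimal sketch that contrasts a linear scan with a binary search over a sorted stop word list. It assumes that n is the number of words in the text and l is the number of stop words, which the question does not state explicitly. The function names and the sample words are hypothetical and are not taken from the asker's data.

```python
from bisect import bisect_left

def is_stopword_linear(word, stopwords):
    # O(l) string comparisons per lookup: check every stop word in turn
    for s in stopwords:
        if s == word:
            return True
    return False

def is_stopword_sorted(word, sorted_stopwords):
    # O(log l) string comparisons per lookup: binary search in a sorted list
    i = bisect_left(sorted_stopwords, word)
    return i < len(sorted_stopwords) and sorted_stopwords[i] == word

# Hypothetical sample data, not the asker's actual stop word list
sample_stopwords = sorted(["از", "به", "که", "در"])

print(is_stopword_linear("از", sample_stopwords))
print(is_stopword_sorted("از", sample_stopwords))
print(is_stopword_sorted("کتاب", sample_stopwords))
```

Running either check once per word of the text gives the O(n * l) and O(n*log(l)) totals respectively, counted in string comparisons.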

Best answer

Here is an answer that uses a trie:

Reading the data:

# reading stopword data
import pandas as pd

stopwords = pd.read_csv('STOPWORDS', header=None)

The trie:

# creating the trie
class TrieNode:

    # Trie node class
    def __init__(self):
        # One slot per character index. 15000 slots cover the
        # Arabic/Persian block (U+0600 to U+06FF) and the zero-width
        # non-joiner (U+200C) after the offset used in _charToIndex.
        self.children = [None]*15000

        # isEndOfWord is True if the node represents the end of a word
        self.isEndOfWord = False

class Trie:

    # Trie data structure class
    def __init__(self):
        self.root = self.getNode()

    def getNode(self):

        # Returns a new trie node (children initialized to None)
        return TrieNode()

    def _charToIndex(self,ch):

        # private helper function
        # Converts the current character of the key into a list index,
        # using its code point offset from '!'

        return ord(ch)-ord('!')


    def insert(self,key):

        # If not present, inserts key into the trie
        # If the key is a prefix of an existing entry,
        # just marks its last node as the end of a word
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])

            # if the current character is not present, add a node for it
            if not pCrawl.children[index]:
                pCrawl.children[index] = self.getNode()
            pCrawl = pCrawl.children[index]

        # mark the last node as the end of a word
        pCrawl.isEndOfWord = True

    def search(self, key):

        # Searches for key in the trie
        # Returns True if key is present
        # in the trie, else False
        pCrawl = self.root
        length = len(key)
        for level in range(length):
            index = self._charToIndex(key[level])
            if not pCrawl.children[index]:
                return False
            pCrawl = pCrawl.children[index]

        return pCrawl is not None and pCrawl.isEndOfWord
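Because every node in the trie above allocates a 15000-slot list, memory use can grow quickly with the number of stop words. As a possible variation (not part of the original answer), the sketch below stores children in a dict keyed by character, so a node only pays for the characters it actually has. The names DictTrieNode and DictTrie are hypothetical, and the sample words are illustrative only.

```python
class DictTrieNode:
    def __init__(self):
        self.children = {}            # maps a character to its child node
        self.is_end_of_word = False   # True if a stored word ends here


class DictTrie:
    def __init__(self):
        self.root = DictTrieNode()

    def insert(self, key):
        # Walk down the trie, creating missing nodes along the way
        node = self.root
        for ch in key:
            if ch not in node.children:
                node.children[ch] = DictTrieNode()
            node = node.children[ch]
        node.is_end_of_word = True

    def search(self, key):
        # Follow the characters of key; fail as soon as one is missing
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end_of_word


# Hypothetical sample words, not the asker's stop word file
trie = DictTrie()
for word in ["از", "به", "برای"]:
    trie.insert(word)

print(trie.search("از"))
print(trie.search("بر"))
```

With a dict there is no index arithmetic, so any Unicode character can appear in a key without adjusting an array size or offset.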

Usage example:

# Input keys (the stop words in the first column of the data)
keys = list(stopwords.loc[:,0])

output = ["Not present in trie",
          "Present in trie"]

# Trie object
t = Trie()

# Construct trie
for key in keys:
    t.insert(key)


print("{} ---- {}".format("از",output[t.search("از")]))

Output:

از ---- Present in trie
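The example above only tests membership of a single word. For the removal step the question describes, one option is a plain Python set instead of a trie, which gives average constant-time lookups per word. This is a swapped-in technique rather than the answer's method, and the sketch assumes that splitting on whitespace is an acceptable way to tokenize the text. The function name and sample data are hypothetical.

```python
def remove_stopwords(text, stopword_set):
    # Split on whitespace and keep only the words that are not stop words.
    # Real Persian text may need a proper tokenizer instead of str.split().
    kept = [word for word in text.split() if word not in stopword_set]
    return " ".join(kept)


# Hypothetical sample data, not the asker's actual stop word list
sample_stopwords = {"از", "به", "که", "در"}
sample_text = "من از خانه به مدرسه رفتم"

print(remove_stopwords(sample_text, sample_stopwords))
```

With the trie from the answer, the same function would work by replacing the set membership test with a call to `t.search(word)`.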

Regarding python - How to create a searchable tree on Persian text?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60429159/
