
python - How can I make this code run faster? (searching a large corpus for long words)


In Python I wrote a text generator that works from a few parameters, but the code is, most of the time, slow and performs below my expectations. I expect a sentence every 3-4 minutes, but it fails to keep up when the corpus it works on is large. I am using the 18-book corpus from Project Gutenberg, and I will build my own custom corpus and add more books, so performance is vital. The algorithm and the implementation are below:

Algorithm

1 - Enter the trigger sentence (entered only once, at the start of the program)

2 - Get the longest word in the trigger sentence

3 - Find all sentences in the corpus that contain the word from step 2

4 - Pick one of those sentences at random

5 - Get the sentence that follows the one chosen in step 4 (named sentA to resolve ambiguity in the description), provided that sentA is longer than 40 characters

6 - Go to step 2; the trigger sentence is now sentA from step 5

Implementation

from nltk.corpus import gutenberg
from random import choice

triggerSentence = raw_input("Please enter the trigger sentence:")#get input sentence from user

previousLongestWord = ""

listOfSents = gutenberg.sents()
listOfWords = gutenberg.words()
corpusSentences = [] #all sentences in the related corpus

sentenceAppender = ""

longestWord = ""

#this function is not mine, code courtesy of Dave Kirby, found on the internet about sorting list without duplication speed tricks
def arraySorter(seq):
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]


def findLongestWord(longestWord):
    if(listOfWords.count(longestWord) == 1 or longestWord.upper() == previousLongestWord.upper()):
        longestWord = sortedSetOfValidWords[-2]
        if(listOfWords.count(longestWord) == 1):
            longestWord = sortedSetOfValidWords[-3]


doappend = corpusSentences.append

def appending():

    for mysentence in listOfSents: #sentences are organized into array so they can actually be read word by word.
        sentenceAppender = " ".join(mysentence)
        doappend(sentenceAppender)


appending()
sentencesContainingLongestWord = []

def getSentence(longestWord, sentencesContainingLongestWord):

    for sentence in corpusSentences:
        if sentence.count(longestWord):#if the sentence contains the longest target string, push it into the sentencesContainingLongestWord list
            sentencesContainingLongestWord.append(sentence)


def lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord):

    while(len(corpusSentences[sentenceIndex + 1]) < 40):#in case the next sentence is shorter than 40 characters, pick another trigger sentence
        sentencesContainingLongestWord.remove(triggerSentence)
        triggerSentence = choice(sentencesContainingLongestWord)
        sentenceIndex = corpusSentences.index(triggerSentence)

while len(triggerSentence) > 0: #run the loop as long as you get a trigger sentence

    sentencesContainingLongestWord = []#all the sentences that include the longest word are to be inserted into this set

    setOfValidWords = [] #set for words in a sentence that exists in a corpus

    split_str = triggerSentence.split()#split the sentence into words

    setOfValidWords = [word for word in split_str if listOfWords.count(word)]

    sortedSetOfValidWords = arraySorter(sorted(setOfValidWords, key = len))

    longestWord = sortedSetOfValidWords[-1]

    findLongestWord(longestWord)

    previousLongestWord = longestWord

    getSentence(longestWord, sentencesContainingLongestWord)

    triggerSentence = choice(sentencesContainingLongestWord)

    sentenceIndex = corpusSentences.index(triggerSentence)

    lengthCheck(sentenceIndex, triggerSentence, sentencesContainingLongestWord)

    triggerSentence = corpusSentences[sentenceIndex + 1]#get the sentence that is next to the previous trigger sentence

    print triggerSentence
    print "\n"

    corpusSentences.remove(triggerSentence)#in order to view the sentence index numbers, you can remove this one so index numbers are concurrent with actual gutenberg numbers


print "End of session, please rerun the program"
#initiated once the while loop exits, so that the program ends without errors

The computer I run the code on is somewhat old (a dual-core CPU bought in February 2006 and 2x512 MB of RAM bought in September 2004), so I'm not sure whether the slow runtime is due to my poor implementation or to the hardware. Any ideas on how to rescue it from this perilous state? Thanks in advance.

Best Answer

I think my first suggestion has to be: think carefully about what your routines do, and make sure the name describes that. Currently you have things like:

  • arraySorter neither deals with arrays nor sorts (it is an implementation of nub)
  • findLongestWord counts things or picks words by criteria not present in the algorithm description, yet ends up doing nothing at all, because longestWord is a local variable (an argument, so to speak)
  • getSentence appends an arbitrary number of sentences to a list
  • appending sounds like it might be a state checker, but operates only through side effects
  • there is considerable confusion between local and global variables; for instance, the global variable sentenceAppender is never used, nor is it an actor (e.g. a function) as its name suggests

For the task itself, what you really need is an index. It might be overkill to index every word; technically, you should only need index entries for the words that occur as the longest word of a sentence. Dictionaries are your primary tool here, and the second tool is lists. Once you have those indices, looking up a random sentence containing any given word takes only a dictionary lookup, a random.choice, and a list lookup; given the sentence-length restriction, it may take a few list lookups.
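A minimal sketch of that indexing idea, assuming NLTK's gutenberg corpus; the helper names longest_word, sentence_index, and next_trigger are invented here for illustration, and for simplicity this index covers every word rather than only the per-sentence longest words:

from collections import defaultdict
from random import choice
from nltk.corpus import gutenberg

# build the sentence list once, as plain strings
sentences = [" ".join(s) for s in gutenberg.sents()]

def longest_word(sentence):
    # longest token in the sentence; ties are resolved arbitrarily
    return max(sentence.split(), key=len)

# index built once: word -> indices of the sentences containing that word
sentence_index = defaultdict(list)
for i, sent in enumerate(sentences):
    for word in set(sent.split()):
        sentence_index[word].append(i)

def next_trigger(trigger):
    # one step of the generator: dictionary lookup, filter, random.choice, list lookup
    candidates = sentence_index.get(longest_word(trigger), [])
    candidates = [i for i in candidates
                  if i + 1 < len(sentences) and len(sentences[i + 1]) > 40]
    return sentences[choice(candidates) + 1] if candidates else None

With the index built up front, each step avoids rescanning the whole corpus the way the repeated listOfWords.count and corpusSentences.index calls in the original do, which is where most of the running time goes.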

This example should prove a good object lesson that modern hardware, or optimizers such as Psyco, cannot solve algorithmic problems.

Regarding python - How can I make this code run faster? (searching a large corpus for long words), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/5759959/
