
python multiprocessing - text processing


I am trying to create a multiprocessing version of some text classification code I found here (along with other cool things). I have attached the full code below.

I have tried a couple of approaches - first a lambda function, but it complained that it could not be serialized (!?), so I tried a stripped-down version of the original code:

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

p = Pool(2)
negfeats = []
posfeats = []

for f in negids:
    words = movie_reviews.words(fileids=[f])
    negfeats = p.map(featx, words)  # not same form as below - using for debugging

print len(negfeats)

Unfortunately this does not work either - I get the following traceback:

File "/usr/lib/python2.6/multiprocessing/pool.py", line 148, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.6/multiprocessing/pool.py", line 422, in get
raise self._value
ZeroDivisionError: float division

Any idea what I might be doing wrong? Should I be using pool.apply_async instead (that by itself does not seem to solve the problem either - but perhaps I am barking up the wrong tree)?

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    negcutoff = len(negfeats)*3/4
    poscutoff = len(posfeats)*3/4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()

Best Answer

Regarding your stripped-down version: are you using a featx function different from the one used in http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ ?

The exception most likely happens inside featx, and multiprocessing just re-raises it, although it does not include the original traceback, which makes it somewhat useless.
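Since the worker-side traceback is lost here (Python 2's Pool only hands back the exception value), one hedged workaround is a small wrapper that captures the traceback inside the worker and re-raises it as part of the message. This is a minimal sketch; featx_debug is a hypothetical wrapper, not part of the original code:

import traceback

def featx_debug(words):
    # hypothetical wrapper around featx: format the traceback where the
    # error actually occurs (in the worker process) and re-raise it, so
    # the parent process can see which line inside featx failed
    try:
        return featx(words)
    except Exception:
        raise Exception(traceback.format_exc())

# map the wrapper instead of featx itself
negfeats = p.map(featx_debug, words)

With that in place the ZeroDivisionError should point at the exact statement inside featx rather than at pool.py.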

Try first to run it without pool.map() (i.e. negfeats = [featx(w) for w in words]), or include something in featx that you can debug with.
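For example, a purely serial sanity check of the same loop (a sketch using only the names already defined in the stripped-down version) would be:

for f in negids:
    words = movie_reviews.words(fileids=[f])
    # same call pattern as p.map(featx, words), but in-process,
    # so any exception surfaces with its full traceback
    negfeats = [featx(w) for w in words]

print len(negfeats)

If this raises the same ZeroDivisionError, multiprocessing is not the culprit.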

If that still does not help, post the entire script you are working with in your original question (simplified where possible) so that others can run it and give a more targeted answer. Note that the following snippet actually works (an adaptation of your stripped-down version):

from nltk.corpus import movie_reviews
from multiprocessing import Pool

def featx(words):
    return dict([(word, True) for word in words])

if __name__ == "__main__":
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    p = Pool(2)
    negfeats = []
    posfeats = []

    for f in negids:
        words = movie_reviews.words(fileids=[f])
        negfeats = p.map(featx, words)

    print len(negfeats)
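Note that in this snippet (as in the stripped-down version) p.map(featx, words) applies featx to each individual word of a single file, which is not what evaluate_classifier does; that function builds one feature dict per review file. A parallel version closer to the original intent might look like the following sketch (file_feats is a hypothetical helper introduced here; it must be a module-level function so the pool can pickle it):

from multiprocessing import Pool
from nltk.corpus import movie_reviews

def featx(words):
    return dict([(word, True) for word in words])

def file_feats(f):
    # hypothetical helper: one feature dict per review file,
    # mirroring featx(movie_reviews.words(fileids=[f])) above
    return featx(movie_reviews.words(fileids=[f]))

if __name__ == "__main__":
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    p = Pool(2)
    # parallelize over files, not over the words of a single file
    negfeats = [(feats, 'neg') for feats in p.map(file_feats, negids)]
    posfeats = [(feats, 'pos') for feats in p.map(file_feats, posids)]

    print len(negfeats), len(posfeats)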

Regarding python multiprocessing - text processing, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/3081044/
