
Python multiprocessing of NLTK word_tokenizer - function never completes


I'm using NLTK to do natural language processing on some fairly large datasets and would like to make use of all my processor cores. The multiprocessing module seems to be what I'm after, and when I run the following test code I can see all cores being used, but the code never completes.

Executing the same task without multiprocessing finishes in about a minute.

Python 2.7.11 on Debian.

from nltk.tokenize import word_tokenize
import io
import time
import multiprocessing as mp

def open_file(filepath):
    # open and parse file
    file = io.open(filepath, 'rU', encoding='utf-8')
    text = file.read()
    return text

def mp_word_tokenize(text_to_process):
    # word tokenize
    start_time = time.clock()
    pool = mp.Pool(processes=8)
    word_tokens = pool.map(word_tokenize, text_to_process)
    finish_time = time.clock() - start_time
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens'
    return word_tokens

filepath = "./p40_compiled.txt"
text = open_file(filepath)
tokenized_text = mp_word_tokenize(text)
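
Note on the code above: pool.map iterates over its second argument, and iterating a Python str yields individual characters, so word_tokenize ends up being submitted once per character of the entire file. That explosion of tiny tasks is the likely reason the run never appears to finish. A minimal sketch of the iteration behaviour (the shout helper is hypothetical, for demonstration only):

import multiprocessing as mp

def shout(item):
    return item.upper()

if __name__ == '__main__':
    pool = mp.Pool(processes=2)
    # mapping over a str submits one task per character: prints ['A', 'B', 'C']
    print(pool.map(shout, 'abc'))
    pool.close()
    pool.join()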

Best answer

Deprecated

This answer is outdated. Please see https://stackoverflow.com/a/54032108/610569 instead.


Here's a cheater's way to do multi-threading, using sframe:

>>> import sframe
>>> import time
>>> from nltk import word_tokenize
>>>
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
>>> response = urllib.request.urlopen(url)
>>> data = response.read().decode('utf8')
>>>
>>> for _ in range(10):
...     start = time.time()
...     for line in data.split('\n'):
...         x = word_tokenize(line)
...     print ('word_tokenize():\t', time.time() - start)
...
word_tokenize(): 4.058445692062378
word_tokenize(): 4.05820369720459
word_tokenize(): 4.090051174163818
word_tokenize(): 4.210559129714966
word_tokenize(): 4.17473030090332
word_tokenize(): 4.105806589126587
word_tokenize(): 4.082665681838989
word_tokenize(): 4.13646936416626
word_tokenize(): 4.185062408447266
word_tokenize(): 4.085020065307617

>>> sf = sframe.SFrame(data.split('\n'))
>>> for _ in range(10):
...     start = time.time()
...     x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
...     print ('word_tokenize() with sframe:\t', time.time() - start)
...
word_tokenize() with sframe: 7.174573659896851
word_tokenize() with sframe: 5.072867393493652
word_tokenize() with sframe: 5.129574775695801
word_tokenize() with sframe: 5.10952091217041
word_tokenize() with sframe: 5.015898942947388
word_tokenize() with sframe: 5.037845611572266
word_tokenize() with sframe: 5.015375852584839
word_tokenize() with sframe: 5.016635894775391
word_tokenize() with sframe: 5.155989170074463
word_tokenize() with sframe: 5.132697105407715

>>> for _ in range(10):
...     start = time.time()
...     x = [word_tokenize(line) for line in data.split('\n')]
...     print ('str.split():\t', time.time() - start)
...
str.split(): 4.176181793212891
str.split(): 4.116339921951294
str.split(): 4.1104896068573
str.split(): 4.140819549560547
str.split(): 4.103625774383545
str.split(): 4.125757694244385
str.split(): 4.10755729675293
str.split(): 4.177418947219849
str.split(): 4.11145281791687
str.split(): 4.140623092651367

Note that the speed difference may be because I had other things running on the other cores. With a larger dataset and dedicated cores, you would really see this scale.
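
For comparison, plain multiprocessing also parallelizes well once the input is split into lines instead of being passed as a single string. Below is a minimal sketch under that assumption; it is not the approach from the linked answer, and the tokenize_lines helper is hypothetical:

from multiprocessing import Pool
from nltk import word_tokenize  # requires NLTK's punkt tokenizer data (nltk.download('punkt'))
import io
import time

def tokenize_lines(filepath, processes=8):
    # read the file and split into lines, so each worker receives a line rather than a character
    with io.open(filepath, encoding='utf-8') as f:
        lines = f.read().splitlines()
    start = time.time()
    pool = Pool(processes=processes)
    try:
        tokens = pool.map(word_tokenize, lines)
    finally:
        pool.close()
        pool.join()
    print('Tokenized %d lines in %.2f seconds' % (len(tokens), time.time() - start))
    return tokens

if __name__ == '__main__':
    tokenized = tokenize_lines('./p40_compiled.txt')  # path taken from the question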

Regarding "Python multiprocessing of NLTK word_tokenizer - function never completes", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35512594/
