
python-3.x - Python: Spacy and memory consumption


1 - Problem

I am using spacy in Python to lemmatize text documents.
There are 500,000 documents, each up to 20 MB of plain text.

The problem is the following: spacy's memory consumption keeps growing over time until all of the available memory is used.

2 - Background

My hardware configuration:
CPU: Intel i7-8700K 3.7 GHz (12 cores)
RAM: 16 GB
SSD: 1 TB
GPU: on board, but not used for this task

I am using "multiprocessing" to split the task across several processes (workers).
Each worker receives a list of documents to process.
The main process monitors the child processes.
I load "spacy" once in each child process and use this single spacy instance to process the worker's whole list of documents.

The memory trace shows the following:

[ Memory trace - Top 10 ]

/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB

/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B

:487: size=9550 KiB, count=77746, average=126 B

/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB

/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B

prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB

/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB

/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B

/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B

prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB



3 - Expectations

I have seen a good recommendation to build a separate server-client solution [here]: Is possible to keep spacy in memory to reduce the load time?

Is it possible to keep memory consumption under control with the "multiprocessing" approach?

4 - Code

Here is a simplified version of my code:

import gc, os, subprocess, spacy, sys, tracemalloc
from multiprocessing import Pipe, Process, Lock
from time import sleep

# START: memory trace
tracemalloc.start()

# Load spacy
spacyMorph = spacy.load("en_core_web_sm")

#
# Get word's lemma
#
def getLemma(word):
    global spacyMorph
    lemmaOutput = spacyMorph(str(word))
    return lemmaOutput


#
# Worker's logic
#
def workerNormalize(lock, conn, params):
    documentCount = 1
    for filenameRaw in params[1]:
        documentTotal = len(params[1])
        documentID = int(os.path.basename(filenameRaw).split('.')[0])

        # Send the worker's current progress to the main process
        statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
        if lock is not None:
            lock.acquire()
            try:
                conn.send(statusMessage)
            finally:
                lock.release()
        else:
            print(statusMessage)
        documentCount += 1

        # ----------------
        # Some code is excluded for clarity's sake
        # I've got a "wordList" from file "filenameRaw"
        # ----------------

        wordCount = 1
        wordTotalCount = len(wordList)

        for word in wordList:
            lemma = getLemma(word)
            wordCount += 1

        # ----------------
        # Then I collect all lemmas and save them to another text file
        # ----------------

        # Here I'm trying to reduce memory usage
        del wordList
        del word
        gc.collect()


if __name__ == '__main__':
    lock = Lock()
    processList = []

    # ----------------
    # Some code is excluded for clarity's sake
    # Here I'm getting the full list of files "fileTotalList" which I need to lemmatize
    # ----------------
    while cursorEnd < (docTotalCount + stepSize):
        fileList = fileTotalList[cursorStart:cursorEnd]

        # ----------------
        # Create a worker and give it the list of files to process
        # ----------------
        processData = {}
        processData['total'] = len(fileList)   # worker total progress
        processData['count'] = 0               # count of documents the worker has done
        processData['currentDocID'] = 0        # current document ID the worker is working on
        processData['comment'] = ''            # additional comment (optional)
        processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
        processName = 'worker ' + str(count) + " at " + str(cursorStart)
        processData['handler'] = Process(target=workerNormalize, name=processName,
                                         args=(lock, processData['con_child'], [processName, fileList]))

        processList.append(processData)
        processData['handler'].start()

        cursorStart = cursorEnd
        cursorEnd += stepSize
        count += 1

    # ----------------
    # Run the monitor to look after the workers
    # ----------------
    while True:
        runningCount = 0

        # Worker communication format:
        # STATUS:COMMENTS
        #
        # STATUS:
        # - WORKING - worker is working
        # - CLOSED  - worker has finished its job and closed the pipe connection
        #
        # COMMENTS (for WORKING status):
        # DOCID,COUNT,COMMENTS
        # DOCID    - current document ID the worker is working on
        # COUNT    - count of documents done
        # COMMENTS - additional comments (optional)

        # ----------------
        # Run through the list of workers ...
        # ----------------
        for i, process in enumerate(processList):
            if process['handler'].is_alive():
                runningCount += 1

                # ----------------
                # ... and check if there is something in the PIPE
                # ----------------
                if process['con_parent'].poll():
                    try:
                        message = process['con_parent'].recv()
                        status = message.split(':')[0]
                        comment = message.split(':')[1]

                        # ----------------
                        # Some code is excluded for clarity's sake
                        # Update the worker's information and progress in "processList"
                        # ----------------

                    except EOFError:
                        print("EOF----")

                # ----------------
                # Some code is excluded for clarity's sake
                # Here I draw some progress lines per worker
                # ----------------

            else:
                # The worker has finished its job. Close the connection.
                process['con_parent'].close()

        if runningCount == 0:
            # All workers have finished; stop monitoring
            break

        # Wait for some time and monitor again
        sleep(PARAM['MONITOR_REFRESH_FREQUENCY'])

    print("================")
    print("**** DONE ! ****")
    print("================")

    # ----------------
    # Here I'm measuring memory usage to find the most "gluttonous" part of the code
    # ----------------
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')

    print("[ Memory trace - Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)



Best Answer

Memory leaks with spacy

Memory problems when processing large amounts of data seem to be a known issue, see some related github issues:

  • https://github.com/explosion/spaCy/issues/3623
  • https://github.com/explosion/spaCy/issues/3556

Unfortunately, it doesn't look like there is a good solution yet.
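
One thing that can at least keep the growth bounded on the multiprocessing side (a minimal sketch under assumptions about your file layout, not a fix for the underlying leak) is to recycle worker processes after a fixed number of documents, for example with multiprocessing.Pool and its maxtasksperchild argument; the pool size, batch size and file names below are placeholders:

import spacy
from multiprocessing import Pool

nlp = None  # one spacy instance per worker process

def init_worker():
    global nlp
    # Lemmatization/tagging only: parser and NER are disabled
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    nlp.max_length = 30_000_000  # assumption: large enough for ~20 MB plain-text files

def lemmatize_file(path):
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    return path, [token.lemma_ for token in nlp(text)]

if __name__ == "__main__":
    files = ["0001.txt", "0002.txt"]  # placeholder file list
    # maxtasksperchild recycles each worker after 20 files, so whatever
    # memory spacy has accumulated is released when the old process exits
    with Pool(processes=4, initializer=init_worker, maxtasksperchild=20) as pool:
        for path, lemmas in pool.imap_unordered(lemmatize_file, files):
            pass  # write the lemmas for this file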

Lemmatization

Looking at your particular lemmatization task, I think your example code is a bit over-simplified, because you are running the full spacy pipeline on individual words and then not doing anything with the results (not even inspecting the lemma?), so it is hard to tell what you actually want to do.

I will assume you just want to lemmatize, so in general you want to disable the parts of the pipeline you are not using as much as possible (especially the parser if you are only lemmatizing, see https://spacy.io/usage/processing-pipelines#disabling) and to use nlp.pipe to process documents in batches. Spacy cannot handle very long documents if you are using the parser or entity recognition, so you will need to break up your texts somehow (or, for just lemmatization/tagging, you can increase nlp.max_length as much as you need).

Breaking documents into individual words, as in your example, defeats the purpose of most of spacy's analysis (you often cannot meaningfully tag or parse individual words), and on top of that, calling spacy this way is going to be very slow.

Lookup lemmatization

If you just need lemmas of common words out of context (where the tagger is not going to provide any useful information), you can see whether the lookup lemmatizer is good enough for your task and skip the rest of the processing:

from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LOOKUP
lemmatizer = Lemmatizer(lookup=LOOKUP)
print(lemmatizer(u"ducks", ''), lemmatizer(u"ducking", ''))

Output:

['duck'] ['duck']



It is just a static lookup table, so it will not do well on unknown words or on capitalization variants like "wugs" or "DUCKS", so you will have to see whether it works well enough for your texts, but it would be much, much faster and without memory leaks. (You can also just use the table yourself without spacy; it is here: https://github.com/michmech/lemmatization-lists.)
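
If you go that route, the tables are plain text files; here is a minimal sketch for using one directly, assuming each line is a tab-separated "lemma<TAB>inflected form" pair (check the repository for the exact layout, and the filename below is a hypothetical local copy):

def load_lookup_table(path):
    # Assumption: one "lemma<TAB>inflected form" pair per line
    table = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                lemma, form = parts
                table[form.lower()] = lemma
    return table

lemmas = load_lookup_table("lemmatization-en.txt")  # hypothetical local copy
print(lemmas.get("ducks", "ducks"))  # fall back to the word itself if it is unknown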

Better lemmatization

Otherwise, use something more like the following to process your texts in batches:

nlp = spacy.load('en', disable=['parser', 'ner'])
# if needed: nlp.max_length = MAX_DOC_LEN_IN_CHAR
for doc in nlp.pipe(texts):
    for token in doc:
        print(token.lemma_)

If you process one long text (or use nlp.pipe() on lots of shorter texts) instead of processing individual words, you should be able to tag/lemmatize (many) thousands of words per second in one thread.
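
Applied to the per-file loop from the question, that could look roughly like the sketch below (the read helper, output naming and batch size are placeholder choices; as_tuples=True is only used to keep track of which file each doc came from):

import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# nlp.max_length may need to be raised for very large plain-text files (see above)

def lemmatize_files(file_list):
    def read(path):
        with open(path, encoding="utf-8") as fh:
            return fh.read()

    # Stream (text, filename) pairs through the pipeline; the filename is
    # passed through untouched so results can be matched to their source file
    pairs = ((read(path), path) for path in file_list)
    for doc, path in nlp.pipe(pairs, as_tuples=True, batch_size=20):
        with open(path + ".lemmas", "w", encoding="utf-8") as out:
            out.write(" ".join(token.lemma_ for token in doc))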

Regarding python-3.x - Python: Spacy and memory consumption, the original question and answer can be found on Stack Overflow: https://stackoverflow.com/questions/55841087/
