python - What is the recommended way to serialize a collection of spaCy Docs?

Reposted. Author: 行者123. Updated: 2023-12-02 09:31:53

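I'm working with a large number of short texts that I need to annotate and save to disk. Ideally, I'd like to save and load them as spaCy Doc objects. Obviously, I don't want to save the Language or Vocab objects more than once (but I'm happy to save/load them once for a collection of Docs).

The Doc object has a to_disk method and a to_bytes method, but it isn't obvious to me how to save a bunch of documents to the same file. Is there a preferred way of doing this? I'm looking for something as space-efficient as possible.

Currently I'm doing this, but I'm not very happy with it: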

import codecs

def serialize_docs(docs):
    """
    Writes spaCy Doc objects to a newline-delimited string that can be used to load them later,
    given the same Vocab object that was used to create them.
    """
    # codecs.encode(..., 'hex') returns bytes on Python 3, so decode before joining as str.
    return '\n'.join(codecs.encode(doc.to_bytes(), 'hex').decode('ascii') for doc in docs)

def write_docs(filename, docs):
    """
    Writes spaCy Doc objects to a file.
    """
    serialized_docs = serialize_docs(docs)
    with open(filename, 'w') as f:
        f.write(serialized_docs)

Best answer

As of spaCy 2.2, the correct answer is to use DocBin.

As the spaCy docs now say,

If you’re working with lots of data, you’ll probably need to pass analyses between machines, either to use something like Dask or Spark, or even just to save out work to disk. Often it’s sufficient to use the Doc.to_array functionality for this, and just serialize the numpy arrays – but other times you want a more general way to save and restore Doc objects.

The DocBin class makes it easy to serialize and deserialize a collection of Doc objects together, and is much more efficient than calling Doc.to_bytes on each individual Doc object. You can also control what data gets saved, and you can merge pallets together for easy map/reduce-style processing.
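The merging mentioned in the quoted passage can be sketched roughly as follows. This is a minimal illustration rather than an official recipe: it assumes a blank English pipeline (so no model download is needed), the shard texts and the names part_a and part_b are made up for the example, and both DocBin objects are created with the same attrs because merge raises an error when they differ.

```python
import spacy
from spacy.tokens import DocBin

# Assumption: a blank pipeline is enough to illustrate merging (tokenizer only).
nlp = spacy.blank("en")
attrs = ["LEMMA", "ENT_IOB", "ENT_TYPE"]  # must match across the bins being merged

# Two hypothetical shards, e.g. produced by two separate workers.
part_a = DocBin(attrs=attrs)
for doc in nlp.pipe(["First shard, text one.", "First shard, text two."]):
    part_a.add(doc)

part_b = DocBin(attrs=attrs)
for doc in nlp.pipe(["Second shard."]):
    part_b.add(doc)

# Fold the second shard into the first (the "reduce" step).
part_a.merge(part_b)
merged_docs = list(part_a.get_docs(nlp.vocab))
```

After the merge, part_a holds the documents of both shards in order, so a single to_bytes call serializes the combined collection.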

Example

import spacy
from spacy.tokens import DocBin

doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
texts = ["Some text", "Lots of texts...", "..."]
nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
bytes_data = doc_bin.to_bytes()

# Deserialize later, e.g. in a new process
nlp = spacy.blank("en")
doc_bin = DocBin().from_bytes(bytes_data)
docs = list(doc_bin.get_docs(nlp.vocab))
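Since the question asks about saving many documents to a single file, here is a hedged sketch of one way to do that with the bytes produced by DocBin. It assumes a blank English pipeline instead of en_core_web_sm so that it runs without a model download, and the temporary directory and the file name docs.spacy are arbitrary choices for illustration.

```python
import os
import tempfile

import spacy
from spacy.tokens import DocBin

# Assumption: a blank pipeline (tokenizer only) stands in for a full model.
nlp = spacy.blank("en")
texts = ["Some text", "Lots of texts...", "..."]

doc_bin = DocBin(store_user_data=True)
for doc in nlp.pipe(texts):
    doc_bin.add(doc)

# Write the whole collection to one binary file (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "docs.spacy")
with open(path, "wb") as f:
    f.write(doc_bin.to_bytes())

# Read it back later; only the Vocab is needed to rebuild the Doc objects.
with open(path, "rb") as f:
    restored = DocBin(store_user_data=True).from_bytes(f.read())
docs = list(restored.get_docs(nlp.vocab))
```

The file is opened in binary mode because to_bytes returns raw bytes, which avoids the hex encoding used in the question. spaCy v3 also offers DocBin.to_disk and DocBin.from_disk, which should make the explicit open calls unnecessary on those versions.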

Regarding "python - What is the recommended way to serialize a collection of spaCy Docs?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49618917/
