
python - How to save a Python NLTK alignment model for later use?

Reposted. Author: 太空狗. Updated: 2023-10-29 17:32:21

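In Python, I'm using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be time-consuming, especially over a large corpus. It would be nice to run the alignments as a batch job one day and reuse them later.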

from nltk import IBMModel1 as ibm

biverses = [...]  # list of AlignedSent objects
model = ibm(biverses, 20)

with open(path + "eng-taq_model.txt", 'w') as f:
    f.write(model.train(biverses, 20))  # produces an empty file

Once the model has been created, how can I (1) save it to disk and (2) reuse it later?

Best answer

The immediate answer is to pickle it; see https://wiki.python.org/moin/UsingPickle

However, IBMModel1 holds a lambda function internally, so it cannot be pickled with the default pickle/cPickle (see https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 and https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L104).
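The limitation can be reproduced without NLTK. The following is a minimal sketch, assuming a defaultdict with a lambda default as a stand-in for whatever the model stores internally; the table contents are made up for illustration.

```python
import pickle
from collections import defaultdict

# Hypothetical stand-in for a model attribute: a table whose default
# value is produced by a lambda. No NLTK is needed for this demo.
table = defaultdict(lambda: 0.0)
table["haus"] = 0.9

try:
    pickle.dumps(table)
    pickled_ok = True
except (pickle.PicklingError, AttributeError, TypeError) as err:
    # pickle stores functions by name, and a lambda has no importable name.
    pickled_ok = False
    print("standard pickle failed:", type(err).__name__)

# A named, importable callable (here the built-in float, which returns 0.0
# when called without arguments) can be looked up by name, so pickle accepts it.
named_table = defaultdict(float, table)
restored = pickle.loads(pickle.dumps(named_table))
```

The standard pickle module serializes functions by reference (module plus name), which is why an anonymous lambda fails while a named callable round-trips.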

So we will use dill instead. First, install dill; see also "Can Python pickle lambda functions?"

$ pip install dill
$ python
>>> import dill as pickle

Then:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
...
>>> exit()

To use the pickled model:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> with open('model1.pk', 'rb') as fin:
...     ibm = pickle.load(fin)
...
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']
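The save-then-load steps above can be wrapped in a small caching helper so that training only happens when no saved file exists yet. This is a sketch under stated assumptions: `load_or_build`, `build_model`, and the cache path are hypothetical names, and a plain dict stands in for the trained model so the example runs without NLTK or dill.

```python
import os
import pickle
import tempfile

def load_or_build(path, build, serializer=pickle):
    """Return the object cached at `path`; build and save it first if missing.

    `serializer` is any module exposing dump(obj, file) and load(file),
    for example the standard pickle module or dill.
    """
    if os.path.exists(path):
        with open(path, "rb") as fin:
            return serializer.load(fin)
    obj = build()
    with open(path, "wb") as fout:
        serializer.dump(obj, fout)
    return obj

# Demo with a stand-in "model" (a plain dict); `calls` counts how often we train.
calls = []

def build_model():
    calls.append(1)
    return {"haus": {"house": 0.9}}

cache_path = os.path.join(tempfile.mkdtemp(), "model1.pk")
first = load_or_build(cache_path, build_model)   # builds and saves
second = load_or_build(cache_path, build_model)  # loads from disk, no rebuild
```

For the real model, one would presumably pass `serializer=dill` and a `build` callable that returns `IBMModel1(bitexts, 20)`, since the standard pickle fails on that object.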

If you try to pickle the IBMModel1 object, which contains a lambda function, with the default pickle, you get the following:

>>> import cPickle as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle function objects

(Note: the code snippets above were run with NLTK 3.0.0.)

With NLTK 3.0.0 on Python 3 you run into the same problem, because IBMModel1 holds a lambda function; dill works there as well:

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('mode1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
_pickle.PicklingError: Can't pickle <function IBMModel1.train.<locals>.<lambda> at 0x7fa37cf9d620>: attribute lookup <lambda> on nltk.align.ibm1 failed'

>>> import dill
>>> with open('model1.pk', 'wb') as fout:
...     dill.dump(ibm, fout)
...
>>> exit()

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from nltk.corpus import comtrans
>>> with open('model1.pk', 'rb') as fin:
...     ibm = dill.load(fin)
...
>>> bitexts = comtrans.aligned_sents()[:100]
>>> aligned_sent = ibm.aligned(bitexts[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'IBMModel1' object has no attribute 'aligned'
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']
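If installing dill is not an option, one alternative (not the method used in this answer) is to copy the learned numbers into plain dicts, which the standard pickle handles. The sketch below assumes the probabilities live in nested defaultdicts with lambda defaults; whether IBMModel1 exposes such a table, and under which attribute, is not stated here, so `probs` and `to_plain` are purely illustrative.

```python
import pickle
from collections import defaultdict

def to_plain(obj):
    """Recursively copy dict-like objects (including defaultdicts) into plain dicts."""
    if isinstance(obj, dict):
        return {key: to_plain(value) for key, value in obj.items()}
    return obj

# Hypothetical stand-in for a trained table: nested defaultdicts with lambda defaults.
probs = defaultdict(lambda: defaultdict(lambda: 1e-12))
probs["haus"]["house"] = 0.9
probs["der"]["the"] = 0.8

plain = to_plain(probs)                        # no lambdas left in the structure
restored = pickle.loads(pickle.dumps(plain))   # standard pickle now succeeds
```

The trade-off is that the restored table no longer supplies default values for unseen keys, so lookups need `.get(key, default)` or the defaults must be re-attached after loading.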

(Note: in Python 3, pickle takes the place of cPickle, and the C implementation is used automatically when available; see http://docs.pythonsprints.com/python3_porting/py-porting.html)
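For code that has to run on both Python 2 and Python 3, the note above is commonly handled with a fallback import. A minimal sketch, with made-up sample data:

```python
try:
    import cPickle as pickle  # Python 2: the faster C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is picked up automatically

# Protocol 2 is the highest protocol that both Python 2 and Python 3 understand.
data = {"der": "the"}
roundtrip = pickle.loads(pickle.dumps(data, protocol=2))
```

This only unifies the module name; it does not make lambdas picklable, so dill is still needed for the IBMModel1 object.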

Regarding "python - How to save a Python NLTK alignment model for later use?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30195287/
