
python - Unable to download nltk corpora on AWS EMR: I/O operation on closed file

Reposted. Author: 行者123. Updated: 2023-12-05 02:58:04

After opening my EMR cluster with JupyterLab, I cannot download additional corpora with nltk.download().

Code

nltk.download('wordnet')

Error

I/O operation on closed file
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 817, in download
show('Downloading collection %r' % msg.collection.id)
File "/usr/local/lib/python3.6/site-packages/nltk/downloader.py", line 783, in show
subsequent_indent=prefix + prefix2 + ' ' * 4,
File "/tmp/4461650941863117011", line 534, in write
super(UnicodeDecodingStringIO, self).write(s)
ValueError: I/O operation on closed file

This happened after I had confirmed with sc.list_packages() that nltk is installed:

Package                    Version
-------------------------- -------
...
nltk 3.4.5
...

and after importing it with import nltk.

I suspect the problem comes from my not understanding how EMR is set up.

Is there anything I should try in order to debug this?

Update:

I tried installing it from a bootstrap script, and the installation itself succeeds.

pip install nltk
python -m nltk.downloader wordnet

But I still get this error when I try to use the corpus.

An error occurred while calling o166.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 7, ip-172-31-1-163.ca-central-1.compute.internal, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py", line 86, in __load
root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))
File "/usr/local/lib/python3.6/site-packages/nltk/data.py", line 701, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource wordnet not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('wordnet')

For more information see: https://www.nltk.org/data.html

Attempted to load corpora/wordnet.zip/wordnet/

Searched in:
- '/home/nltk_data'
- '/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/virtualenv_application_1576604798325_0001_0/nltk_data'
- '/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/virtualenv_application_1576604798325_0001_0/share/nltk_data'
- '/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/virtualenv_application_1576604798325_0001_0/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/worker.py", line 377, in main
process()
File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/worker.py", line 372, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/serializers.py", line 345, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/serializers.py", line 334, in _batched
for item in iterator:
File "<string>", line 1, in <lambda>
File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
return lambda *a: f(*a)
File "/mnt1/yarn/usercache/livy/appcache/application_1576604798325_0001/container_1576604798325_0001_01_000005/pyspark.zip/pyspark/util.py", line 113, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 19, in <lambda>
File "<stdin>", line 19, in <listcomp>
File "/usr/local/lib/python3.6/site-packages/nltk/stem/wordnet.py", line 41, in lemmatize
lemmas = wordnet._morphy(word, pos)
File "/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py", line 123, in __getattr__
self.__load()
File "/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py", line 88, in __load
raise e
File "/usr/local/lib/python3.6/site-packages/nltk/corpus/util.py", line 83, in __load
root = nltk.data.find('{}/{}'.format(self.subdir, self.__name))
File "/usr/local/lib/python3.6/site-packages/nltk/data.py", line 701, in find
raise LookupError(resource_not_found)
LookupError:

Update

I found the directory that the shell script downloaded wordnet to, and confirmed over ssh that it really exists on the server.

[nltk_data] Downloading package wordnet to /root/nltk_data...

So in Jupyter I checked nltk.data.path:

['/var/lib/livy/nltk_data', '/tmp/1576616653412-0/nltk_data', '/tmp/1576616653412-0/share/nltk_data', '/tmp/1576616653412-0/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']

Then I appended my new path.

nltk.data.path.append('/root/nltk_data')
nltk.data.path

We can see that it was added.

['/var/lib/livy/nltk_data', '/tmp/1576616653412-0/nltk_data', '/tmp/1576616653412-0/share/nltk_data', '/tmp/1576616653412-0/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data', '/root/nltk_data']

But when I call a function that uses the corpus, the new path is still not searched.

  Resource wordnet not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('wordnet')

For more information see: https://www.nltk.org/data.html

Attempted to load corpora/wordnet.zip/wordnet/

Searched in:
- '/home/nltk_data'
- '/mnt1/yarn/usercache/livy/appcache/application_1576615748346_0001/container_1576615748346_0001_01_000006/virtualenv_application_1576615748346_0001_0/nltk_data'
- '/mnt1/yarn/usercache/livy/appcache/application_1576615748346_0001/container_1576615748346_0001_01_000006/virtualenv_application_1576615748346_0001_0/share/nltk_data'
- '/mnt1/yarn/usercache/livy/appcache/application_1576615748346_0001/container_1576615748346_0001_01_000006/virtualenv_application_1576615748346_0001_0/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'

/root/nltk_data does not appear anywhere in this list.
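
One likely explanation, offered as an assumption rather than something the original post confirms: the notebook cell runs in the Spark driver process, while the lemmatizer runs in separate Python worker processes on the executors, and each of those imports nltk afresh and builds its own nltk.data.path. The stdlib-only sketch below illustrates the general effect, using sys.path as a stand-in for nltk.data.path. MARKER and fresh_process_sees are hypothetical names chosen for this illustration.

```python
import subprocess
import sys

# Hypothetical marker directory, standing in for the path appended in the notebook.
MARKER = "/root/nltk_data"


def fresh_process_sees(marker):
    """Start a new Python interpreter and report whether marker is on its sys.path."""
    code = "import sys; print({!r} in sys.path)".format(marker)
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip() == "True"


# Mutate the list in this process only (analogous to nltk.data.path.append).
sys.path.append(MARKER)

seen_here = MARKER in sys.path
seen_in_child = fresh_process_sees(MARKER)
print("this process:", seen_here)
print("fresh process:", seen_in_child)
```

The append is visible only in the process that performed it. If the same holds for nltk.data.path, the executors would never see a path appended in the notebook session.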

Best answer

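Since I could not change the path used to load wordnet (changing nltk.data.path did not change where nltk looked for the files), I had to change the directory that the bootstrap script downloads to, so that it matches one of the locations nltk searches by default.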
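
To see where those default locations come from, here is a minimal sketch that approximates how NLTK appears to build its default search list on Unix-like systems: the NLTK_DATA environment variable, then the home directory, then the interpreter prefix, then a few fixed system paths. default_nltk_search_paths is a hypothetical helper written for this illustration, not part of NLTK. The home and prefix values are taken from the paths shown in the question.

```python
import posixpath


def default_nltk_search_paths(home, prefix, env_value=""):
    """Approximate NLTK's default data search list on a Unix-like system.

    home:      the process's home directory (what "~" expands to)
    prefix:    the interpreter prefix (sys.prefix, e.g. a virtualenv root)
    env_value: contents of the NLTK_DATA environment variable, if set
    """
    # Directories from NLTK_DATA come first, if any.
    paths = [d for d in env_value.split(posixpath.pathsep) if d]
    # Then the per-user directory under the home directory.
    paths.append(posixpath.join(home, "nltk_data"))
    # Then interpreter-relative and fixed system locations.
    paths += [
        posixpath.join(prefix, "nltk_data"),
        posixpath.join(prefix, "share", "nltk_data"),
        posixpath.join(prefix, "lib", "nltk_data"),
        "/usr/share/nltk_data",
        "/usr/local/share/nltk_data",
        "/usr/lib/nltk_data",
        "/usr/local/lib/nltk_data",
    ]
    return paths


# Values read off the path lists quoted in the question.
driver_paths = default_nltk_search_paths("/var/lib/livy", "/tmp/1576616653412-0")
executor_paths = default_nltk_search_paths(
    "/home",
    "/mnt1/yarn/usercache/livy/appcache/application_1576615748346_0001/"
    "container_1576615748346_0001_01_000006/virtualenv_application_1576615748346_0001_0",
)

print(driver_paths[0])
print(executor_paths[0])
```

Under this reading, the /home/nltk_data entry in the error message suggests the executor processes run with their home directory set to /home, which is why downloading into that directory works.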

Bootstrap script

sudo pip install nltk
sudo python -m nltk.downloader -d /home/nltk_data wordnet
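
After the bootstrap script has run, it can be useful to check on a node that the corpus really sits under one of the searched directories. The sketch below is a stdlib-only check. find_corpus is a hypothetical helper, and the corpora/<name> and corpora/<name>.zip layout is an assumption based on the "Attempted to load corpora/wordnet.zip/wordnet/" line in the error message.

```python
import os


def find_corpus(name, search_paths):
    """Return the first existing corpora/<name> directory or corpora/<name>.zip
    file under the given search paths, or None if neither exists anywhere."""
    for root in search_paths:
        for candidate in (
            os.path.join(root, "corpora", name),
            os.path.join(root, "corpora", name + ".zip"),
        ):
            if os.path.exists(candidate):
                return candidate
    return None


# Directories copied from the "Searched in" list of the executor error.
EXECUTOR_SEARCH_PATHS = [
    "/home/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]

print(find_corpus("wordnet", EXECUTOR_SEARCH_PATHS))
```

A result of None on a worker node would mean the download landed somewhere the executors do not search. The check says nothing about file permissions, so the directory also needs to be readable by the user that runs the executors.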

This post is based on a similar question on Stack Overflow, "python - Unable to download nltk corpora on AWS EMR: I/O operation on closed file": https://stackoverflow.com/questions/59379230/
