
python - How do I output NLTK chunks to a file?

Reposted · Author: 太空狗 · Updated: 2023-10-30 01:51:02

I have this Python script in which I use the nltk library to parse, tokenize, tag, and chunk some (let's say random) text from the web.

I need to format the output of chunked1, chunked2, and chunked3 and write it to a file. They are of type class 'nltk.tree.Tree'.

More specifically, I only need to write the lines that match the regular expressions chunkGram1, chunkGram2, and chunkGram3.

How can I do that?

#! /usr/bin/python2.7

import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]


def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)

        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)

        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)

        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:

        #     for i, line in enumerate(chunked1):
        #         if "JJ" in line:
        #             outfile.write(line)
        #         elif "NNP" in line:
        #             outfile.write(line)


processLanguage()

For now, when I try to run it, I get this error:

Traceback (most recent call last):
  File "sentdex.py", line 47, in <module>
    processLanguage()
  File "sentdex.py", line 40, in processLanguage
    outfile.write(line)
  File "C:\Python27\lib\codecs.py", line 688, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found
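The TypeError arises because iterating over an nltk.tree.Tree yields (word, tag) tuples and Tree subtrees, not strings, so outfile.write(line) receives a tuple. A minimal sketch of writing only the subtrees the chunk grammar matched (assuming NLTK is installed; the sentence here is hand-tagged so no tagger models are needed, and output.txt is a made-up file name):

```python
import io
import nltk

# A hand-tagged sentence, in the same (word, tag) format nltk.pos_tag returns.
tagged = [('An', 'DT'), ('electronic', 'JJ'), ('library', 'NN'),
          ('is', 'VBZ'), ('a', 'DT'), ('focused', 'JJ'), ('collection', 'NN')]

chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunked1 = nltk.RegexpParser(chunkGram1).parse(tagged)

# Iterating the tree yields tuples and subtrees; keep only the Chunk
# subtrees and convert each one to a string before writing.
with io.open('output.txt', 'w', encoding='utf8') as outfile:
    for subtree in chunked1.subtrees(filter=lambda t: t.label() == 'Chunk'):
        outfile.write(str(subtree) + u'\n')
```

Under Python 2.7, io.open requires unicode, so the str(...) call would need to become unicode(...), as discussed in the answer.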

EDIT: After @Alvas's answer I managed to do what I wanted. Now, however, I would like to know how to strip all non-ASCII characters from a text corpus. Example:

#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

for i, line in enumerate(xstring):
    line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged
processLanguage()

The above was taken from another answer on S/O, but it doesn't seem to work. What could be wrong? The error I get is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)
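Two things go wrong in that snippet: under Python 2, plain open() returns undecoded bytes, so non-ASCII UTF-8 sequences later blow up with this UnicodeDecodeError; and the for-loop only rebinds the loop variable line, so xstring itself is never changed. A sketch of both fixes, decoding explicitly and building a new list (the sample file is written first only to keep the sketch self-contained):

```python
import io

# A tiny sample corpus containing a non-ASCII character (an en dash,
# U+2013), written here only so the sketch is self-contained.
with io.open('file.txt', 'w', encoding='utf8') as f:
    f.write(u'digital libraries \u2013 a focused collection\n')

def remove_non_ascii(line):
    # Replace any character outside the ASCII range with a space.
    return u''.join(ch if ord(ch) < 128 else u' ' for ch in line)

# Decode explicitly instead of relying on the ASCII default, and build
# a new list -- rebinding the loop variable, as in the question's
# for-loop, never changes xstring itself.
with io.open('file.txt', 'r', encoding='utf8') as infile:
    xstring = [remove_non_ascii(line) for line in infile]
```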

Best Answer

First, watch this video: https://www.youtube.com/watch?v=0Ef9GudbxXY


Now for the proper answer:

import re
import io

from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser


xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."


chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunkParser1 = RegexpParser(chunkGram1)

chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent)))
           for sent in sent_tokenize(xstring)]

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(str(chunk)+'\n\n')

[out]:

alvas@ubi:~$ python test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(str(chunk)+'\n\n')
TypeError: must be unicode, not str
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC

If you must stick with Python 2.7:

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(unicode(chunk)+'\n\n')

[out]:

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(unicode(chunk)+'\n\n')
NameError: name 'unicode' is not defined

If you have to stick with py2.7, this is strongly recommended:

from six import text_type
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk)+'\n\n')

[out]:

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
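Once the trees are on disk in this bracketed format, they can also be parsed back into Tree objects with nltk.Tree.fromstring. A sketch assuming the blank-line-separated layout written above (the file name outfile_demo.txt is made up for the example):

```python
import io
from nltk import Tree

# Write one bracketed tree, mirroring the blank-line-separated
# format the answer's script produces.
with io.open('outfile_demo.txt', 'w', encoding='utf8') as fout:
    fout.write(u'(S An/DT (Chunk electronic/JJ library/NN))\n\n')

# Split on blank lines and parse each block back into a Tree.
with io.open('outfile_demo.txt', 'r', encoding='utf8') as fin:
    blocks = [b for b in fin.read().split('\n\n') if b.strip()]

trees = [Tree.fromstring(b) for b in blocks]
```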

Regarding python - How to output NLTK chunks to a file?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28365626/
