- c - 在位数组中找到第一个零
- linux - Unix 显示有关匹配两种模式之一的文件的信息
- 正则表达式替换多个文件
- linux - 隐藏来自 xtrace 的命令
对于所有其他 NLTK 语料库,调用 corpus.raw()
会从文件中生成原始文本。例如:
>>> from nltk.corpus import webtext
>>> webtext.raw()[:10]
'Cookie Man'
但是,当调用 brown.raw()
时,您会得到标记文本。
>>> from nltk.corpus import brown
>>> brown.raw()[:10]
'\n\n\tThe/at '
我已经阅读了我能找到的所有文档,但似乎找不到明显的解释或获取未标记版本的方法。这个语料库被标记而其他语料库没有标记的原因是什么?
最佳答案
import nltk
nltk.download('brown')
nltk.download('nonbreaking_prefixes')
nltk.download('perluniprops')
from nltk.corpus import brown
from nltk.tokenize.moses import MosesDetokenizer
mdetok = MosesDetokenizer()
brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True) for sent in brown.sents()]
for sent in brown_natural:
print(sent)
这是因为 Brown 语料库的“原始”版本已被标记化和标记,即语料库带有标记,这是语料库的原始形式 =)
您可以查看nltk_data
目录中的各个文件:
$ head -n10 nltk_data/corpora/brown/ca01
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.
The/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl Court/nn-tl Judge/nn-tl Durwood/np Pye/np to/to investigate/vb reports/nns of/in possible/jj ``/`` irregularities/nns ''/'' in/in the/at hard-fought/jj primary/nn which/wdt was/bedz won/vbn by/in Mayor-nominate/nn-tl Ivan/np Allen/np Jr./np ./.
如果你想要语料库中的单词,你可以使用 brown.words()
,例如
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> ' '.join(brown.words()[:30])
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"
如果你想从一个特定的文件中获取单词:
>>> brown.fileids()[:10] # The first 10 fileids from brown.
[u'ca01', u'ca02', u'ca03', u'ca04', u'ca05', u'ca06', u'ca07', u'ca08', u'ca09', u'ca10']
>>> ' '.join(brown.words('ca01')[:30]) # First 30 words from the 'ca01' file.
u"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in"
以及来自特定文件的句子:
>>> brown.sents('ca01')
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]
要打印出单个句子:
>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
... print(' '.join(sent))
...
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .
尝试去标记化语料库相当困惑,可能会或可能不会,但您可以尝试 MosesDetokenizer
:
首先下载MosesDetokenizer需要的数据:
>>> import nltk
>>> nltk.download('perluniprops')
[nltk_data] Downloading package perluniprops to
[nltk_data] /home/ltan/nltk_data...
[nltk_data] Unzipping misc/perluniprops.zip.
True
>>> nltk.download('nonbreaking_prefixes')
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data] /home/ltan/nltk_data...
[nltk_data] Package nonbreaking_prefixes is already up-to-date!
True
然后初始化MosesDetokenizer
:
>>> from nltk.tokenize.moses import MosesDetokenizer
>>> mdetok = MosesDetokenizer()
并使用MosesDetokenizer.detokenize()
:
>>> for sent in brown.sents('ca01')[:5]: # First 5 sentences.
... # Join the words in sentences and convert the `` -> "
... # also convert '' -> " and ` -> '
... munged_sentence = ' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'")
... print(mdetok.detokenize(munged_sentence.split(), return_str=True)) # MosesDetokenizer expects a list of strings as input.
...
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
The jury further said in term-end presentments that the City Executive Committee, which had over-all charge of the election, "deserves the praise and thanks of the City of Atlanta" for the manner in which the election was conducted.
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible "irregularities" in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr..
"Only a relative handful of such reports was received", the jury said, "considering the widespread interest in the election, the number of voters and the size of this city".
The jury said it did find that many of Georgia's registration and election laws "are outmoded or inadequate and often ambiguous".
将brown
中的每个句子转换为自然阅读文本:
from nltk.tokenize.moses import MosesDetokenizer
mdetok = MosesDetokenizer()
brown_natural = [mdetok.detokenize(' '.join(sent).replace('``', '"').replace("''", '"').replace('`', "'").split(), return_str=True) for sent in brown.sents()]
[输出]:
>>> for sent in brown_natural:
... print(sent)
... break
...
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.
关于python - 如何从 Brown 语料库访问原始文档?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/47301140/
当需要将原始类型转换为字符串时,例如传递给需要字符串的方法时,基本上有两种选择。 以int为例,给出: int i; 我们可以执行以下操作之一: someStringMethod(Integer.to
我有一个位置估计数据库,并且想要计算每月的内核利用率分布。我可以使用 R 中的 adehabitat 包来完成此操作,但我想使用引导数据库中的样本来估计这些值的 95% 置信区间。今天我一直在尝试引导
我希望使用 FTP 编写大型机作业流。为此,我可以通过 FTP 连接到大型机并运行以下命令: QUOTE TYPE E QUOTE SITE FILETYPE=JES PUT myjob.jcl 那么
我是 WPF 的新手。 目前,我正在为名为“LabeledTextbox”的表单元素制作一个用户控件,其中包含一个标签、一个文本框和一个用于错误消息的文本 block 。 当使用代码添加错误消息时,我
我们正在使用 SignalR(原始版本,而不是 Core 版本)并注意到一些无法解释的行为。我们的情况如下: 我们有一个通过 GenericCommand() 方法接受命令的集线器(见下文)。 这些命
使用 requests module 时,有没有办法打印原始 HTTP 请求? 我不只想要标题,我想要请求行、标题和内容打印输出。是否可以看到最终由 HTTP 请求构造的内容? 最佳答案 Since
与直接访问现有本地磁盘或分区的物理磁盘相比,虚拟磁盘为文件存储提供更好的可移植性和效率。VMware有三种不同的磁盘类型:原始磁盘、厚磁盘和精简磁盘,它们各自分配不同的存储空间。 VMware
我有一个用一些颜色着色器等创建的门。 前段时间我拖着门,它问我该怎么办时,我选择了变体。但现在我决定选择创建原始预制件和门颜色,或者着色器变成粉红色。 这是资源中原始预制件和变体的屏幕截图。 粉红色的
我想呈现原始翻译,所以我决定在 Twig 模板中使用“原始”选项。但它不起作用。例子: {{ form_label(form.sfGuardUserProfile.roules_acceptance)
是否可以在sqlite中制作类似的东西? FOREIGN KEY(TypeCode, 'ARawValue', IdServeur) REFERENCES OTHERTABLE(TypeCode, T
这个问题是一个更具体问题的一般版本 asked here .但是,这些答案无法使用。 问题: geoIP数据的原始来源是什么? 许多网站会告诉我我的 IP 在哪里,但它们似乎都在使用来自不到 5 家公
对于Openshift:如何基于Wildfly创建docker镜像? 这是使用的Dockerfile: FROM openshift/wildfly-101-centos7 # Install exa
结果是 127 double middle = 255 / 2 虽然这产生了 127.5 Double middle = 255 / 2 同时这也会产生 127.5 double middle = (
在此处下载带有已编译可执行文件的源代码(大小:161 KB(165,230 字节)):http://www.eyeClaxton.com/download/delphi/ColorSwap.zip 原
以下几行是我需要在 lua 中使用的任意正则表达式。 ['\";=] !^(?:(?:[a-z]{3,10}\s+(?:\w{3,7}?://[\w\-\./]*(?::\d+)?)?/[^?#]*(
这个问题是一个更具体问题的一般版本 asked here .但是,这些答案无法使用。 问题: geoIP数据的原始来源是什么? 许多网站会告诉我我的 IP 在哪里,但它们似乎都在使用来自不到 5 家公
我正在使用GoLang做服务器api,试图管理和回答所发出的请求。使用net/http和github.com/gorilla/mux。 收到请求时,我使用以下结构创建响应: type Response
tl; dr:我认为我的 static_vector 有未定义的行为,但我找不到它。 这个问题是在 Microsoft Visual C++ 17 上。我有这个简单且未完成的 static_vecto
我试图找到原始 Awk (a/k/a One True Awk) 源代码的“历史”版本。我找到了 Kernighan's occasionally-updated site ,它似乎总是链接到最新版本
我在 python 中使用原始 IPv6 套接字时遇到一些问题。我通过以下方式连接: if self._socket != None: # Close out old sock
我是一名优秀的程序员,十分优秀!