
python-3.x - Using the British National Corpus in NLTK


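I am new to NLTK ( http://www.nltk.org/ ) and Python. I want to use the NLTK Python library with the BNC as the corpus. I don't believe this corpus is distributed through the NLTK data downloader. Is there a way to import the BNC corpus for use with NLTK? If so, how? I did find a class called BNCCorpusReader, but I don't know how to use it. I was also able to download the corpus from the BNC site ( http://ota.ox.ac.uk/desc/2554 ).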

http://www.nltk.org/api/nltk.corpus.reader.html?highlight=bnc#nltk.corpus.reader.BNCCorpusReader.word

Update

I tried entrophy's suggestion, but I get the following error:

raise IOError('No such file or directory: %r' % _path)
OSError: No such file or directory: 'C:\\Users\\jason\\Documents\\NetBeansProjects\\DemoCollocations\\src\\Corpora\\bnc\\A\\A0\\A00.xml'

The code I use to read the corpus:
bnc_reader = BNCCorpusReader(root="Corpora/bnc", fileids=r'[A-K]/\w*/\w*\.xml')

And the corpus is located at:
C:\Users\jason\Documents\NetBeansProjects\DemoCollocations\src\Corpora\bnc\

Best answer

For example usage of NLTK for collocation extraction, take a look at the following guide: A how-to guide by nltk on collocations extraction

As for the BNC corpus reader, all the information is in the documentation.
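
Regarding the "No such file or directory" error in the update: one possible cause (an assumption, not confirmed by the question) is that root does not point at the directory that directly contains the A/, B/, ... folders, for example when the download keeps them under a Texts subdirectory. The sketch below is a diagnostic helper only. The function name matching_fileids and the path "Corpora/bnc/Texts" are hypothetical, and the matching only approximates how a regular-expression fileids argument is resolved against root.

```python
import re
from pathlib import Path


def matching_fileids(root, fileids_pattern):
    """Return relative paths under `root` that fully match `fileids_pattern`.

    Hypothetical diagnostic helper (not part of NLTK): it approximates
    how a regex `fileids` argument is matched against paths below root.
    """
    root_path = Path(root).resolve()
    if not root_path.is_dir():
        raise FileNotFoundError(f"Corpus root is not a directory: {root_path}")
    regex = re.compile(fileids_pattern)
    # Compare POSIX-style relative paths so the same pattern works on Windows.
    relative = (p.relative_to(root_path).as_posix()
                for p in root_path.rglob("*") if p.is_file())
    return sorted(rel for rel in relative if regex.fullmatch(rel))


if __name__ == "__main__":
    # Assumed location; replace with the directory that contains A/, B/, ...
    corpus_root = "Corpora/bnc/Texts"
    try:
        found = matching_fileids(corpus_root, r"[A-K]/\w*/\w*\.xml")
    except FileNotFoundError as err:
        print(err)
    else:
        print(len(found), "files matched; first few:", found[:3])
```

If this reports zero matches or a missing directory, the root passed to BNCCorpusReader probably needs to be adjusted (or given as an absolute path) before the reader can open any file.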

from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Instantiate the reader like this
bnc_reader = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml')

# Say you want to extract all bigram collocations and later sort them
# by their frequency; this is what you would do.
# Again, take a look at the NLTK guide on collocations for more examples.

list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml']
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids))
scored = finder.score_ngrams(bigram_measures.raw_freq)

print(scored)

The output will look like this:
[(('of', 'the'), 0.004902261167963723), (('in', 'the'),0.003554139346773699), 
(('.', 'The'), 0.0034315828175746064), (('Gift', 'Aid'), 0.0019609044671854894),
((',', 'and'), 0.0018996262025859428), (('for', 'the'), 0.0018383479379863962), ... ]

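Note that score_ngrams already returns the pairs ordered from highest to lowest score. If you instead want only the bigrams, sorted alphabetically with the scores dropped, you can try something like this: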
sorted_bigrams = sorted(bigram for bigram, score in scored)

print(sorted_bigrams)

Result:
[('!', 'If'), ('!', 'Of'), ('!', 'Once'), ('!', 'Particularly'), ('!', 'Raising'), 
('!', 'YOU'), ('!', '‘'), ('&', 'Ealing'), ('&', 'Public'), ('&', 'Surrey'),
('&', 'TRAINING'), ("'", 'SPONSORED'), ("'S", 'HOME'), ("'S", 'SERVICE'), ... ]
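
The sorted(...) call above orders the bigrams alphabetically and discards their scores. If an explicit ordering by score is needed, a minimal sketch is to sort on the second element of each pair. The sample data is copied from the scored output shown earlier; the names by_score and top_bigrams are only illustrative.

```python
# Sample (bigram, score) pairs copied from the scored output shown earlier.
scored = [
    (("of", "the"), 0.004902261167963723),
    (("in", "the"), 0.003554139346773699),
    ((".", "The"), 0.0034315828175746064),
    (("Gift", "Aid"), 0.0019609044671854894),
    ((",", "and"), 0.0018996262025859428),
    (("for", "the"), 0.0018383479379863962),
]

# Sort on the score (second element of each pair), highest first.
by_score = sorted(scored, key=lambda pair: pair[1], reverse=True)

# Keep only the bigrams, now ordered by descending score.
top_bigrams = [bigram for bigram, _score in by_score]

print(top_bigrams[:3])
```

Passing reverse=False instead gives the lowest-scoring bigrams first, which can be handy for inspecting rare pairs.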

The original question, "python-3.x - Using the British National Corpus in NLTK", is on Stack Overflow: https://stackoverflow.com/questions/43506531/
