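I am trying to write a keyword extraction program using the Stanford POS tagger and NER. For keyword extraction, I am only interested in proper nouns. Here is the basic approach.
Sample code: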
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London"
words = re.split(r"\W+", docText)
stops = set(stopwords.words("english"))
# Remove stop words and very short tokens from the list
words = [w for w in words if w not in stops and len(w) > 2]
# Stemming
pstem = PorterStemmer()
words = [pstem.stem(w) for w in words]
nounsWeWant = set(['NN', 'NNS', 'NNP', 'NNPS'])
finalWords = []
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
for w in words:
    # Lowercase every word that the POS tagger does not tag as a noun
    if stp.tag([w.lower()])[0][1] not in nounsWeWant:
        finalWords.append(w.lower())
    else:
        finalWords.append(w)
finalString = " ".join(finalWords)
print(finalString)
tagged = stn.tag(finalWords)
print(tagged)
This gives me:
Jack Frost work Boe Compani manag aircraft crew London
[(u'Jack', u'PERSON'), (u'Frost', u'PERSON'), (u'work', u'O'), (u'Boe', u'O'), (u'Compani', u'O'), (u'manag', u'O'), (u'aircraft', u'O'), (u'crew', u'O'), (u'London', u'LOCATION')]
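Clearly, I did not want Boeing to be stemmed, nor Company. I need to stem the words because my input might contain terms like Performing. I have seen that a word like Performing gets picked up by the NER as a proper noun and could therefore be categorized as Organization. So first I stem all the words and convert them to lower case. Then I check whether the word's POS tag is a noun. If it is, I keep the word as is. If not, I convert the word to lower case and add it to the final word list that is passed to the NER.
Any idea how to avoid stemming proper nouns?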
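Best answer
Use the full Stanford CoreNLP pipeline to handle your NLP tool chain. Avoid using your own tokenizer, cleaner, POS tagger, and so on; they won't play well with the NER tool.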
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip
unzip stanford-corenlp-full-2015-12-09.zip
cd stanford-corenlp-full-2015-12-09
echo "Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London" > test.txt
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file test.txt
cat test.txt.out
[out]:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>Jack</word>
<lemma>Jack</lemma>
<CharacterOffsetBegin>0</CharacterOffsetBegin>
<CharacterOffsetEnd>4</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>PERSON</NER>
<Speaker>PER0</Speaker>
</token>
<token id="2">
<word>Frost</word>
<lemma>Frost</lemma>
<CharacterOffsetBegin>5</CharacterOffsetBegin>
<CharacterOffsetEnd>10</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>PERSON</NER>
<Speaker>PER0</Speaker>
</token>
<token id="3">
<word>works</word>
<lemma>work</lemma>
<CharacterOffsetBegin>11</CharacterOffsetBegin>
<CharacterOffsetEnd>16</CharacterOffsetEnd>
<POS>VBZ</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
<token id="4">
<word>for</word>
<lemma>for</lemma>
<CharacterOffsetBegin>17</CharacterOffsetBegin>
<CharacterOffsetEnd>20</CharacterOffsetEnd>
<POS>IN</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
<token id="5">
<word>Boeing</word>
<lemma>Boeing</lemma>
<CharacterOffsetBegin>21</CharacterOffsetBegin>
<CharacterOffsetEnd>27</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>ORGANIZATION</NER>
<Speaker>PER0</Speaker>
</token>
<token id="6">
<word>Company</word>
<lemma>Company</lemma>
<CharacterOffsetBegin>28</CharacterOffsetBegin>
<CharacterOffsetEnd>35</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>ORGANIZATION</NER>
<Speaker>PER0</Speaker>
</token>
<token id="7">
<word>.</word>
<lemma>.</lemma>
<CharacterOffsetBegin>35</CharacterOffsetBegin>
<CharacterOffsetEnd>36</CharacterOffsetEnd>
<POS>.</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
</tokens>
<parse>(ROOT (S (NP (NNP Jack) (NNP Frost)) (VP (VBZ works) (PP (IN for) (NP (NNP Boeing) (NNP Company)))) (. .))) </parse>
<dependencies type="basic-dependencies">
<dep type="root">
<governor idx="0">ROOT</governor>
<dependent idx="3">works</dependent>
</dep>
<dep type="compound">
<governor idx="2">Frost</governor>
<dependent idx="1">Jack</dependent>
</dep>
<dep type="nsubj">
<governor idx="3">works</governor>
<dependent idx="2">Frost</dependent>
</dep>
<dep type="case">
<governor idx="6">Company</governor>
<dependent idx="4">for</dependent>
</dep>
<dep type="compound">
<governor idx="6">Company</governor>
<dependent idx="5">Boeing</dependent>
</dep>
<dep type="nmod">
<governor idx="3">works</governor>
<dependent idx="6">Company</dependent>
</dep>
<dep type="punct">
<governor idx="3">works</governor>
<dependent idx="7">.</dependent>
</dep>
</dependencies>
<dependencies type="collapsed-dependencies">
<dep type="root">
<governor idx="0">ROOT</governor>
<dependent idx="3">works</dependent>
</dep>
<dep type="compound">
<governor idx="2">Frost</governor>
<dependent idx="1">Jack</dependent>
</dep>
<dep type="nsubj">
<governor idx="3">works</governor>
<dependent idx="2">Frost</dependent>
</dep>
<dep type="case">
<governor idx="6">Company</governor>
<dependent idx="4">for</dependent>
</dep>
<dep type="compound">
<governor idx="6">Company</governor>
<dependent idx="5">Boeing</dependent>
</dep>
<dep type="nmod:for">
<governor idx="3">works</governor>
<dependent idx="6">Company</dependent>
</dep>
<dep type="punct">
<governor idx="3">works</governor>
<dependent idx="7">.</dependent>
</dep>
</dependencies>
<dependencies type="collapsed-ccprocessed-dependencies">
<dep type="root">
<governor idx="0">ROOT</governor>
<dependent idx="3">works</dependent>
</dep>
<dep type="compound">
<governor idx="2">Frost</governor>
<dependent idx="1">Jack</dependent>
</dep>
<dep type="nsubj">
<governor idx="3">works</governor>
<dependent idx="2">Frost</dependent>
</dep>
<dep type="case">
<governor idx="6">Company</governor>
<dependent idx="4">for</dependent>
</dep>
<dep type="compound">
<governor idx="6">Company</governor>
<dependent idx="5">Boeing</dependent>
</dep>
<dep type="nmod:for">
<governor idx="3">works</governor>
<dependent idx="6">Company</dependent>
</dep>
<dep type="punct">
<governor idx="3">works</governor>
<dependent idx="7">.</dependent>
</dep>
</dependencies>
</sentence>
<sentence id="2">
<tokens>
<token id="1">
<word>He</word>
<lemma>he</lemma>
<CharacterOffsetBegin>37</CharacterOffsetBegin>
<CharacterOffsetEnd>39</CharacterOffsetEnd>
<POS>PRP</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
<token id="2">
<word>manages</word>
<lemma>manage</lemma>
<CharacterOffsetBegin>40</CharacterOffsetBegin>
<CharacterOffsetEnd>47</CharacterOffsetEnd>
<POS>VBZ</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
<token id="3">
<word>5</word>
<lemma>5</lemma>
<CharacterOffsetBegin>48</CharacterOffsetBegin>
<CharacterOffsetEnd>49</CharacterOffsetEnd>
<POS>CD</POS>
<NER>NUMBER</NER>
<NormalizedNER>5.0</NormalizedNER>
<Speaker>PER0</Speaker>
</token>
<token id="4">
<word>aircraft</word>
<lemma>aircraft</lemma>
<CharacterOffsetBegin>50</CharacterOffsetBegin>
<CharacterOffsetEnd>58</CharacterOffsetEnd>
<POS>NN</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
<token id="5">
<word>and</word>
<lemma>and</lemma>
<CharacterOffsetBegin>59</CharacterOffsetBegin>
<CharacterOffsetEnd>62</CharacterOffsetEnd>
<POS>CC</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
<token id="6">
<word>their</word>
<lemma>they</lemma>
<CharacterOffsetBegin>63</CharacterOffsetBegin>
<CharacterOffsetEnd>68</CharacterOffsetEnd>
<POS>PRP$</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
<token id="7">
<word>crew</word>
<lemma>crew</lemma>
<CharacterOffsetBegin>69</CharacterOffsetBegin>
<CharacterOffsetEnd>73</CharacterOffsetEnd>
<POS>NN</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
<token id="8">
<word>in</word>
<lemma>in</lemma>
<CharacterOffsetBegin>74</CharacterOffsetBegin>
<CharacterOffsetEnd>76</CharacterOffsetEnd>
<POS>IN</POS>
<NER>O</NER>
<Speaker>PER0</Speaker>
</token>
<token id="9">
<word>London</word>
<lemma>London</lemma>
<CharacterOffsetBegin>77</CharacterOffsetBegin>
<CharacterOffsetEnd>83</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>LOCATION</NER>
<Speaker>PER0</Speaker>
</token>
</tokens>
<parse>(ROOT (S (NP (PRP He)) (VP (VBZ manages) (NP (NP (CD 5) (NN aircraft)) (CC and) (NP (NP (PRP$ their) (NN crew)) (PP (IN in) (NP (NNP London)))))))) </parse>
<dependencies type="basic-dependencies">
<dep type="root">
<governor idx="0">ROOT</governor>
<dependent idx="2">manages</dependent>
</dep>
<dep type="nsubj">
<governor idx="2">manages</governor>
<dependent idx="1">He</dependent>
</dep>
<dep type="nummod">
<governor idx="4">aircraft</governor>
<dependent idx="3">5</dependent>
</dep>
<dep type="dobj">
<governor idx="2">manages</governor>
<dependent idx="4">aircraft</dependent>
</dep>
<dep type="cc">
<governor idx="4">aircraft</governor>
<dependent idx="5">and</dependent>
</dep>
<dep type="nmod:poss">
<governor idx="7">crew</governor>
<dependent idx="6">their</dependent>
</dep>
<dep type="conj">
<governor idx="4">aircraft</governor>
<dependent idx="7">crew</dependent>
</dep>
<dep type="case">
<governor idx="9">London</governor>
<dependent idx="8">in</dependent>
</dep>
<dep type="nmod">
<governor idx="7">crew</governor>
<dependent idx="9">London</dependent>
</dep>
</dependencies>
<dependencies type="collapsed-dependencies">
<dep type="root">
<governor idx="0">ROOT</governor>
<dependent idx="2">manages</dependent>
</dep>
<dep type="nsubj">
<governor idx="2">manages</governor>
<dependent idx="1">He</dependent>
</dep>
<dep type="nummod">
<governor idx="4">aircraft</governor>
<dependent idx="3">5</dependent>
</dep>
<dep type="dobj">
<governor idx="2">manages</governor>
<dependent idx="4">aircraft</dependent>
</dep>
<dep type="cc">
<governor idx="4">aircraft</governor>
<dependent idx="5">and</dependent>
</dep>
<dep type="nmod:poss">
<governor idx="7">crew</governor>
<dependent idx="6">their</dependent>
</dep>
<dep type="conj:and">
<governor idx="4">aircraft</governor>
<dependent idx="7">crew</dependent>
</dep>
<dep type="case">
<governor idx="9">London</governor>
<dependent idx="8">in</dependent>
</dep>
<dep type="nmod:in">
<governor idx="7">crew</governor>
<dependent idx="9">London</dependent>
</dep>
</dependencies>
<dependencies type="collapsed-ccprocessed-dependencies">
<dep type="root">
<governor idx="0">ROOT</governor>
<dependent idx="2">manages</dependent>
</dep>
<dep type="nsubj">
<governor idx="2">manages</governor>
<dependent idx="1">He</dependent>
</dep>
<dep type="nummod">
<governor idx="4">aircraft</governor>
<dependent idx="3">5</dependent>
</dep>
<dep type="dobj">
<governor idx="2">manages</governor>
<dependent idx="4">aircraft</dependent>
</dep>
<dep type="cc">
<governor idx="4">aircraft</governor>
<dependent idx="5">and</dependent>
</dep>
<dep type="nmod:poss">
<governor idx="7">crew</governor>
<dependent idx="6">their</dependent>
</dep>
<dep type="dobj" extra="true">
<governor idx="2">manages</governor>
<dependent idx="7">crew</dependent>
</dep>
<dep type="conj:and">
<governor idx="4">aircraft</governor>
<dependent idx="7">crew</dependent>
</dep>
<dep type="case">
<governor idx="9">London</governor>
<dependent idx="8">in</dependent>
</dep>
<dep type="nmod:in">
<governor idx="7">crew</governor>
<dependent idx="9">London</dependent>
</dep>
</dependencies>
</sentence>
</sentences>
<coreference>
<coreference>
<mention representative="true">
<sentence>1</sentence>
<start>1</start>
<end>3</end>
<head>2</head>
<text>Jack Frost</text>
</mention>
<mention>
<sentence>2</sentence>
<start>1</start>
<end>2</end>
<head>1</head>
<text>He</text>
</mention>
</coreference>
</coreference>
</document>
</root>
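If the XML route is used, the per-token fields can be read back with Python's standard library. The following is only a minimal sketch: `SAMPLE_XML` is a trimmed copy of two tokens from the output above, and `read_tokens` is a hypothetical helper name, not part of CoreNLP.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML above (two tokens only), used as sample input.
SAMPLE_XML = """<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="3">
            <word>works</word>
            <lemma>work</lemma>
            <POS>VBZ</POS>
            <NER>O</NER>
          </token>
          <token id="5">
            <word>Boeing</word>
            <lemma>Boeing</lemma>
            <POS>NNP</POS>
            <NER>ORGANIZATION</NER>
          </token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>"""


def read_tokens(xml_text):
    """Return one (word, lemma, pos, ner) tuple per <token> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for token in root.iter("token"):
        rows.append((
            token.findtext("word"),
            token.findtext("lemma"),
            token.findtext("POS"),
            token.findtext("NER"),
        ))
    return rows


tokens = read_tokens(SAMPLE_XML)
for row in tokens:
    print(row)
```

To read the real file instead of the inline sample, `ET.parse("test.txt.out").getroot()` should give the same kind of root element to iterate over.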
Or, to get JSON output:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file test.txt -outputFormat json
If you really need a Python wrapper, see https://github.com/smilli/py-corenlp
$ cd stanford-corenlp-full-2015-12-09
$ export CLASSPATH=protobuf.jar:joda-time.jar:jollyday.jar:xom-1.2.10.jar:stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:slf4j-api.jar
$ java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer &
$ cd
$ git clone https://github.com/smilli/py-corenlp.git
$ cd py-corenlp
$ python
>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ("Jack Frost works for Boeing Company. He manages 5 aircraft and their crew in London")
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'})
>>> output
{u'sentences': [{u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 0, u'tokens': [{u'index': 1, u'word': u'Jack', u'lemma': u'Jack', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 4, u'characterOffsetBegin': 0, u'originalText': u'Jack', u'ner': u'PERSON', u'before': u''}, {u'index': 2, u'word': u'Frost', u'lemma': u'Frost', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 10, u'characterOffsetBegin': 5, u'originalText': u'Frost', u'ner': u'PERSON', u'before': u' '}, {u'index': 3, u'word': u'works', u'lemma': u'work', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 16, u'characterOffsetBegin': 11, u'originalText': u'works', u'ner': u'O', u'before': u' '}, {u'index': 4, u'word': u'for', u'lemma': u'for', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 20, u'characterOffsetBegin': 17, u'originalText': u'for', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'Boeing', u'lemma': u'Boeing', u'after': u' ', u'pos': u'NNP', u'characterOffsetEnd': 27, u'characterOffsetBegin': 21, u'originalText': u'Boeing', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 6, u'word': u'Company', u'lemma': u'Company', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 35, u'characterOffsetBegin': 28, u'originalText': u'Company', u'ner': u'ORGANIZATION', u'before': u' '}, {u'index': 7, u'word': u'.', u'lemma': u'.', u'after': u' ', u'pos': u'.', u'characterOffsetEnd': 36, u'characterOffsetBegin': 35, u'originalText': u'.', u'ner': u'O', u'before': u''}]}, {u'parse': u'SENTENCE_SKIPPED_OR_UNPARSABLE', u'index': 1, u'tokens': [{u'index': 1, u'word': u'He', u'lemma': u'he', u'after': u' ', u'pos': u'PRP', u'characterOffsetEnd': 39, u'characterOffsetBegin': 37, u'originalText': u'He', u'ner': u'O', u'before': u' '}, {u'index': 2, u'word': u'manages', u'lemma': u'manage', u'after': u' ', u'pos': u'VBZ', u'characterOffsetEnd': 47, u'characterOffsetBegin': 40, u'originalText': u'manages', u'ner': u'O', u'before': u' '}, {u'index': 3, u'after': u' ', 
u'word': u'5', u'lemma': u'5', u'normalizedNER': u'5.0', u'pos': u'CD', u'characterOffsetEnd': 49, u'characterOffsetBegin': 48, u'originalText': u'5', u'ner': u'NUMBER', u'before': u' '}, {u'index': 4, u'word': u'aircraft', u'lemma': u'aircraft', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 58, u'characterOffsetBegin': 50, u'originalText': u'aircraft', u'ner': u'O', u'before': u' '}, {u'index': 5, u'word': u'and', u'lemma': u'and', u'after': u' ', u'pos': u'CC', u'characterOffsetEnd': 62, u'characterOffsetBegin': 59, u'originalText': u'and', u'ner': u'O', u'before': u' '}, {u'index': 6, u'word': u'their', u'lemma': u'they', u'after': u' ', u'pos': u'PRP$', u'characterOffsetEnd': 68, u'characterOffsetBegin': 63, u'originalText': u'their', u'ner': u'O', u'before': u' '}, {u'index': 7, u'word': u'crew', u'lemma': u'crew', u'after': u' ', u'pos': u'NN', u'characterOffsetEnd': 73, u'characterOffsetBegin': 69, u'originalText': u'crew', u'ner': u'O', u'before': u' '}, {u'index': 8, u'word': u'in', u'lemma': u'in', u'after': u' ', u'pos': u'IN', u'characterOffsetEnd': 76, u'characterOffsetBegin': 74, u'originalText': u'in', u'ner': u'O', u'before': u' '}, {u'index': 9, u'word': u'London', u'lemma': u'London', u'after': u'', u'pos': u'NNP', u'characterOffsetEnd': 83, u'characterOffsetBegin': 77, u'originalText': u'London', u'ner': u'LOCATION', u'before': u' '}]}]}
>>> annotated_sent0 = output['sentences'][0]
>>> for token in annotated_sent0['tokens']:
...     print token['word'], token['lemma'], token['pos'], token['ner']
...
Jack Jack NNP PERSON
Frost Frost NNP PERSON
works work VBZ O
for for IN O
Boeing Boeing NNP ORGANIZATION
Company Company NNP ORGANIZATION
. . . O
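Since the NER labels come back one per token, multi-word names such as Boeing Company still have to be stitched together. A rough sketch of one way to do that, assuming token dicts shaped like the ones in the output above (fields trimmed here); `group_entities` is a hypothetical helper, not something py-corenlp provides:

```python
from itertools import groupby

# Token dicts copied (fields trimmed) from the first sentence of the output above.
tokens = [
    {"word": "Jack", "ner": "PERSON"},
    {"word": "Frost", "ner": "PERSON"},
    {"word": "works", "ner": "O"},
    {"word": "for", "ner": "O"},
    {"word": "Boeing", "ner": "ORGANIZATION"},
    {"word": "Company", "ner": "ORGANIZATION"},
    {"word": ".", "ner": "O"},
]


def group_entities(tokens):
    """Merge runs of adjacent tokens that share the same non-'O' NER label."""
    entities = []
    for label, run in groupby(tokens, key=lambda t: t["ner"]):
        if label != "O":
            entities.append((" ".join(t["word"] for t in run), label))
    return entities


entities = group_entities(tokens)
print(entities)
```

One caveat of this simple grouping: two different entities of the same type that happen to be adjacent would be merged into a single phrase.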
This is probably the output you want:
>>> " ".join(token['lemma'] for token in annotated_sent0['tokens'])
Jack Frost work for Boeing Company
>>> " ".join(token['word'] for token in annotated_sent0['tokens'])
Jack Frost works for Boeing Company
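Coming back to the original question, the POS tags make it possible to normalize only the words that are not proper nouns. The sketch below is an assumption about how that could be done, not part of CoreNLP or py-corenlp: it keeps the surface form for NNP/NNPS tokens, uses the lowercased lemma for other nouns, verbs and adjectives, and drops everything else. The token dicts are copied (fields trimmed) from the output above; `keywords` is a hypothetical helper name.

```python
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
# Penn Treebank tag prefixes for nouns, verbs and adjectives.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")

# word / lemma / pos values copied from the annotated output above.
tokens = [
    {"word": "Jack", "lemma": "Jack", "pos": "NNP"},
    {"word": "Frost", "lemma": "Frost", "pos": "NNP"},
    {"word": "works", "lemma": "work", "pos": "VBZ"},
    {"word": "for", "lemma": "for", "pos": "IN"},
    {"word": "Boeing", "lemma": "Boeing", "pos": "NNP"},
    {"word": "Company", "lemma": "Company", "pos": "NNP"},
    {"word": ".", "lemma": ".", "pos": "."},
    {"word": "He", "lemma": "he", "pos": "PRP"},
    {"word": "manages", "lemma": "manage", "pos": "VBZ"},
    {"word": "5", "lemma": "5", "pos": "CD"},
    {"word": "aircraft", "lemma": "aircraft", "pos": "NN"},
    {"word": "and", "lemma": "and", "pos": "CC"},
    {"word": "their", "lemma": "they", "pos": "PRP$"},
    {"word": "crew", "lemma": "crew", "pos": "NN"},
    {"word": "in", "lemma": "in", "pos": "IN"},
    {"word": "London", "lemma": "London", "pos": "NNP"},
]


def keywords(tokens):
    """Keep proper nouns verbatim; lemmatize other content words; drop the rest."""
    result = []
    for token in tokens:
        if token["pos"] in PROPER_NOUN_TAGS:
            result.append(token["word"])
        elif token["pos"].startswith(CONTENT_TAG_PREFIXES):
            result.append(token["lemma"].lower())
    return result


kept = keywords(tokens)
print(" ".join(kept))
```

Because the tags were assigned on the full sentence, Boeing and Company are kept untouched, while works and manages are reduced to their lemmas rather than to Porter stems.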
If you want a wrapper that ships with NLTK, you'll have to wait a little longer until this issue is resolved ;P
Regarding "python - nltk: How to prevent stemming of proper nouns", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34455749/