
java - A surprising tag from the Stanford POS Tagger


I ran the Stanford POS Tagger on the following text (from a Times of India news report on the Indian Premier League player auction):

Royal Challengers Bangalore are used to making strong statements at the Indian Premier League auctions and they did so again on Saturday (February 6) with the marquee signing of seasoned Australian all-rounder Shane Watson. The staggering Rs 9.5 crore that the team paid for the 34-year-old made him the costliest buy this year.

The Vijay Mallya-owned side fought off stiff competition from new entrants Rising Pune Supergiants and defending champions Mumbai Indians to snare the former Rajasthan Royals star. Watson, a battling right-handed batsman and handy medium-pacer, will add serious bite to the Virat Kohli-led Bengaluru side still chasing their maiden title.

For the last sentence, in the second paragraph, the Stanford POS Tagger tags the first word, 'Watson', as a base-form verb (VB)! I searched Chambers' Twentieth Century Dictionary to see whether 'watson' is a verb, but I could not find such an entry!

Some functions I ran in my code produced the following output:

Watson,VB aDT battlingVBG rightJJ handedNN batsmanNN andCC handyJJ mediumNN pacer,NN willMD addVB seriousJJ biteNN toTO theDT ViratNNP KohliNNP ledVBD BengaluruNNP sideNN stillRB chasingVBG theirPRP$ maidenJJ title.NN

Best Answer

The problem seems to be that you are not tokenizing your text before POS tagging it.

As @ChristopherManning pointed out, the Stanford POS tagger's output is correct if you tokenize the text before tagging it.
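Since the question is about the Java API, here is a minimal tokenize-then-tag sketch using the standalone MaxentTagger. The class and method names are the standard tagger API; the wrapper class TokenizeThenTag and the sample sentence are illustrative only, and the model path is the one loaded in the log further down.

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

import java.io.StringReader;
import java.util.List;

public class TokenizeThenTag {
    public static void main(String[] args) {
        // Model path taken from the log output below; adjust to your local copy.
        MaxentTagger tagger = new MaxentTagger(
            "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");

        String text = "Watson, a battling right-handed batsman and handy medium-pacer, "
            + "will add serious bite to the Virat Kohli-led Bengaluru side still chasing their maiden title.";

        // tokenizeText splits the raw string into sentences of clean tokens,
        // so "Watson," becomes the two tokens "Watson" and ",".
        List<List<HasWord>> sentences = MaxentTagger.tokenizeText(new StringReader(text));
        for (List<HasWord> sentence : sentences) {
            List<TaggedWord> tagged = tagger.tagSentence(sentence);
            System.out.println(tagged);   // with the comma split off, Watson comes out as Watson/NNP
        }
    }
}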

Using CoreNLP on the command line:

alvas@ubi:~/stanford-corenlp-full-2015-12-09$ echo """Royal Challengers Bangalore are used to making strong statements at the Indian Premier League auctions and they did so again on Saturday (February 6) with the marquee signing of seasoned Australian all-rounder Shane Watson. The staggering Rs 9.5 crore that the team paid for the 34-year-old made him the costliest buy this year.

The Vijay Mallya-owned side fought off stiff competition from new entrants Rising Pune Supergiants and defending champions Mumbai Indians to snare the former Rajasthan Royals star. Watson, a battling right-handed batsman and handy medium-pacer, will add serious bite to the Virat Kohli-led Bengaluru side still chasing their maiden title.""" > watson.txt
alvas@ubi:~/stanford-corenlp-full-2015-12-09$
alvas@ubi:~/stanford-corenlp-full-2015-12-09$ java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat json -file watson.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.6 sec].

Processing file /home/alvas/stanford-corenlp-full-2015-12-09/watson.txt ... writing to /home/alvas/stanford-corenlp-full-2015-12-09/watson.txt.json
Annotating file /home/alvas/stanford-corenlp-full-2015-12-09/watson.txt
done.
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.1 sec.
TOTAL: 0.1 sec. for 110 tokens at 791.4 tokens/sec.
Pipeline setup: 1.6 sec.
Total time for StanfordCoreNLP pipeline: 1.9 sec

The output is written to watson.txt.json; with a little post-processing:

>>> import json
>>> with open('watson.txt.json') as fin:
... output = json.load(fin)
...
>>> for sent in output['sentences']:
... print ' '.join([tok['word']+'/'+tok['pos'] for tok in sent['tokens']]) + '\n'
...

Royal/NNP Challengers/NNS Bangalore/NNP are/VBP used/VBN to/TO making/VBG strong/JJ statements/NNS at/IN the/DT Indian/JJ Premier/NNP League/NNP auctions/NNS and/CC they/PRP did/VBD so/RB again/RB on/IN Saturday/NNP -LRB-/-LRB- February/NNP 6/CD -RRB-/-RRB- with/IN the/DT marquee/JJ signing/NN of/IN seasoned/JJ Australian/JJ all-rounder/NN Shane/NNP Watson/NNP ./.

The/DT staggering/JJ Rs/NN 9.5/CD crore/VBP that/IN the/DT team/NN paid/VBN for/IN the/DT 34-year-old/JJ made/VBD him/PRP the/DT costliest/JJS buy/VB this/DT year/NN ./.

The/DT Vijay/NNP Mallya-owned/JJ side/NN fought/VBD off/RP stiff/JJ competition/NN from/IN new/JJ entrants/NNS Rising/VBG Pune/NNP Supergiants/NNPS and/CC defending/VBG champions/NNS Mumbai/NNP Indians/NNPS to/TO snare/VB the/DT former/JJ Rajasthan/NNP Royals/NNPS star/NN ./.

Watson/NNP ,/, a/DT battling/VBG right-handed/JJ batsman/NN and/CC handy/JJ medium-pacer/NN ,/, will/MD add/VB serious/JJ bite/NN to/TO the/DT Virat/NNP Kohli-led/NNP Bengaluru/NNP side/NN still/RB chasing/VBG their/PRP$ maiden/JJ title/NN ./.

Note that if you use Stanford CoreNLP on the command line, it will not let you run the POS tagger without tokenization:

alvas@ubi:~/stanford-corenlp-full-2015-12-09$ java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators pos -outputFormat json -file watson.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.4 sec].
Exception in thread "main" java.lang.IllegalArgumentException: annotator "pos" requires annotator "tokenize"
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:375)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:139)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:135)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1214)

Whether you use the Stanford POS tagger through the GUI, the command line, the Python API, or by importing the library directly into your Java code, the recommendation is to first split your text into sentences, then tokenize each sentence into words, and only then POS-tag it.

The Stanford CoreNLP API documentation shows how to annotate data from Java: http://stanfordnlp.github.io/CoreNLP/api.html
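As a minimal sketch of that pipeline API, assuming the stanford-corenlp jar and its models jar are on the classpath (the wrapper class name PosPipelineExample is illustrative), the same tokenize,ssplit,pos pipeline used on the command line above can be driven from Java like this:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class PosPipelineExample {
    public static void main(String[] args) {
        // Sentence splitting and tokenization run before tagging, mirroring
        // the "-annotators tokenize,ssplit,pos" command line above.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation(
            "Watson, a battling right-handed batsman and handy medium-pacer, "
            + "will add serious bite to the Virat Kohli-led Bengaluru side still chasing their maiden title.");
        pipeline.annotate(document);

        // Print each sentence as word/POS pairs, like the JSON post-processing above.
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                System.out.print(token.word() + "/" + token.tag() + " ");
            }
            System.out.println();
        }
    }
}

Because the tokenize and ssplit annotators run before pos, the tagger only ever sees clean tokens, which is exactly what the command-line run above enforces.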

About java - A surprising tag from the Stanford POS Tagger, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35362808/
