java - A surprising tag from Stanford-POS-Tagger

Reposted by: 行者123. Last updated: 2023-11-29 07:36:21

I ran the Stanford POS Tagger on the following text (from a Times of India news report on the Indian Premier League player auction):

Royal Challengers Bangalore are used to making strong statements at the Indian Premier League auctions and they did so again on Saturday (February 6) with the marquee signing of seasoned Australian all-rounder Shane Watson. The staggering Rs 9.5 crore that the team paid for the 34-year-old made him the costliest buy this year.

The Vijay Mallya-owned side fought off stiff competition from new entrants Rising Pune Supergiants and defending champions Mumbai Indians to snare the former Rajasthan Royals star. Watson, a battling right-handed batsman and handy medium-pacer, will add serious bite to the Virat Kohli-led Bengaluru side still chasing their maiden title.

For the last sentence of the second paragraph, the Stanford POS Tagger tagged the first word, 'Watson', as a base-form verb (VB)! I searched Chambers' Twentieth Century Dictionary to see whether 'watson' is a verb, but found no such entry.

Some functions I ran in my code produced the following output:

Watson,VB aDT battlingVBG rightJJ handedNN batsmanNN andCC handyJJ mediumNN pacer,NN willMD addVB seriousJJ biteNN toTO theDT ViratNNP KohliNNP ledVBD BengaluruNNP sideNN stillRB chasingVBG theirPRP$ maidenJJ title.NN

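Best answer

The problem seems to be that you did not tokenize your text before POS tagging it.

As @ChristopherManning showed, if you tokenize the text before tagging, the Stanford POS tagger's output is correct.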
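
To see why the missing tokenization step matters, the sketch below splits the problem sentence on whitespace only, which is presumably close to what the tagger received in the question. It is plain Python with no CoreNLP involved, and the helper name `tokens_with_attached_punctuation` is made up for illustration.

```python
import string

SENTENCE = (
    "Watson, a battling right-handed batsman and handy medium-pacer, "
    "will add serious bite to the Virat Kohli-led Bengaluru side "
    "still chasing their maiden title."
)

def tokens_with_attached_punctuation(text):
    """Return whitespace-separated tokens that end in a punctuation mark."""
    return [tok for tok in text.split() if tok[-1] in string.punctuation]

# Tokens that still carry a trailing comma or full stop.
print(tokens_with_attached_punctuation(SENTENCE))
# ['Watson,', 'medium-pacer,', 'title.']
```

A token such as `Watson,` is unlikely to be in the tagger's vocabulary, so its tag probably has to be guessed from context alone. That would explain the `Watson,VB` pair in the question's output.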

Using CoreNLP on the command line:

alvas@ubi:~/stanford-corenlp-full-2015-12-09$ echo """Royal Challengers Bangalore are used to making strong statements at the Indian Premier League auctions and they did so again on Saturday (February 6) with the marquee signing of seasoned Australian all-rounder Shane Watson. The staggering Rs 9.5 crore that the team paid for the 34-year-old made him the costliest buy this year.

The Vijay Mallya-owned side fought off stiff competition from new entrants Rising Pune Supergiants and defending champions Mumbai Indians to snare the former Rajasthan Royals star. Watson, a battling right-handed batsman and handy medium-pacer, will add serious bite to the Virat Kohli-led Bengaluru side still chasing their maiden title.""" > watson.txt
alvas@ubi:~/stanford-corenlp-full-2015-12-09$ 
alvas@ubi:~/stanford-corenlp-full-2015-12-09$ java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat json -file watson.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.6 sec].

Processing file /home/alvas/stanford-corenlp-full-2015-12-09/watson.txt ... writing to /home/alvas/stanford-corenlp-full-2015-12-09/watson.txt.json
Annotating file /home/alvas/stanford-corenlp-full-2015-12-09/watson.txt
done.
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.1 sec.
TOTAL: 0.1 sec. for 110 tokens at 791.4 tokens/sec.
Pipeline setup: 1.6 sec.
Total time for StanfordCoreNLP pipeline: 1.9 sec

The output is saved in watson.txt.json. With a little munging:

>>> import json
>>> with open('watson.txt.json') as fin:
...     output = json.load(fin)
... 
>>> for sent in output['sentences']:
...     print(' '.join(tok['word'] + '/' + tok['pos'] for tok in sent['tokens']) + '\n')
... 

Royal/NNP Challengers/NNS Bangalore/NNP are/VBP used/VBN to/TO making/VBG strong/JJ statements/NNS at/IN the/DT Indian/JJ Premier/NNP League/NNP auctions/NNS and/CC they/PRP did/VBD so/RB again/RB on/IN Saturday/NNP -LRB-/-LRB- February/NNP 6/CD -RRB-/-RRB- with/IN the/DT marquee/JJ signing/NN of/IN seasoned/JJ Australian/JJ all-rounder/NN Shane/NNP Watson/NNP ./.

The/DT staggering/JJ Rs/NN 9.5/CD crore/VBP that/IN the/DT team/NN paid/VBN for/IN the/DT 34-year-old/JJ made/VBD him/PRP the/DT costliest/JJS buy/VB this/DT year/NN ./.

The/DT Vijay/NNP Mallya-owned/JJ side/NN fought/VBD off/RP stiff/JJ competition/NN from/IN new/JJ entrants/NNS Rising/VBG Pune/NNP Supergiants/NNPS and/CC defending/VBG champions/NNS Mumbai/NNP Indians/NNPS to/TO snare/VB the/DT former/JJ Rajasthan/NNP Royals/NNPS star/NN ./.

Watson/NNP ,/, a/DT battling/VBG right-handed/JJ batsman/NN and/CC handy/JJ medium-pacer/NN ,/, will/MD add/VB serious/JJ bite/NN to/TO the/DT Virat/NNP Kohli-led/NNP Bengaluru/NNP side/NN still/RB chasing/VBG their/PRP$ maiden/JJ title/NN ./.
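
To check a single word rather than reading the whole dump, a small lookup over the same JSON structure can help. This is a minimal sketch: `tags_for` is a hypothetical helper, and `sample` is a hand-written excerpt that mirrors only the `sentences`, `tokens`, `word` and `pos` keys used in the snippet above, not the full watson.txt.json file.

```python
def tags_for(output, word):
    """Collect the POS tags assigned to every occurrence of `word`."""
    return [
        tok["pos"]
        for sent in output["sentences"]
        for tok in sent["tokens"]
        if tok["word"] == word
    ]

# Hand-written excerpt shaped like watson.txt.json (only the keys used above).
sample = {
    "sentences": [
        {"tokens": [
            {"word": "Shane", "pos": "NNP"},
            {"word": "Watson", "pos": "NNP"},
            {"word": ".", "pos": "."},
        ]},
        {"tokens": [
            {"word": "Watson", "pos": "NNP"},
            {"word": ",", "pos": ","},
            {"word": "a", "pos": "DT"},
        ]},
    ]
}

print(tags_for(sample, "Watson"))  # ['NNP', 'NNP']
```

With the real file, `output` from `json.load(fin)` would be passed in place of `sample`.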

Note that if you use Stanford CoreNLP on the command line, it will not let you run POS tagging without tokenization:

alvas@ubi:~/stanford-corenlp-full-2015-12-09$ java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators pos -outputFormat json -file watson.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.4 sec].
Exception in thread "main" java.lang.IllegalArgumentException: annotator "pos" requires annotator "tokenize"
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:375)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:139)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:135)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1214)
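
One way to avoid this error when scripting the command is to fill in the prerequisites before building the argument list. The sketch below is only an illustration: `REQUIRES`, `with_prerequisites` and `corenlp_command` are hypothetical names, and the prerequisite table is an assumption rather than CoreNLP's own dependency data. Only "pos requires tokenize" is confirmed by the error message; `ssplit` is added because the working command in this answer runs `tokenize,ssplit,pos`.

```python
# Prerequisites assumed for illustration; only "pos needs tokenize" is
# confirmed by the error message. "ssplit" is included because the working
# command in this answer runs tokenize,ssplit,pos.
REQUIRES = {"pos": ["tokenize", "ssplit"], "ssplit": ["tokenize"]}

def with_prerequisites(annotators):
    """Return the annotator list with assumed prerequisites inserted first."""
    ordered = []

    def add(name):
        for dep in REQUIRES.get(name, []):
            add(dep)
        if name not in ordered:
            ordered.append(name)

    for name in annotators:
        add(name)
    return ordered

def corenlp_command(annotators, path):
    """Build the argument list for the CoreNLP command line used in this answer."""
    return [
        "java", "-cp", "*", "-Xmx2g",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", ",".join(with_prerequisites(annotators)),
        "-outputFormat", "json",
        "-file", path,
    ]

print(corenlp_command(["pos"], "watson.txt"))
```

The resulting list could be handed to `subprocess.run(cmd, cwd=..., check=True)`, with `cwd` pointing at the unpacked CoreNLP directory so that the `*` classpath picks up the jars.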

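Whether you use the Stanford POS tagger through the GUI, the command line, the Python API, or directly by importing the library in Java code, it is advisable to sentence-tokenize your text and then word-tokenize each sentence before POS tagging it.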
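
The order of operations can be sketched in a few lines of plain Python. The two regex helpers below are deliberately naive stand-ins, written for illustration only. They are not the CoreNLP sentence splitter or PTBTokenizer, and real text will need a proper tokenizer.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize_words(sentence):
    """Naive word tokenizer: keep hyphenated words and decimals whole,
    and split every other punctuation mark into its own token."""
    return re.findall(r"\w+(?:[-.]\w+)*|[^\w\s]", sentence)

text = (
    "The team paid Rs 9.5 crore for the 34-year-old. "
    "Watson, a handy medium-pacer, will add serious bite."
)

# Step 1: sentences. Step 2: words per sentence. Step 3 (not shown): tag.
for sentence in split_sentences(text):
    print(tokenize_words(sentence))
```

Each printed list is what a tagger would receive for one sentence, with `Watson` and `,` as separate tokens.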

The Stanford CoreNLP API page gives an example of how to annotate data in Java: http://stanfordnlp.github.io/CoreNLP/api.html

Regarding "java - A surprising tag from Stanford-POS-Tagger", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35362808/
