java - Finding features in a large data set with Stanford CoreNLP

Reposted by: 行者123 Updated: 2023-11-30 07:43:40

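I am new to Stanford NLP and could not find any good, complete documentation or tutorial. My task is sentiment analysis. I have a very large data set of product reviews, which I have already split into positive and negative based on the star ratings given by users. Now I need to find the most frequently occurring positive and negative adjectives to use as features for my algorithm. I learned how to do tokenization, lemmatization, and POS tagging from here. I have files like this.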

The review is:

Don't waste your money. This is a short DVD and the host is boring and offers information that is common sense to any idiot. Pass on this and buy something else. Very generic

The output is:

Sentence #1 (6 tokens):
Don't waste your money.
[Text=Do CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=VBP Lemma=do]
[Text=n't CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=RB Lemma=not]
[Text=waste CharacterOffsetBegin=6 CharacterOffsetEnd=11 PartOfSpeech=VB Lemma=waste]
[Text=your CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=PRP$ Lemma=you]
[Text=money CharacterOffsetBegin=17 CharacterOffsetEnd=22 PartOfSpeech=NN Lemma=money]
[Text=. CharacterOffsetBegin=22 CharacterOffsetEnd=23 PartOfSpeech=. Lemma=.]
Sentence #2 (21 tokens):
This is a short DVD and the host is boring and offers information that is common sense to any idiot.
[Text=This CharacterOffsetBegin=24 CharacterOffsetEnd=28 PartOfSpeech=DT Lemma=this]
[Text=is CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=VBZ Lemma=be]
[Text=a CharacterOffsetBegin=32 CharacterOffsetEnd=33 PartOfSpeech=DT Lemma=a]
[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]
[Text=and CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=CC Lemma=and]
[Text=the CharacterOffsetBegin=48 CharacterOffsetEnd=51 PartOfSpeech=DT Lemma=the]
[Text=host CharacterOffsetBegin=52 CharacterOffsetEnd=56 PartOfSpeech=NN Lemma=host]
[Text=is CharacterOffsetBegin=57 CharacterOffsetEnd=59 PartOfSpeech=VBZ Lemma=be]
[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]
[Text=and CharacterOffsetBegin=67 CharacterOffsetEnd=70 PartOfSpeech=CC Lemma=and]
[Text=offers CharacterOffsetBegin=71 CharacterOffsetEnd=77 PartOfSpeech=VBZ Lemma=offer]
[Text=information CharacterOffsetBegin=78 CharacterOffsetEnd=89 PartOfSpeech=NN Lemma=information]
[Text=that CharacterOffsetBegin=90 CharacterOffsetEnd=94 PartOfSpeech=WDT Lemma=that]
[Text=is CharacterOffsetBegin=95 CharacterOffsetEnd=97 PartOfSpeech=VBZ Lemma=be]
[Text=common CharacterOffsetBegin=98 CharacterOffsetEnd=104 PartOfSpeech=JJ Lemma=common]
[Text=sense CharacterOffsetBegin=105 CharacterOffsetEnd=110 PartOfSpeech=NN Lemma=sense]
[Text=to CharacterOffsetBegin=111 CharacterOffsetEnd=113 PartOfSpeech=TO Lemma=to]
[Text=any CharacterOffsetBegin=114 CharacterOffsetEnd=117 PartOfSpeech=DT Lemma=any]
[Text=idiot CharacterOffsetBegin=118 CharacterOffsetEnd=123 PartOfSpeech=NN Lemma=idiot]
[Text=. CharacterOffsetBegin=123 CharacterOffsetEnd=124 PartOfSpeech=. Lemma=.]
Sentence #3 (8 tokens):
Pass on this and buy something else.
[Text=Pass CharacterOffsetBegin=125 CharacterOffsetEnd=129 PartOfSpeech=VB Lemma=pass]
[Text=on CharacterOffsetBegin=130 CharacterOffsetEnd=132 PartOfSpeech=IN Lemma=on]
[Text=this CharacterOffsetBegin=133 CharacterOffsetEnd=137 PartOfSpeech=DT Lemma=this]
[Text=and CharacterOffsetBegin=138 CharacterOffsetEnd=141 PartOfSpeech=CC Lemma=and]
[Text=buy CharacterOffsetBegin=142 CharacterOffsetEnd=145 PartOfSpeech=VB Lemma=buy]
[Text=something CharacterOffsetBegin=146 CharacterOffsetEnd=155 PartOfSpeech=NN Lemma=something]
[Text=else CharacterOffsetBegin=156 CharacterOffsetEnd=160 PartOfSpeech=RB Lemma=else]
[Text=. CharacterOffsetBegin=160 CharacterOffsetEnd=161 PartOfSpeech=. Lemma=.]
Sentence #4 (2 tokens):
Very generic
[Text=Very CharacterOffsetBegin=162 CharacterOffsetEnd=166 PartOfSpeech=RB Lemma=very]
[Text=generic CharacterOffsetBegin=167 CharacterOffsetEnd=174 PartOfSpeech=JJ Lemma=generic]

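I have processed 10,000 positive and 10,000 negative files like this. How can I now easily find the most frequently occurring positive and negative features (adjectives)? Do I need to read all of the processed output files and build a frequency count of the adjectives myself, or does Stanford CoreNLP offer an easier way?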
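For the first option mentioned in the question, reading the already-processed output files, a minimal sketch could look like the following. It assumes every token line has exactly the `[Text=... PartOfSpeech=... Lemma=...]` layout shown above. The class name `ProcessedOutputAdjectives` and the method `countAdjectives` are hypothetical names chosen for illustration, and the code uses only the JDK rather than any CoreNLP API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProcessedOutputAdjectives {

    // Matches token lines such as:
    // [Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
    // Group 1 = surface text, group 2 = POS tag, group 3 = lemma.
    private static final Pattern TOKEN_LINE =
            Pattern.compile("^\\[Text=(\\S+) .*PartOfSpeech=(\\S+) Lemma=(\\S+)\\]$");

    /**
     * Counts the lemmas of all tokens whose POS tag starts with "JJ"
     * (JJ, JJR, JJS). Lines that are not token lines are skipped.
     */
    public static Map<String, Integer> countAdjectives(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = TOKEN_LINE.matcher(line.trim());
            if (m.matches() && m.group(2).startsWith("JJ")) {
                counts.merge(m.group(3).toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sentence #4 from the output above, used as sample input.
        List<String> sample = List.of(
                "Sentence #4 (2 tokens):",
                "Very generic",
                "[Text=Very CharacterOffsetBegin=162 CharacterOffsetEnd=166 PartOfSpeech=RB Lemma=very]",
                "[Text=generic CharacterOffsetBegin=167 CharacterOffsetEnd=174 PartOfSpeech=JJ Lemma=generic]");
        System.out.println(countAdjectives(sample)); // prints {generic=1}
    }
}
```

Calling `countAdjectives` once per file and merging the resulting maps, separately for the positive and the negative files, would give the two frequency tables the question asks about.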

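Best answer

Here is an example that processes an annotated review and stores the adjectives in a Counter.

In the example, the movie review "The movie was great!  It was a great film." has a "positive" sentiment.

I suggest changing my code to load each file, build an Annotation from the file's text, and record that file's sentiment.

You can then go through every file and build up a Counter of positive and negative counts for each adjective.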
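One possible way to organize that per-file loop is sketched here. It is not the answer author's code: `ReviewCorpusCounter`, `addReview`, and `addDirectory` are hypothetical names, and the adjective extraction is passed in as a function so that the file handling can be shown without CoreNLP. In real use that function would run the CoreNLP pipeline on the text and return the tokens tagged as adjectives; the `placeholder` extractor in `main` is only a stand-in with a tiny fixed word set.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReviewCorpusCounter {

    // Per-adjective tallies: index 0 = positive reviews, index 1 = negative reviews.
    private final Map<String, int[]> counts = new HashMap<>();

    /** Extracts adjectives from one review text and adds them to the tally for its sentiment. */
    public void addReview(String text, boolean positive,
                          Function<String, List<String>> adjectiveExtractor) {
        for (String adjective : adjectiveExtractor.apply(text)) {
            counts.computeIfAbsent(adjective, k -> new int[2])[positive ? 0 : 1]++;
        }
    }

    /** Reads every regular file directly inside dir and tallies it with the given sentiment. */
    public void addDirectory(Path dir, boolean positive,
                             Function<String, List<String>> adjectiveExtractor) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            addReview(Files.readString(file), positive, adjectiveExtractor);
        }
    }

    public int positiveCount(String adjective) {
        return counts.getOrDefault(adjective, new int[2])[0];
    }

    public int negativeCount(String adjective) {
        return counts.getOrDefault(adjective, new int[2])[1];
    }

    public static void main(String[] args) throws IOException {
        // Placeholder extractor (an assumption for this demo, NOT a real tagger):
        // it treats words from a tiny fixed set as adjectives.
        Set<String> known = Set.of("great", "boring", "generic");
        Function<String, List<String>> placeholder = text -> {
            List<String> found = new ArrayList<>();
            for (String word : text.toLowerCase().split("[^a-z']+")) {
                if (known.contains(word)) {
                    found.add(word);
                }
            }
            return found;
        };

        // Temporary directories stand in for the real positive and negative review folders.
        Path pos = Files.createTempDirectory("pos");
        Path neg = Files.createTempDirectory("neg");
        Files.writeString(pos.resolve("1.txt"), "The movie was great! It was a great film.");
        Files.writeString(neg.resolve("1.txt"), "The host is boring. Very generic.");

        ReviewCorpusCounter counter = new ReviewCorpusCounter();
        counter.addDirectory(pos, true, placeholder);
        counter.addDirectory(neg, false, placeholder);
        System.out.println("great: +" + counter.positiveCount("great")
                + " / -" + counter.negativeCount("great"));
    }
}
```

Keeping both tallies under one key makes it easy to see later whether an adjective occurs mostly in positive reviews, mostly in negative ones, or in both.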

In the example code, the final Counter contains the adjective "great" with a count of 2.

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.util.CoreMap;

import java.util.Properties;

public class AdjectiveSentimentExample {

    public static void main(String[] args) throws Exception {

        // One counter per sentiment class: key = adjective, value = number of occurrences.
        Counter<String> adjectivePositiveCounts = new ClassicCounter<String>();
        Counter<String> adjectiveNegativeCounts = new ClassicCounter<String>();

        // A single hard-coded review and its known sentiment label.
        Annotation review = new Annotation("The movie was great!  It was a great film.");
        String sentiment = "positive";

        // Build a pipeline that tokenizes, splits sentences, POS-tags, and lemmatizes.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(review);

        // Walk every token of every sentence and count the adjectives.
        for (CoreMap sentence : review.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel cl : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Only the plain adjective tag JJ is matched here;
                // comparative (JJR) and superlative (JJS) tags are not counted.
                if (cl.get(CoreAnnotations.PartOfSpeechAnnotation.class).equals("JJ")) {
                    if (sentiment.equals("positive")) {
                        adjectivePositiveCounts.incrementCount(cl.word());
                    } else if (sentiment.equals("negative")) {
                        adjectiveNegativeCounts.incrementCount(cl.word());
                    }
                }
            }
        }

        // Print the positive counter; for this review it holds "great" with a count of 2.
        System.out.println("---");
        System.out.println("positive adjective counts");
        System.out.println(adjectivePositiveCounts);
    }
}
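Once the counts are collected, the most frequent adjectives still have to be ranked. The sketch below shows one way to do that with the JDK alone, assuming the counts have been copied into a plain `Map<String, Integer>` (the example above uses CoreNLP's `Counter` instead). The class name `TopAdjectives`, the method `topN`, and the sample counts are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopAdjectives {

    /** Returns the n most frequent keys, highest count first; ties are broken alphabetically. */
    public static List<String> topN(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Integer.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical counts, not taken from any real data set.
        Map<String, Integer> positive = Map.of("great", 2, "good", 5, "short", 1);
        System.out.println(topN(positive, 2)); // prints [good, great]
    }
}
```

Running `topN` once on the positive counts and once on the negative counts yields the two ranked feature lists.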

About java - Finding features in a large data set with Stanford CoreNLP: we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34252507/
