
java - Stanford NLP API for Java: how to get a name as a whole, not in parts

Reposted. Author: 行者123. Updated: 2023-11-30 06:27:42

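The purpose of my code is to take a document (a PDF or a Word file), extract all of its text, and pass that text to Stanford NLP for analysis. The code works fine. But suppose the document contains a name such as "Pardeep Kumar". The output I get is: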

Pardeep NNP PERSON

Kumar NNP PERSON

But I want it to look like this:

Pardeep Kumar NNP PERSON

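How can I do this? How do I check whether two adjacent words actually form one name or a similar entity? How can I keep them from being split into separate words?

Here is my code: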

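import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;
import java.util.Properties;
import javax.swing.JFileChooser;
import javax.xml.transform.TransformerConfigurationException;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
// Assumption: the PDF classes come from iText 5; the question does not name the library.
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class readstuff {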

public static void analyse(String data) {

// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);


// create an empty Annotation just with the given text
Annotation document = new Annotation(data);

// run all Annotators on this text
pipeline.annotate(document);

List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

// System.out.println("word"+"\t"+"POS"+"\t"+"NER");
for (CoreMap sentence : sentences) {

// traversing the words in the current sentence
// a CoreLabel is a CoreMap with additional token-specific methods

for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
// this is the text of the token
String word = token.get(CoreAnnotations.TextAnnotation.class);
// this is the POS tag of the token
String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
// this is the NER label of the token
String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

// constant-first equals() avoids a NullPointerException if a token has no NER tag
if("PERSON".equals(ne) || "LOCATION".equals(ne) || "DATE".equals(ne))
{

System.out.format("%32s%10s%16s",word,pos,ne);
System.out.println();
//System.out.println(word +" \t"+pos +"\t"+ne);
}

}
}
}

public static void main(String[] args) throws FileNotFoundException, IOException, TransformerConfigurationException{

JFileChooser window=new JFileChooser();
int a=window.showOpenDialog(null);

if(a==JFileChooser.APPROVE_OPTION){
String name=window.getSelectedFile().getName();
String extension = name.substring(name.lastIndexOf(".") + 1, name.length());
String data = ""; // start empty so appended PDF pages are not prefixed with "null"

if(extension.equals("docx")){
XWPFDocument doc=new XWPFDocument(new FileInputStream(window.getSelectedFile()));
XWPFWordExtractor extract= new XWPFWordExtractor(doc);
//System.out.println("docx file reading...");
data=extract.getText();
//extract.getMetadataTextExtractor();
}
else if(extension.equals("doc")){
HWPFDocument doc=new HWPFDocument(new FileInputStream(window.getSelectedFile()));
WordExtractor extract= new WordExtractor(doc);
//System.out.println("doc file reading...");
data=extract.getText();
}
else if(extension.equals("pdf")){
//System.out.println(window.getSelectedFile());
PdfReader reader=new PdfReader(new FileInputStream(window.getSelectedFile()));
int n=reader.getNumberOfPages();
for(int i=1;i<=n;i++) // pages are numbered 1..n, so include the last page
{
//System.out.println(data);
data=data+PdfTextExtractor.getTextFromPage(reader,i );
}
}
else{
System.out.println("format not supported");
return; // nothing was extracted, so skip the analysis
}

analyse(data);
}
}



}

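Best answer

You want to use the entitymentions annotator. It groups adjacent tokens that belong to the same named entity into a single mention, so "John Smith" comes back as one PERSON instead of two separate tokens.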

package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class EntityMentionsExample {

public static void main(String[] args) {
Annotation document =
new Annotation("John Smith visited Los Angeles on Tuesday. He left Los Angeles on Wednesday.");
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);

for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
System.out.println(entityMention);
System.out.println(entityMention.get(CoreAnnotations.EntityTypeAnnotation.class));
}
}
}
}
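
To make the grouping idea concrete, here is a minimal sketch in plain Java that does not depend on CoreNLP. It merges runs of adjacent tokens that share the same NER tag, which is roughly the effect the question asks for. The class MentionMerger and its Span type are hypothetical names introduced for illustration, not part of the Stanford API, and the real entitymentions annotator is more careful than this. In particular, this simple rule would wrongly join two different people whose names happen to be adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MentionMerger {

    /** Hypothetical stand-in for a tagged token or a merged mention. */
    static final class Span {
        final String text;
        final String ner;

        Span(String text, String ner) {
            this.text = text;
            this.ner = ner;
        }

        @Override
        public String toString() {
            return text + "\t" + ner;
        }
    }

    /** Merges runs of adjacent tokens that share the same non-"O" NER tag. */
    static List<Span> merge(List<Span> tokens) {
        List<Span> mentions = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = null; // null means no entity run is open

        for (Span token : tokens) {
            boolean isEntity = token.ner != null && !"O".equals(token.ner);
            if (isEntity && token.ner.equals(currentTag)) {
                // same tag as the open run: extend it
                current.append(' ').append(token.text);
                continue;
            }
            // the tag changed: close the open run, if any
            if (currentTag != null) {
                mentions.add(new Span(current.toString(), currentTag));
            }
            current.setLength(0);
            if (isEntity) {
                current.append(token.text);
                currentTag = token.ner;
            } else {
                currentTag = null;
            }
        }
        // close a run that reaches the end of the input
        if (currentTag != null) {
            mentions.add(new Span(current.toString(), currentTag));
        }
        return mentions;
    }

    public static void main(String[] args) {
        List<Span> tokens = Arrays.asList(
                new Span("Pardeep", "PERSON"),
                new Span("Kumar", "PERSON"),
                new Span("visited", "O"),
                new Span("Los", "LOCATION"),
                new Span("Angeles", "LOCATION"));
        for (Span mention : merge(tokens)) {
            System.out.println(mention);
        }
    }
}
```

In the question's analyse method, the word and ne values read from each CoreLabel could be fed into a loop like this one. Using the entitymentions annotator as shown in the answer is still the simpler route, since the pipeline then does the grouping itself.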

Regarding "java - Stanford NLP API for Java: how to get a name as a whole, not in parts", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46787542/
