
Usage of the edu.stanford.nlp.process.WordToSentenceProcessor.<init>() method, with code examples

Reposted. Author: 知者. Updated: 2024-03-24 00:03:05

This article collects code examples of the Java method edu.stanford.nlp.process.WordToSentenceProcessor.<init>() and shows how WordToSentenceProcessor.<init>() is used in practice. The examples were extracted from selected projects on platforms such as GitHub, Stack Overflow, and Maven, so they should be useful references. Details of the WordToSentenceProcessor.<init>() method:
Package: edu.stanford.nlp.process.WordToSentenceProcessor
Class: WordToSentenceProcessor
Method: <init>

About WordToSentenceProcessor.<init>

Create a WordToSentenceProcessor using a sensible default list of sentence-ending tokens for English/Latin writing systems. The default set is {".", "?", "!"} plus any run of ! or ?, as in !!!?!?!?!!!?!!?!!!. A sequence of two or more consecutive line breaks is treated as a paragraph break, which also splits sentences. This is the usual constructor for sentence-breaking reasonable text; it uses hard line breaking, so two blank lines indicate a paragraph break. This is the constructor most people use.
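To illustrate the default boundary rule described above, here is a self-contained sketch (this is not CoreNLP code; the class and method names are hypothetical): a token ends a sentence if it is "." or any run of ! and ?, and tokens are grouped into sentences accordingly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DefaultBoundarySketch {
    // Mimics the default rule: a token is a sentence boundary if it is
    // ".", or any non-empty run of '!' and '?' (e.g. "!", "?", "!?!").
    static boolean isBoundary(String token) {
        return token.equals(".") || token.matches("[!?]+");
    }

    // Groups a flat token list into sentences; each boundary token closes
    // the current sentence and is kept inside it.
    static List<List<String>> toSentences(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String t : tokens) {
            current.add(t);
            if (isBoundary(t)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) sentences.add(current); // trailing material
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Hello", "world", ".", "Really", "!?!", "Yes");
        // "Hello world ." / "Really !?!" / "Yes" -> 3 sentences
        System.out.println(toSentences(tokens).size());
    }
}
```

The real class does considerably more (paragraph breaks on blank lines, discardable boundary tokens, multi-token boundary patterns), but the grouping loop is the core idea.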

Code examples

Code example source: stanfordnlp/CoreNLP

public WordsToSentencesAnnotator(boolean verbose, String boundaryTokenRegex,
                 Set<String> boundaryToDiscard, Set<String> htmlElementsToDiscard,
                 String newlineIsSentenceBreak, String boundaryMultiTokenRegex,
                 Set<String> tokenRegexesToDiscard) {
 this(verbose, false,
     new WordToSentenceProcessor<>(boundaryTokenRegex, null,
         boundaryToDiscard, htmlElementsToDiscard,
         WordToSentenceProcessor.stringToNewlineIsSentenceBreak(newlineIsSentenceBreak),
         (boundaryMultiTokenRegex != null) ? TokenSequencePattern.compile(boundaryMultiTokenRegex) : null, tokenRegexesToDiscard));
}

Code example source: stanfordnlp/CoreNLP

/** Return a WordsToSentencesAnnotator that never splits the token stream. You just get one sentence.
 *
 *  @return A WordsToSentenceAnnotator.
 */
public static WordsToSentencesAnnotator nonSplitter() {
 WordToSentenceProcessor<CoreLabel> wts = new WordToSentenceProcessor<>(true);
 return new WordsToSentencesAnnotator(false, false, wts);
}

Code example source: stanfordnlp/CoreNLP

wts = new WordToSentenceProcessor<>();

Code example source: stanfordnlp/CoreNLP

/**
  * For internal debugging purposes only.
  */
 public static void main(String[] args) {
  new BasicDocument<String>();
  Document<String, Word, Word> htmlDoc = BasicDocument.init("top text <h1>HEADING text</h1> this is <p>new paragraph<br>next line<br/>xhtml break etc.");
  System.out.println("Before:");
  System.out.println(htmlDoc);
  Document<String, Word, Word> txtDoc = new StripTagsProcessor<String, Word>(true).processDocument(htmlDoc);
  System.out.println("After:");
  System.out.println(txtDoc);
  Document<String, Word, List<Word>> sentences = new WordToSentenceProcessor<Word>().processDocument(txtDoc);
  System.out.println("Sentences:");
  System.out.println(sentences);
 }
}

Code example source: stanfordnlp/CoreNLP

/** Return a WordsToSentencesAnnotator that splits on newlines (only), which are then deleted.
 *  This constructor counts the lines by putting in empty token lists for empty lines.
 *  It tells the underlying splitter to return empty lists of tokens
 *  and then treats those empty lists as empty lines.  We don't
 *  actually include empty sentences in the annotation, though. But they
 *  are used in numbering the sentence. Only this constructor leads to
 *  empty sentences.
 *
 *  @param  nlToken Zero or more new line tokens, which might be a {@literal \n} or the fake
 *                 newline tokens returned from the tokenizer.
 *  @return A WordsToSentenceAnnotator.
 */
public static WordsToSentencesAnnotator newlineSplitter(String... nlToken) {
 // this constructor will keep empty lines as empty sentences
 WordToSentenceProcessor<CoreLabel> wts =
     new WordToSentenceProcessor<>(ArrayUtils.asImmutableSet(nlToken));
 return new WordsToSentencesAnnotator(false, true, wts);
}
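The newline-splitting behavior documented above — split on newline tokens, discard the newline tokens themselves, and keep empty lines as empty sentences for line numbering — can be sketched without CoreNLP as follows (class and method names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NewlineSplitterSketch {
    // Splits a token stream on newline tokens. The newline tokens are
    // discarded; an empty line yields an empty "sentence", which the
    // annotator uses for counting lines but never emits as a sentence.
    static List<List<String>> splitOnNewlines(List<String> tokens, Set<String> nlTokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String t : tokens) {
            if (nlTokens.contains(t)) {
                sentences.add(current);      // may be empty: a blank line
                current = new ArrayList<>();
            } else {
                current.add(t);
            }
        }
        if (!current.isEmpty()) sentences.add(current);
        return sentences;
    }

    public static void main(String[] args) {
        Set<String> nl = new HashSet<>(Arrays.asList("\n"));
        List<String> tokens = Arrays.asList("a", "b", "\n", "\n", "c");
        // ["a","b"] / [] (blank line) / ["c"] -> 3 entries
        System.out.println(splitOnNewlines(tokens, nl).size());
    }
}
```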

Code example source: stanfordnlp/CoreNLP

public static void addEnhancedSentences(Annotation doc) {
 //for every sentence that begins a paragraph: append this sentence and the previous one and see if sentence splitter would make a single sentence out of it. If so, add as extra sentence.
 //for each sieve that potentially uses augmentedSentences in original:
 List<CoreMap> sentences = doc.get(CoreAnnotations.SentencesAnnotation.class);
 WordToSentenceProcessor wsp =
     new WordToSentenceProcessor(WordToSentenceProcessor.NewlineIsSentenceBreak.NEVER); //create SentenceSplitter that never splits on newline
 int prevParagraph = 0;
 for(int i = 1; i < sentences.size(); i++) {
  CoreMap sentence = sentences.get(i);
  CoreMap prevSentence = sentences.get(i-1);
  List<CoreLabel> tokensConcat = new ArrayList<>();
  tokensConcat.addAll(prevSentence.get(CoreAnnotations.TokensAnnotation.class));
  tokensConcat.addAll(sentence.get(CoreAnnotations.TokensAnnotation.class));
  List<List<CoreLabel>> sentenceTokens = wsp.process(tokensConcat);
  if(sentenceTokens.size() == 1) { //wsp would have put them into a single sentence --> add enhanced sentence.
   sentence.set(EnhancedSentenceAnnotation.class, constructSentence(sentenceTokens.get(0), prevSentence, sentence));
  }
 }
}

Code example source: stanfordnlp/CoreNLP

new WordToSentenceProcessor<>(ArrayUtils.asImmutableSet(new String[]{"\n"}));
  this.countLineNumbers = true;
  this.wts = wts1;
      new WordToSentenceProcessor<>(ArrayUtils.asImmutableSet(new String[]{System.lineSeparator(), "\n"}));
  this.countLineNumbers = true;
  this.wts = wts1;
     new WordToSentenceProcessor<>(ArrayUtils.asImmutableSet(new String[]{PTBTokenizer.getNewlineToken()}));
 this.countLineNumbers = true;
 this.wts = wts1;
if (Boolean.parseBoolean(isOneSentence)) { // this method treats null as false
 WordToSentenceProcessor<CoreLabel> wts1 = new WordToSentenceProcessor<>(true);
 this.countLineNumbers = false;
 this.wts = wts1;
 this.wts = new WordToSentenceProcessor<>(boundaryTokenRegex, boundaryFollowersRegex,
   boundariesToDiscard, htmlElementsToDiscard,
   WordToSentenceProcessor.stringToNewlineIsSentenceBreak(nlsb),

Code example source: edu.stanford.nlp/corenlp

public WordsToSentencesAnnotator(boolean verbose) {
 VERBOSE = verbose;
 wts = new WordToSentenceProcessor<CoreLabel>();
}

Code example source: com.guokr/stan-cn-com

public WordsToSentencesAnnotator(boolean verbose) {
 this(verbose, false, new WordToSentenceProcessor<CoreLabel>());
}

Code example source: com.guokr/stan-cn-com

/** Return a WordsToSentencesAnnotator that never splits the token stream. You just get one sentence.
 *
 *  @param verbose Whether it is verbose.
 *  @return A WordsToSentenceAnnotator.
 */
public static WordsToSentencesAnnotator nonSplitter(boolean verbose) {
 WordToSentenceProcessor<CoreLabel> wts = new WordToSentenceProcessor<CoreLabel>(true);
 return new WordsToSentencesAnnotator(verbose, false, wts);
}

Code example source: edu.stanford.nlp/stanford-corenlp

public WordsToSentencesAnnotator(boolean verbose, String boundaryTokenRegex,
                 Set<String> boundaryToDiscard, Set<String> htmlElementsToDiscard,
                 String newlineIsSentenceBreak, String boundaryMultiTokenRegex,
                 Set<String> tokenRegexesToDiscard) {
 this(verbose, false,
     new WordToSentenceProcessor<>(boundaryTokenRegex, null,
         boundaryToDiscard, htmlElementsToDiscard,
         WordToSentenceProcessor.stringToNewlineIsSentenceBreak(newlineIsSentenceBreak),
         (boundaryMultiTokenRegex != null) ? TokenSequencePattern.compile(boundaryMultiTokenRegex) : null, tokenRegexesToDiscard));
}

Code example source: edu.stanford.nlp/stanford-corenlp

/** Return a WordsToSentencesAnnotator that never splits the token stream. You just get one sentence.
 *
 *  @return A WordsToSentenceAnnotator.
 */
public static WordsToSentencesAnnotator nonSplitter() {
 WordToSentenceProcessor<CoreLabel> wts = new WordToSentenceProcessor<>(true);
 return new WordsToSentencesAnnotator(false, false, wts);
}

Code example source: com.guokr/stan-cn-com

public WordsToSentencesAnnotator(boolean verbose, String boundaryTokenRegex,
                 Set<String> boundaryToDiscard, Set<String> htmlElementsToDiscard,
                 String newlineIsSentenceBreak) {
 this(verbose, false,
    new WordToSentenceProcessor<CoreLabel>(boundaryTokenRegex,
        boundaryToDiscard, htmlElementsToDiscard,
        WordToSentenceProcessor.stringToNewlineIsSentenceBreak(newlineIsSentenceBreak)));
}

Code example source: edu.stanford.nlp/corenlp

public static WordsToSentencesAnnotator newlineSplitter(boolean verbose) {
 WordToSentenceProcessor<CoreLabel> wts = 
  new WordToSentenceProcessor<CoreLabel>("", 
                      Collections.<String>emptySet(),
                      Collections.singleton("\n"));
 return new WordsToSentencesAnnotator(wts, verbose);
}

Code example source: stackoverflow.com

// Tokenize with PTBTokenizer (PTBLexer)
List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory().getTokenizer(new StringReader(text)).tokenize();
// Split into sentences with Stanford's sentence splitter (WordToSentenceProcessor)
WordToSentenceProcessor<CoreLabel> processor = new WordToSentenceProcessor<>();
List<List<CoreLabel>> splitSentences = processor.process(tokens);
// For each sentence
for (List<CoreLabel> sentence : splitSentences) {
  // For each token
  for (CoreLabel token : sentence) {
    // token.value(), token.beginPosition(), and token.endPosition()
    // give the token text and its character offsets
  }
}

Code example source: com.guokr/stan-cn-com

public WordsToSentencesAnnotator(boolean verbose, String boundaryTokenRegex,
                 Set<String> boundaryToDiscard, Set<String> htmlElementsToDiscard,
                 String newlineIsSentenceBreak, String boundaryMultiTokenRegex,
                 Set<String> tokenRegexesToDiscard) {
 this(verbose, false,
     new WordToSentenceProcessor<CoreLabel>(boundaryTokenRegex,
         boundaryToDiscard, htmlElementsToDiscard,
         WordToSentenceProcessor.stringToNewlineIsSentenceBreak(newlineIsSentenceBreak),
         (boundaryMultiTokenRegex != null)? TokenSequencePattern.compile(boundaryMultiTokenRegex):null, tokenRegexesToDiscard));
}

Code example source: edu.stanford.nlp/corenlp

/**
  * For internal debugging purposes only.
  */
 public static void main(String[] args) {
  new BasicDocument<String>();
  Document<String, Word, Word> htmlDoc = BasicDocument.init("top text <h1>HEADING text</h1> this is <p>new paragraph<br>next line<br/>xhtml break etc.");
  System.out.println("Before:");
  System.out.println(htmlDoc);
  Document<String, Word, Word> txtDoc = new StripTagsProcessor<String, Word>(true).processDocument(htmlDoc);
  System.out.println("After:");
  System.out.println(txtDoc);
  Document<String, Word, List<Word>> sentences = new WordToSentenceProcessor<Word>().processDocument(txtDoc);
  System.out.println("Sentences:");
  System.out.println(sentences);
 }
}

Code example source: edu.stanford.nlp/stanford-corenlp

/**
  * For internal debugging purposes only.
  */
 public static void main(String[] args) {
  new BasicDocument<String>();
  Document<String, Word, Word> htmlDoc = BasicDocument.init("top text <h1>HEADING text</h1> this is <p>new paragraph<br>next line<br/>xhtml break etc.");
  System.out.println("Before:");
  System.out.println(htmlDoc);
  Document<String, Word, Word> txtDoc = new StripTagsProcessor<String, Word>(true).processDocument(htmlDoc);
  System.out.println("After:");
  System.out.println(txtDoc);
  Document<String, Word, List<Word>> sentences = new WordToSentenceProcessor<Word>().processDocument(txtDoc);
  System.out.println("Sentences:");
  System.out.println(sentences);
 }
}

Code example source: edu.stanford.nlp/stanford-parser

/**
  * For internal debugging purposes only.
  */
 public static void main(String[] args) {
  new BasicDocument<String>();
  Document<String, Word, Word> htmlDoc = BasicDocument.init("top text <h1>HEADING text</h1> this is <p>new paragraph<br>next line<br/>xhtml break etc.");
  System.out.println("Before:");
  System.out.println(htmlDoc);
  Document<String, Word, Word> txtDoc = new StripTagsProcessor<String, Word>(true).processDocument(htmlDoc);
  System.out.println("After:");
  System.out.println(txtDoc);
  Document<String, Word, List<Word>> sentences = new WordToSentenceProcessor<Word>().processDocument(txtDoc);
  System.out.println("Sentences:");
  System.out.println(sentences);
 }
}

Code example source: com.guokr/stan-cn-com

/**
  * For internal debugging purposes only.
  */
 public static void main(String[] args) {
  new BasicDocument<String>();
  Document<String, Word, Word> htmlDoc = BasicDocument.init("top text <h1>HEADING text</h1> this is <p>new paragraph<br>next line<br/>xhtml break etc.");
  System.out.println("Before:");
  System.out.println(htmlDoc);
  Document<String, Word, Word> txtDoc = new StripTagsProcessor<String, Word>(true).processDocument(htmlDoc);
  System.out.println("After:");
  System.out.println(txtDoc);
  Document<String, Word, List<Word>> sentences = new WordToSentenceProcessor<Word>().processDocument(txtDoc);
  System.out.println("Sentences:");
  System.out.println(sentences);
 }
}

Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号