python - Getting only tokenized sentences as output from Stanford CoreNLP

Reposted by: 太空宇宙. Updated: 2023-11-03 14:53:40

I need to split text into sentences. I am using the pycorenlp wrapper for Python 3. I started the server from my jar directory with the following command:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

I ran the following code:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
text = 'Pusheen and Smitha walked along the beach. Pusheen wanted to surf, but fell off the surfboard.'
output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit', 'outputFormat': 'text'})
print(output)

This gave the following output:

Sentence #1 (8 tokens):
Pusheen and Smitha walked along the beach.
[Text=Pusheen CharacterOffsetBegin=0 CharacterOffsetEnd=7]
[Text=and CharacterOffsetBegin=8 CharacterOffsetEnd=11]
[Text=Smitha CharacterOffsetBegin=12 CharacterOffsetEnd=18]
[Text=walked CharacterOffsetBegin=19 CharacterOffsetEnd=25]
[Text=along CharacterOffsetBegin=26 CharacterOffsetEnd=31]
[Text=the CharacterOffsetBegin=32 CharacterOffsetEnd=35]
[Text=beach CharacterOffsetBegin=36 CharacterOffsetEnd=41]
[Text=. CharacterOffsetBegin=41 CharacterOffsetEnd=42]
Sentence #2 (11 tokens):
Pusheen wanted to surf, but fell off the surfboard.
[Text=Pusheen CharacterOffsetBegin=43 CharacterOffsetEnd=50]
[Text=wanted CharacterOffsetBegin=51 CharacterOffsetEnd=57]
[Text=to CharacterOffsetBegin=58 CharacterOffsetEnd=60]
[Text=surf CharacterOffsetBegin=61 CharacterOffsetEnd=65]
[Text=, CharacterOffsetBegin=65 CharacterOffsetEnd=66]
[Text=but CharacterOffsetBegin=67 CharacterOffsetEnd=70]
[Text=fell CharacterOffsetBegin=71 CharacterOffsetEnd=75]
[Text=off CharacterOffsetBegin=76 CharacterOffsetEnd=79]
[Text=the CharacterOffsetBegin=80 CharacterOffsetEnd=83]
[Text=surfboard CharacterOffsetBegin=84 CharacterOffsetEnd=93]
[Text=. CharacterOffsetBegin=93 CharacterOffsetEnd=94]

I need the output in the following format:

Pusheen and Smitha walked along the beach.
Pusheen wanted to surf, but fell off the surfboard.

Best answer

Try the new "shiny" Stanford CoreNLP API in NLTK =)

First:

pip install -U nltk[corenlp]

On the command line:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
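Once the server is running, it can also be reached with a plain HTTP POST, which is roughly what the Python wrappers do on your behalf. The following is a minimal sketch, assuming the server listens on http://localhost:9000 as started above; the helper names `build_url` and `annotate` are hypothetical and not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_url(base_url="http://localhost:9000", annotators="tokenize,ssplit"):
    """Return the server URL with the pipeline properties as a JSON query parameter."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urlencode({"properties": json.dumps(properties)})
    return base_url.rstrip("/") + "/?" + query


def annotate(text, base_url="http://localhost:9000", annotators="tokenize,ssplit"):
    """POST raw text to a running CoreNLP server and return the decoded JSON."""
    request = Request(
        build_url(base_url, annotators),
        data=text.encode("utf-8"),  # passing data makes this a POST request
        # Send the body as plain text so it is not treated as form-encoded data.
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
    with urlopen(request, timeout=15) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no server; annotate() does need one.
print(build_url())
```

Calling `annotate(text)` requires the server to be up; without it, `urlopen` raises a connection error.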

Then in Python, the standard usage is:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> stanford = CoreNLPParser('http://localhost:9000')
>>> text = 'Pusheen and Smitha walked along the beach. Pusheen wanted to surf, but fell off the surfboard.'

# Gets you the tokens.
>>> ' '.join(next(stanford.raw_parse(text)).leaves())
u'Pusheen and Smitha walked along the beach . Pusheen wanted to surf , but fell off the surfboard .'

# Gets you the Tree object.
>>> next(stanford.raw_parse(text))
Tree('ROOT', [Tree('S', [Tree('S', [Tree('NP', [Tree('NNP', ['Pusheen']), Tree('CC', ['and']), Tree('NNP', ['Smitha'])]), Tree('VP', [Tree('VBD', ['walked']), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['beach'])])])]), Tree('.', ['.'])]), Tree('NP', [Tree('NNP', ['Pusheen'])]), Tree('VP', [Tree('VP', [Tree('VBD', ['wanted']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NN', ['surf'])])])]), Tree(',', [',']), Tree('CC', ['but']), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['off'])]), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['surfboard'])])])]), Tree('.', ['.'])])])

# Gets you the pretty png tree.
>>> next(stanford.raw_parse(text)).draw()

[out]: (image of the rendered parse tree)

To get the tokenized sentences, you need a little trick:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> stanford = CoreNLPParser('http://localhost:9000')

# Using the CoreNLPParser.api_call() function, ...
>>> stanford.api_call
<bound method CoreNLPParser.api_call of <nltk.parse.corenlp.CoreNLPParser object at 0x107131b90>>

# ... , you can get the JSON output from the CoreNLP tool.
>>> stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})
{u'sentences': [{u'tokens': [{u'index': 1, u'word': u'Pusheen', u'after': u' ', u'characterOffsetEnd': 7, u'characterOffsetBegin': 0, u'originalText': u'Pusheen', u'before': u''}, {u'index': 2, u'word': u'and', u'after': u' ', u'characterOffsetEnd': 11, u'characterOffsetBegin': 8, u'originalText': u'and', u'before': u' '}, {u'index': 3, u'word': u'Smitha', u'after': u' ', u'characterOffsetEnd': 18, u'characterOffsetBegin': 12, u'originalText': u'Smitha', u'before': u' '}, {u'index': 4, u'word': u'walked', u'after': u' ', u'characterOffsetEnd': 25, u'characterOffsetBegin': 19, u'originalText': u'walked', u'before': u' '}, {u'index': 5, u'word': u'along', u'after': u' ', u'characterOffsetEnd': 31, u'characterOffsetBegin': 26, u'originalText': u'along', u'before': u' '}, {u'index': 6, u'word': u'the', u'after': u' ', u'characterOffsetEnd': 35, u'characterOffsetBegin': 32, u'originalText': u'the', u'before': u' '}, {u'index': 7, u'word': u'beach', u'after': u'', u'characterOffsetEnd': 41, u'characterOffsetBegin': 36, u'originalText': u'beach', u'before': u' '}, {u'index': 8, u'word': u'.', u'after': u' ', u'characterOffsetEnd': 42, u'characterOffsetBegin': 41, u'originalText': u'.', u'before': u''}], u'index': 0}, {u'tokens': [{u'index': 1, u'word': u'Pusheen', u'after': u' ', u'characterOffsetEnd': 50, u'characterOffsetBegin': 43, u'originalText': u'Pusheen', u'before': u' '}, {u'index': 2, u'word': u'wanted', u'after': u' ', u'characterOffsetEnd': 57, u'characterOffsetBegin': 51, u'originalText': u'wanted', u'before': u' '}, {u'index': 3, u'word': u'to', u'after': u' ', u'characterOffsetEnd': 60, u'characterOffsetBegin': 58, u'originalText': u'to', u'before': u' '}, {u'index': 4, u'word': u'surf', u'after': u'', u'characterOffsetEnd': 65, u'characterOffsetBegin': 61, u'originalText': u'surf', u'before': u' '}, {u'index': 5, u'word': u',', u'after': u' ', u'characterOffsetEnd': 66, u'characterOffsetBegin': 65, u'originalText': u',', u'before': u''}, {u'index': 6, 
u'word': u'but', u'after': u' ', u'characterOffsetEnd': 70, u'characterOffsetBegin': 67, u'originalText': u'but', u'before': u' '}, {u'index': 7, u'word': u'fell', u'after': u' ', u'characterOffsetEnd': 75, u'characterOffsetBegin': 71, u'originalText': u'fell', u'before': u' '}, {u'index': 8, u'word': u'off', u'after': u' ', u'characterOffsetEnd': 79, u'characterOffsetBegin': 76, u'originalText': u'off', u'before': u' '}, {u'index': 9, u'word': u'the', u'after': u' ', u'characterOffsetEnd': 83, u'characterOffsetBegin': 80, u'originalText': u'the', u'before': u' '}, {u'index': 10, u'word': u'surfboard', u'after': u'', u'characterOffsetEnd': 93, u'characterOffsetBegin': 84, u'originalText': u'surfboard', u'before': u' '}, {u'index': 11, u'word': u'.', u'after': u'', u'characterOffsetEnd': 94, u'characterOffsetBegin': 93, u'originalText': u'.', u'before': u''}], u'index': 1}]} 

>>> output_json = stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})
>>> len(output_json['sentences'])
2
>>> for sent in output_json['sentences']:
...     start_offset = sent['tokens'][0]['characterOffsetBegin'] # Begin offset of first token.
...     end_offset = sent['tokens'][-1]['characterOffsetEnd'] # End offset of last token.
...     sent_str = text[start_offset:end_offset]
...     print(sent_str)
... 
Pusheen and Smitha walked along the beach.
Pusheen wanted to surf, but fell off the surfboard.
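The offset-slicing loop above can be wrapped in a small reusable function. This is a minimal sketch for Python 3; the function name `split_sentences` is hypothetical, and the `annotation` dictionary is a hand-written stand-in for the server's JSON, trimmed to the first and last token of each sentence, so the example runs without a server.

```python
def split_sentences(text, annotation):
    """Return the sentences of `text` as substrings, using CoreNLP token offsets."""
    sentences = []
    for sent in annotation["sentences"]:
        tokens = sent["tokens"]
        if not tokens:  # defensive: skip a sentence entry that has no tokens
            continue
        start = tokens[0]["characterOffsetBegin"]  # begin offset of first token
        end = tokens[-1]["characterOffsetEnd"]     # end offset of last token
        sentences.append(text[start:end])
    return sentences


text = ('Pusheen and Smitha walked along the beach. '
        'Pusheen wanted to surf, but fell off the surfboard.')

# Stand-in for the JSON shown above: only the two offset fields are kept,
# and only the first and last token of each sentence.
annotation = {"sentences": [
    {"index": 0, "tokens": [{"characterOffsetBegin": 0, "characterOffsetEnd": 7},
                            {"characterOffsetBegin": 41, "characterOffsetEnd": 42}]},
    {"index": 1, "tokens": [{"characterOffsetBegin": 43, "characterOffsetEnd": 50},
                            {"characterOffsetBegin": 93, "characterOffsetEnd": 94}]},
]}

for sentence in split_sentences(text, annotation):
    print(sentence)
```

With a live server, `annotation` would come from `stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})`. The same function should also work on the dictionary that pycorenlp returns when `'outputFormat': 'json'` is requested instead of `'text'`, assuming the wrapper decodes the JSON response for you.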

Regarding python - Getting only tokenized sentences as output from Stanford CoreNLP, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44292616/
