gpt4 book ai didi

python - 仅获取标记化句子作为 Stanford Core NLP 的输出

转载 作者:太空宇宙 更新时间:2023-11-03 14:53:40 25 4
gpt4 key购买 nike

我需要拆分句子。我正在为 python3 使用 pycorenlp 包装器。我使用以下命令从我的 jar 目录启动服务器:java -mx4g -cp "*"edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

我运行了以下命令:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
text = 'Pusheen and Smitha walked along the beach. Pusheen wanted to surf, but fell off the surfboard.'
output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit', 'outputFormat': 'text'})
print (output)

给出了以下输出:

Sentence #1 (8 tokens):
Pusheen and Smitha walked along the beach.
[Text=Pusheen CharacterOffsetBegin=0 CharacterOffsetEnd=7]
[Text=and CharacterOffsetBegin=8 CharacterOffsetEnd=11]
[Text=Smitha CharacterOffsetBegin=12 CharacterOffsetEnd=18]
[Text=walked CharacterOffsetBegin=19 CharacterOffsetEnd=25]
[Text=along CharacterOffsetBegin=26 CharacterOffsetEnd=31]
[Text=the CharacterOffsetBegin=32 CharacterOffsetEnd=35]
[Text=beach CharacterOffsetBegin=36 CharacterOffsetEnd=41]
[Text=. CharacterOffsetBegin=41 CharacterOffsetEnd=42]
Sentence #2 (11 tokens):
Pusheen wanted to surf, but fell off the surfboard.
[Text=Pusheen CharacterOffsetBegin=43 CharacterOffsetEnd=50]
[Text=wanted CharacterOffsetBegin=51 CharacterOffsetEnd=57]
[Text=to CharacterOffsetBegin=58 CharacterOffsetEnd=60]
[Text=surf CharacterOffsetBegin=61 CharacterOffsetEnd=65]
[Text=, CharacterOffsetBegin=65 CharacterOffsetEnd=66]
[Text=but CharacterOffsetBegin=67 CharacterOffsetEnd=70]
[Text=fell CharacterOffsetBegin=71 CharacterOffsetEnd=75]
[Text=off CharacterOffsetBegin=76 CharacterOffsetEnd=79]
[Text=the CharacterOffsetBegin=80 CharacterOffsetEnd=83]
[Text=surfboard CharacterOffsetBegin=84 CharacterOffsetEnd=93]
[Text=. CharacterOffsetBegin=93 CharacterOffsetEnd=94]

我需要以下格式的输出:

Pusheen and Smitha walked along the beach.
Pusheen wanted to surf, but fell off the surfboard.

最佳答案

试试 new "shiny" Stanford CoreNLP API in NLTK =)

首先:

pip install -U nltk[corenlp]

在命令行上:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

那么在Python中,标准的用法是:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> stanford = CoreNLPParser('http://localhost:9000')
>>> text = 'Pusheen and Smitha walked along the beach. Pusheen wanted to surf, but fell off the surfboard.'

# Gets you the tokens.
>>> ' '.join(next(stanford.raw_parse(text)).leaves())
u'Pusheen and Smitha walked along the beach . Pusheen wanted to surf , but fell off the surfboard .'

# Gets you the Tree object.
>>> next(stanford.raw_parse(text))
Tree('ROOT', [Tree('S', [Tree('S', [Tree('NP', [Tree('NNP', ['Pusheen']), Tree('CC', ['and']), Tree('NNP', ['Smitha'])]), Tree('VP', [Tree('VBD', ['walked']), Tree('PP', [Tree('IN', ['along']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['beach'])])])]), Tree('.', ['.'])]), Tree('NP', [Tree('NNP', ['Pusheen'])]), Tree('VP', [Tree('VP', [Tree('VBD', ['wanted']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NN', ['surf'])])])]), Tree(',', [',']), Tree('CC', ['but']), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['off'])]), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['surfboard'])])])]), Tree('.', ['.'])])])

# Gets you the pretty png tree.
>>> next(stanford.raw_parse(text)).draw()

[输出]:

enter image description here


要获得标记化的句子,您需要一些技巧:

>>> from nltk.parse.corenlp import CoreNLPParser
>>> stanford = CoreNLPParser('http://localhost:9000')

# Using the CoreNLPParser.api_call() function, ...
>>> stanford.api_call
<bound method CoreNLPParser.api_call of <nltk.parse.corenlp.CoreNLPParser object at 0x107131b90>>

# ... , you can get the JSON output from the CoreNLP tool.
>>> stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})
{u'sentences': [{u'tokens': [{u'index': 1, u'word': u'Pusheen', u'after': u' ', u'characterOffsetEnd': 7, u'characterOffsetBegin': 0, u'originalText': u'Pusheen', u'before': u''}, {u'index': 2, u'word': u'and', u'after': u' ', u'characterOffsetEnd': 11, u'characterOffsetBegin': 8, u'originalText': u'and', u'before': u' '}, {u'index': 3, u'word': u'Smitha', u'after': u' ', u'characterOffsetEnd': 18, u'characterOffsetBegin': 12, u'originalText': u'Smitha', u'before': u' '}, {u'index': 4, u'word': u'walked', u'after': u' ', u'characterOffsetEnd': 25, u'characterOffsetBegin': 19, u'originalText': u'walked', u'before': u' '}, {u'index': 5, u'word': u'along', u'after': u' ', u'characterOffsetEnd': 31, u'characterOffsetBegin': 26, u'originalText': u'along', u'before': u' '}, {u'index': 6, u'word': u'the', u'after': u' ', u'characterOffsetEnd': 35, u'characterOffsetBegin': 32, u'originalText': u'the', u'before': u' '}, {u'index': 7, u'word': u'beach', u'after': u'', u'characterOffsetEnd': 41, u'characterOffsetBegin': 36, u'originalText': u'beach', u'before': u' '}, {u'index': 8, u'word': u'.', u'after': u' ', u'characterOffsetEnd': 42, u'characterOffsetBegin': 41, u'originalText': u'.', u'before': u''}], u'index': 0}, {u'tokens': [{u'index': 1, u'word': u'Pusheen', u'after': u' ', u'characterOffsetEnd': 50, u'characterOffsetBegin': 43, u'originalText': u'Pusheen', u'before': u' '}, {u'index': 2, u'word': u'wanted', u'after': u' ', u'characterOffsetEnd': 57, u'characterOffsetBegin': 51, u'originalText': u'wanted', u'before': u' '}, {u'index': 3, u'word': u'to', u'after': u' ', u'characterOffsetEnd': 60, u'characterOffsetBegin': 58, u'originalText': u'to', u'before': u' '}, {u'index': 4, u'word': u'surf', u'after': u'', u'characterOffsetEnd': 65, u'characterOffsetBegin': 61, u'originalText': u'surf', u'before': u' '}, {u'index': 5, u'word': u',', u'after': u' ', u'characterOffsetEnd': 66, u'characterOffsetBegin': 65, u'originalText': u',', u'before': u''}, {u'index': 6, u'word': u'but', u'after': u' ', u'characterOffsetEnd': 70, u'characterOffsetBegin': 67, u'originalText': u'but', u'before': u' '}, {u'index': 7, u'word': u'fell', u'after': u' ', u'characterOffsetEnd': 75, u'characterOffsetBegin': 71, u'originalText': u'fell', u'before': u' '}, {u'index': 8, u'word': u'off', u'after': u' ', u'characterOffsetEnd': 79, u'characterOffsetBegin': 76, u'originalText': u'off', u'before': u' '}, {u'index': 9, u'word': u'the', u'after': u' ', u'characterOffsetEnd': 83, u'characterOffsetBegin': 80, u'originalText': u'the', u'before': u' '}, {u'index': 10, u'word': u'surfboard', u'after': u'', u'characterOffsetEnd': 93, u'characterOffsetBegin': 84, u'originalText': u'surfboard', u'before': u' '}, {u'index': 11, u'word': u'.', u'after': u'', u'characterOffsetEnd': 94, u'characterOffsetBegin': 93, u'originalText': u'.', u'before': u''}], u'index': 1}]}

>>> output_json = stanford.api_call(text, properties={'annotators': 'tokenize,ssplit'})
>>> len(output_json['sentences'])
2
>>> for sent in output_json['sentences']:
... start_offset = sent['tokens'][0]['characterOffsetBegin'] # Begin offset of first token.
... end_offset = sent['tokens'][-1]['characterOffsetEnd'] # End offset of last token.
... sent_str = text[start_offset:end_offset]
... print sent_str
...
Pusheen and Smitha walked along the beach.
Pusheen wanted to surf, but fell off the surfboard.

关于python - 仅获取标记化句子作为 Stanford Core NLP 的输出,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44292616/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com