(Note: I am aware that there have been previous posts on this question (e.g. here or here), but they are rather old, and I think there has been quite some progress in NLP in the past few years.)
I am trying to determine the tense of a sentence, using natural language processing in Python.
Is there an easy-to-use package for this? If not, how would I implement a solution with TextBlob, StanfordNLP, or the Google Cloud Natural Language API?
TextBlob seems easiest to use, and I managed to list the POS tags, but I am not sure how to turn the output into a 'tense prediction value' or simply a best guess at the tense. Moreover, my text is in Spanish, so I would prefer to use Google Cloud or StanfordNLP (or any other easy-to-use solution) that supports Spanish.
I have not managed to work with the Python interface for StanfordNLP.
Google Cloud Natural Language API seems to offer exactly what I need (see here), but I have not managed to find out how to get this output. I have used Google Cloud NLP for other analyses (e.g. entity sentiment analysis) and it worked, so I am confident I could set it up if I found the right usage example.
Example with TextBlob:
from textblob import TextBlob
from textblob.taggers import NLTKTagger
nltk_tagger = NLTKTagger()
blob = TextBlob("I am curious to see whether NLP is able to predict the tense of this sentence.", pos_tagger=nltk_tagger)
print(blob.pos_tags)
-> This prints the POS tags. How would I convert them into a prediction of the tense of this sentence?
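One rough way to turn such tags into a best guess is to let the finite verb tags vote. The sketch below is only an illustration under stated assumptions: the `TAG_TO_TENSE` mapping and the `guess_tense` helper are hypothetical names, the tags are assumed to be Penn Treebank tags as produced by the NLTK tagger, and the input is any list of (word, tag) pairs such as `blob.pos_tags`. It only applies to English tag sets, so it would not help with Spanish text.

```python
from collections import Counter

# Penn Treebank verb tags mapped to a coarse tense label (an assumption of
# this sketch; VB, VBG and VBN carry no tense on their own and are skipped).
TAG_TO_TENSE = {"VBD": "past", "VBP": "present", "VBZ": "present"}


def guess_tense(pos_tags):
    """Guess a tense from (word, tag) pairs such as blob.pos_tags."""
    votes = Counter()
    for word, tag in pos_tags:
        if tag == "MD" and word.lower() in ("will", "shall", "'ll"):
            # A modal "will"/"shall" is counted as a vote for future.
            votes["future"] += 1
        elif tag in TAG_TO_TENSE:
            votes[TAG_TO_TENSE[tag]] += 1
    if not votes:
        return "unknown"
    # On a tie, the label that was seen first wins.
    return votes.most_common(1)[0][0]


# Hand-written tags for illustration, not actual tagger output.
sample = [("She", "PRP"), ("ate", "VBD"), ("the", "DT"), ("apple", "NN")]
sample_guess = guess_tense(sample)
```

Counting tags ignores sentence structure, so a subordinate clause can outvote the main verb; a dependency parse is more reliable when one is available.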
Example with Google Cloud NLP (after setting up credentials):
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
text = "I am curious to see how this works"
client = language.LanguageServiceClient()
document = types.Document(
    content=text,
    type=enums.Document.Type.PLAIN_TEXT)
tense = (WHAT NEEDS TO COME HERE?)
print(tense)
-> I am not sure about the code that needs to be entered to predict the tense (indicated in the code)
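For what it's worth, the syntax analysis endpoint returns one part-of-speech record per token, and that record has a tense field. The sketch below assumes the 2.x `google-cloud-language` client with the `language_v1` module, which differs from the older `enums`/`types` imports used above. The helper names (`tally_tenses`, `analyze_tenses`, `fake_token`) are made up for illustration, and `fake_token` only exists so the counting logic can be tried without credentials.

```python
from collections import Counter
from types import SimpleNamespace


def tally_tenses(tokens):
    """Count the tense labels of tokens, skipping TENSE_UNKNOWN."""
    counts = Counter()
    for token in tokens:
        name = token.part_of_speech.tense.name
        if name != "TENSE_UNKNOWN":
            counts[name] += 1
    return counts


def analyze_tenses(text, language_code="es"):
    """Call analyze_syntax and tally tenses (needs credentials and network)."""
    # Imported lazily so the rest of the sketch runs without the package.
    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text,
        type_=language_v1.Document.Type.PLAIN_TEXT,
        language=language_code,
    )
    response = client.analyze_syntax(
        request={
            "document": document,
            "encoding_type": language_v1.EncodingType.UTF8,
        }
    )
    return tally_tenses(response.tokens)


def fake_token(tense_name):
    """Stand-in token for a dry run without credentials (hypothetical)."""
    return SimpleNamespace(
        part_of_speech=SimpleNamespace(tense=SimpleNamespace(name=tense_name))
    )


# Dry run on stand-in tokens; a real call would be analyze_tenses("...").
demo = tally_tenses(
    [fake_token("PRESENT"), fake_token("TENSE_UNKNOWN"),
     fake_token("PRESENT"), fake_token("PAST")]
)
best_guess = demo.most_common(1)[0][0]
```

Taking the most common label is only a best guess; the tense is reported per token, so picking the tense of the main verb would still require looking at the dependency information in the same response.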
I am quite a newbie to Python so any help on this topic would be highly appreciated! Thanks!
I don't think any NLP toolkit has a function to detect past tense right away. But you can simply get it from dependency parsing and POS tagging.
Do a dependency parse of the sentence and look at the root, which is the main predicate of the sentence, and its POS tag. If it is VBD (a verb in the past simple form), it is surely past tense. If it is VB (base form) or VBG (a gerund), you need to check its dependency children for an auxiliary verb (deprel aux) that has the VBD tag.
If you also need to cover present/past perfect or past modal expressions (I must have had...), you can just extend the conditions.
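As a rough illustration of what extending the conditions could look like, here is a sketch that works on the root's tag and its auxiliaries only, so it can be tried without a parser. The function name `classify_verb_group` and the returned labels are assumptions of this sketch, and contracted forms such as 've or 'd are not handled.

```python
def classify_verb_group(root_tag, aux):
    """Classify a verb group from the root's tag and its auxiliaries.

    aux is a list of (lowercased word, Penn Treebank tag) pairs, e.g. built
    with spaCy as:
        [(w.lower_, w.tag_) for w in sent.root.children if w.dep_ == "aux"]
    """
    words = [w for w, _ in aux]
    tags = [t for _, t in aux]
    modals = [w for w, t in aux if t == "MD"]

    # "must have had", "could have gone"; "will have ..." is left out
    # because it points to the future rather than the past.
    if "have" in words and any(m not in ("will", "shall") for m in modals):
        return "past modal"
    if ("had", "VBD") in aux:  # "had eaten"
        return "past perfect"
    if ("have", "VBP") in aux or ("has", "VBZ") in aux:  # "has eaten"
        return "present perfect"
    if root_tag == "VBD" or "VBD" in tags:  # "ate", "did eat"
        return "past"
    return "other"
```

The order of the checks matters: the more specific perfect and modal patterns are tested before the plain past check, since "had eaten" would otherwise be reported as simple past.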
In spaCy (my favorite NLP toolkit for Python), you can write it like this (assuming your input is a single sentence):
import spacy

nlp = spacy.load('en_core_web_sm')

def detect_past_sentece(sentence):
    # Use the first sentence of the parsed input.
    sent = list(nlp(sentence).sents)[0]
    return (
        sent.root.tag_ == "VBD" or
        any(w.dep_ == "aux" and w.tag_ == "VBD" for w in sent.root.children))
With Google Cloud API or StanfordNLP, it would be basically the same; I am just not so familiar with their APIs.
I worked with ChatGPT to code this up (correcting it, and it advancing a bunch of it in ways that'd take me forever to figure out). So far, in the included tests, it works pretty well, but it has some problems and could use some help.
The code allows detecting the main tense of a sentence (past, present, future, unknown), as well as that of an embedded/subordinate clause. I wanted it for assisting in time adjustment for a separate speech-to-text project -- for a sentence like "Jamie wants to have food in 3 hours.", where it's present tense, but the referenced time is in the future.
Most of the predicted tests actually work for my time-adjustment project, so I'm leaving those, but some others fail and I don't know how to handle them. For example, for "She wants to go to sleep." and "She wants to go sleep in 3 hours.", I'd want the embedded clause to be present and future, respectively. (The current code gets it as "unknown".)
Screencap of part of the tests' output
I'm thinking, if the main clause is present, and the embedded is unknown, I can place it in the future, but I'd like it to handle the grammar, not just the final "unknown" (unless that's all that's needed).
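That fallback idea can be written down as a small post-processing step. This is only a sketch of the heuristic: `resolve_embedded_tense` and the `has_embedded_clause` flag are hypothetical (the detection code would need to report whether it actually found an xcomp/ccomp/advcl child), and it does not look at the grammar of the embedded clause at all.

```python
def resolve_embedded_tense(main_tense, embedded_tense, has_embedded_clause):
    """Fill in an 'unknown' embedded tense from the main clause tense."""
    if not has_embedded_clause:
        # Nothing to resolve, e.g. "I ate an apple."
        return "unknown"
    if embedded_tense != "unknown":
        # The detector already found a tense; keep it.
        return embedded_tense
    if main_tense == "present":
        # "She wants to go sleep." -> the sleeping lies ahead.
        return "future"
    # "She wanted to go sleep earlier." -> inherit the main clause tense.
    return main_tense
```

With this rule, a present main clause pushes an untensed infinitive into the future, while a past or future main clause hands its own tense down.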
Here's the current code.
(Note that the bansi module is for term color codes and is here:
https://gist.github.com/jaggzh/35b3705327ad9b4a3439014b8153384e)
#!/usr/bin/env python3
import spacy
from tabulate import tabulate
from bansi import *
import sys

nlp = spacy.load("en_core_web_sm")

def pe(*x, **y):
    print(*x, **y, file=sys.stderr)
def detect_tense(sentence):
    sent = list(nlp(sentence).sents)[0]
    root_tag = sent.root.tag_
    aux_tags = [w.tag_ for w in sent.root.children if w.dep_ == "aux"]
    # Detect past tense
    if root_tag == "VBD" or "VBD" in aux_tags:
        return "past"
    # Detect present tense
    if root_tag in ["VBG", "VBP", "VBZ"] or ("VBP" in aux_tags or "VBZ" in aux_tags):
        return "present"
    # Detect future tense (usually indicated by the auxiliary 'will' or 'shall')
    if any(w.lower_ in ["will", "shall"] for w in sent.root.children if w.dep_ == "aux"):
        return "future"
    return "unknown"
def extract_subtree_str(token):
    return ' '.join([t.text for t in token.subtree])

def detect_embedded_tense(sentence):
    doc = nlp(sentence)
    main_tense = "unknown"
    embedded_tense = "unknown"
    for sent in doc.sents:
        root = sent.root
        main_tense = detect_tense(sentence)  # Detect main clause tense
        for child in root.children:  # Detect embedded clause tense
            if child.dep_ in ["xcomp", "ccomp", "advcl"]:
                clause = extract_subtree_str(child)
                embedded_tense = detect_tense(clause)
    return main_tense, embedded_tense
def show_parts(sentence):
    doc = nlp(sentence)
    words = [''] + [str(token) for token in doc]
    tags = ['pos'] + [token.tag_ for token in doc]
    deps = ['dep'] + [token.dep_ for token in doc]
    print(tabulate([words, tags, deps]))

# def get_verb_tense(sentence):
#     doc = nlp(sentence)
#     for token in doc:
#         print(f" tag_: {token.tag_}")
#         if "VERB" in token.tag_:
#             return token.tag_
#     return "No verb found"
if __name__ == '__main__':
    # Test the function
    sentences = [
        # (sentence, main_clause_expected_tense, embedded_clause_expected_tense)
        ("I ate an apple.", "past", "unknown"),
        ("I had eaten an apple.", "past", "unknown"),
        ("I am eating an apple.", "present", "unknown"),
        ("She needs to sleep at 4.", "present", "future"),
        ("She needed to sleep at 4.", "past", "past"),
        ("I ate an apple.", "past", "unknown"),
        ("I had eaten an apple.", "past", "unknown"),
        ("I am eating an apple.", "present", "unknown"),
        ("I eat an apple.", "present", "unknown"),
        ("I have been eating.", "present", "unknown"),
        ("I will eat an apple.", "future", "unknown"),
        ("I shall eat an apple.", "future", "unknown"),
        ("She will eat at 3.", "future", "unknown"),
        ("She ate at 3.", "past", "unknown"),
        ("She went to sleep at 4.", "past", "unknown"),
        ("She has to eat.", "future", "unknown"),
        ("She wants to go sleep.", "present", "future"),  # This could be debated
        ("She wants to go sleep in 3 hours.", "present", "future"),  # This could be debated
        ("She wanted to go sleep earlier.", "past", "past"),
        ("I want to be sleeping.", "present", "future"),  # This could be debated
        ("I am sleeping.", "present", "unknown"),
        ("She is eating.", "present", "unknown"),
    ]
    for s, exp_main_tense, exp_embedded_tense in sentences:
        print(f"{bgblu}{yel}-------------------------------------- {rst}")
        print(f"{bgblu}{yel} Sent: {s}{rst}")
        show_parts(s)
        det_main_tense, det_embedded_tense = detect_embedded_tense(s)
        print(f" Main Pred-Tense: {yel}{det_main_tense}{rst}")
        print(f" Main Exp-Tense: {yel}{exp_main_tense}{rst}")
        if det_main_tense == exp_main_tense:
            print(f" {bgre}MATCH{rst}")
        else:
            print(f" {bred}MISMATCH{rst}")
        print(f" Embedded Pred-Tense: {yel}{det_embedded_tense}{rst}")
        print(f" Embedded Exp-Tense: {yel}{exp_embedded_tense}{rst}")
        if det_embedded_tense == exp_embedded_tense:
            print(f" {bgre}MATCH{rst}")
        else:
            print(f" {bred}MISMATCH{rst}")
Thank you for your suggestions! I tried this out, but got the following error message: "OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory." Could you doublecheck the link or let me know how I find the correct one? Thanks!
With spaCy, you need to download NLP models separately; see spacy.io/usage/models. Run python -m spacy download en_core_web_sm.