python - Trouble extracting compound nouns, including hyphenated ones, in NLP

Reposted. Author: 行者123. Updated: 2023-12-04 08:45:06

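Background and goal
I want to extract the nouns and compound nouns from each sentence, including hyphenated ones, as shown below.
If a noun contains a hyphen, it should be extracted with the hyphen intact.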

{The T-shirt is old.: ['T-shirt'], 
I bought the computer and the new web-cam.: ['computer', 'web-cam'],
I bought the computer and the new web camera.: ['computer', 'web camera']}
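Problem
The current output is shown below.
The first word of each compound noun carries the dependency label compound, but I have not been able to extract what I expect.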
T T PROPN NNP compound X True False
shirt shirt NOUN NN nsubj xxxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
cam cam NOUN NN conj xxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
camera camera NOUN NN conj xxxx True False

{The T-shirt is old.: ['T -', 'T', 'T -', 'shirt'],
I bought the computer and the new web-cam.: ['web -', 'computer', 'web -', 'web', 'web -', 'cam'],
I bought the computer and the new web camera.: ['web camera', 'computer', 'web camera', 'web', 'web camera', 'camera']}

Current code
I am using the NLP library spaCy to distinguish nouns from compound nouns.
I would appreciate advice on how to fix the current code.
import spacy
nlp = spacy.load("en_core_web_sm")

texts = ["The T-shirt is old.", "I bought the computer and the new web-cam.", "I bought the computer and the new web camera."]

nouns = []*len(texts)
dic = {k: v for k, v in zip(texts, nouns)}

for i in range(len(texts)):
    text = nlp(texts[i])
    words = []
    for word in text:
        if word.pos_ == 'NOUN' or word.pos_ == 'PROPN':
            print(word.text, word.lemma_, word.pos_, word.tag_, word.dep_,
                  word.shape_, word.is_alpha, word.is_stop)

            # compound words
            for j in range(len(text)):
                token = text[j]
                if token.dep_ == 'compound':
                    if j < len(text)-1:
                        nexttoken = text[j+1]
                        words.append(str(token.text + ' ' + nexttoken.text))
            else:
                words.append(word.text)
    dic[text] = words
print(dic)
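
A likely reason for the duplicated entries, assuming the else in the code above pairs with the inner for loop (the reading that is consistent with the printed result): a for/else block runs its else branch whenever the loop finishes without a break, and the inner scan restarts for every noun. The sketch below reproduces the output for the first sentence without spaCy, using hand-written (text, pos, dep) records. The tokens list and the collect function are illustrative names, not part of the original code, and only the values for T and shirt are taken from the printed table.

```python
# Hand-written token records for "The T-shirt is old." (not parser output).
# Each record is (text, pos, dep). The values for "T" and "shirt" follow the
# table printed in the question; the other rows are assumptions.
tokens = [
    ("The", "DET", "det"),
    ("T", "PROPN", "compound"),
    ("-", "PUNCT", "punct"),
    ("shirt", "NOUN", "nsubj"),
    ("is", "AUX", "ROOT"),
    ("old", "ADJ", "acomp"),
    (".", "PUNCT", "punct"),
]


def collect(tokens):
    """Mimic the question's loop structure on plain tuples."""
    words = []
    for text, pos, _dep in tokens:
        if pos in ("NOUN", "PROPN"):
            # The inner scan restarts for every noun, so each compound
            # is appended once per noun in the sentence.
            for j, (tok_text, _pos, dep) in enumerate(tokens):
                if dep == "compound" and j < len(tokens) - 1:
                    # The token right after "T" is the hyphen, hence "T -".
                    words.append(tok_text + " " + tokens[j + 1][0])
            else:
                # for/else: runs once after the loop ends without break.
                words.append(text)
    return words


result = collect(tokens)
print(result)  # ['T -', 'T', 'T -', 'shirt']
```

The hyphen is a separate token, so joining a compound token with only the next token can never yield T-shirt; the span has to reach the head noun.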
Development environment
Python 3.7.4
spaCy version 2.3.2

Best answer

Try this:

import spacy
nlp = spacy.load("en_core_web_sm")

texts = (
    "The T-shirt is old.",
    "I bought the computer and the new web-cam.",
    "I bought the computer and the new web camera.",
)
docs = nlp.pipe(texts)

compounds = []
for doc in docs:
    # Slice from each compound token through its head, so the hyphen stays inside the span.
    compounds.append({doc.text: [doc[tok.i:tok.head.i+1] for tok in doc if tok.dep_ == "compound"]})
print(compounds)
[{'The T-shirt is old.': [T-shirt]},
{'I bought the computer and the new web-cam.': [web-cam]},
{'I bought the computer and the new web camera.': [web camera]}]
computer is missing from this list, but I don't think it qualifies as a compound.
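
If standalone nouns such as computer should be included as well, as in the expected output of the question, one option is to take each compound span first and then add any noun that is not already inside one. The sketch below shows that logic in plain Python on hand-written token records, so it runs without a model. The Tok fields (text, ws, pos, dep, head) and the noun_phrases helper are assumptions made for illustration, loosely standing in for spaCy's tok.text, tok.whitespace_, tok.pos_, tok.dep_ and tok.head.i. The records are annotated by hand and assume that web attaches to cam as a compound.

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    text: str  # token text
    ws: str    # trailing whitespace ("" when the next token follows directly)
    pos: str   # coarse part of speech
    dep: str   # dependency label
    head: int  # index of the syntactic head


def noun_phrases(toks: List[Tok]) -> List[str]:
    """Return compound spans plus nouns that are not part of any span."""
    covered = set()  # token indices already inside an emitted span
    spans = []       # (start, end) pairs, end exclusive
    for i, tok in enumerate(toks):
        if i in covered:
            continue
        if tok.dep == "compound" and tok.head > i:
            # Span runs from the compound token through its head,
            # which keeps any hyphen token in between.
            spans.append((i, tok.head + 1))
            covered.update(range(i, tok.head + 1))
        elif tok.pos in ("NOUN", "PROPN"):
            spans.append((i, i + 1))
            covered.add(i)
    # Rebuild the surface text from each token and its trailing whitespace.
    return ["".join(t.text + t.ws for t in toks[a:b]).strip() for a, b in spans]


# Hand-annotated records for "I bought the computer and the new web-cam."
sentence = [
    Tok("I", " ", "PRON", "nsubj", 1),
    Tok("bought", " ", "VERB", "ROOT", 1),
    Tok("the", " ", "DET", "det", 3),
    Tok("computer", " ", "NOUN", "dobj", 1),
    Tok("and", " ", "CCONJ", "cc", 3),
    Tok("the", " ", "DET", "det", 9),
    Tok("new", " ", "ADJ", "amod", 9),
    Tok("web", "", "NOUN", "compound", 9),
    Tok("-", "", "PUNCT", "punct", 9),
    Tok("cam", "", "NOUN", "conj", 3),
    Tok(".", "", "PUNCT", "punct", 1),
]

phrases = noun_phrases(sentence)
print(phrases)  # ['computer', 'web-cam']
```

This sketch does not merge chained compounds of three or more tokens into one span, and the result on real text depends on how the parser labels each sentence.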

Regarding "python - Trouble extracting compound nouns, including hyphenated ones, in NLP", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64365478/
