python - Extracting sentence embedding features with Pandas and spaCy

I am currently learning spaCy and working on an exercise about word and sentence embeddings. The sentences are stored in a pandas DataFrame column, and we need to train a classifier on the vectors of these sentences.

I have a dataframe that looks like this:

+---+---------------------------------------------------+
| | sentence |
+---+---------------------------------------------------+
| 0 | "Whitey on the Moon" is a 1970 spoken word poe... |
+---+---------------------------------------------------+
| 1 | St Anselm's Church is a Roman Catholic church ... |
+---+---------------------------------------------------+
| 2 | Nymphargus grandisonae (common name: giant gla... |
+---+---------------------------------------------------+

Next, I apply the NLP pipeline to these sentences:

import en_core_web_md
nlp = en_core_web_md.load()
df['tokenized'] = df['sentence'].apply(nlp)

Now, if I understand correctly, each item in df['tokenized'] has an attribute that returns the sentence's vector as a 1-D NumPy array:

print(type(df['tokenized'][0].vector))
print(df['tokenized'][0].vector.shape)

which outputs:

<class 'numpy.ndarray'>
(300,)
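For context, spaCy's Doc.vector defaults to the average of the token vectors, so the value above should be reproducible by hand; a quick sketch, reusing df from above:

import numpy as np

# Doc.vector defaults to the average of the token vectors,
# so it should match a manual mean over the tokens:
doc = df['tokenized'][0]
manual_mean = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, manual_mean))  # expected: True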

How can I add the contents of this array (300 values) as columns to the df dataframe for the corresponding sentence, ignoring stop words?

Thanks!

Best answer

Assuming you have the list of sentences:

sents = ["'Whitey on the Moon' is a 1970 spoken word",
         "St Anselm's Church is a Roman Catholic church",
         "Nymphargus grandisonae (common name: giant)"]

which you put into a dataframe:

import pandas as pd

df = pd.DataFrame({"sentence": sents})
print(df)
sentence
0 'Whitey on the Moon' is a 1970 spoken word
1 St Anselm's Church is a Roman Catholic church
2 Nymphargus grandisonae (common name: giant)

then you can proceed as follows:

import numpy as np

df['tokenized'] = df['sentence'].apply(nlp)
df['sent_vectors'] = df['tokenized'].apply(
    lambda sent: np.mean([token.vector for token in sent if not token.is_stop], axis=0)
)

The resulting sent_vectors column is the mean of the vector embeddings of all tokens that are not stop words (via the token.is_stop attribute). Note that axis=0 is needed so that np.mean averages across tokens and returns a 300-dimensional vector rather than collapsing everything into a single scalar.
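If you literally want the 300 values as separate columns of df, as asked in the question, one way is to stack the vectors and join them back; a minimal sketch (the vec_ column names are arbitrary, chosen here for illustration):

import numpy as np
import pandas as pd

# Stack the per-sentence mean vectors into an (n_sentences, 300) matrix
vec_matrix = np.vstack(df['sent_vectors'].values)

# One column per dimension, named vec_0 ... vec_299 for illustration
vec_cols = pd.DataFrame(vec_matrix,
                        columns=[f"vec_{i}" for i in range(vec_matrix.shape[1])],
                        index=df.index)
df = df.join(vec_cols)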

Note 1: what you call sentence in your dataframe is actually an instance of the Doc class.

Note 2: although you may prefer to go through a pandas dataframe, the recommended approach is via a getter extension:

import numpy as np
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_md")

sents = ["'Whitey on the Moon' is a 1970 spoken word",
         "St Anselm's Church is a Roman Catholic church",
         "Nymphargus grandisonae (common name: giant)"]

# Mean vector of the non-stop-word tokens of a Doc
vector_except_stopwords = lambda doc: np.mean(
    [token.vector for token in doc if not token.is_stop], axis=0)
Doc.set_extension("vector_except_stopwords", getter=vector_except_stopwords)

vecs = []  # for demonstration purposes
for doc in nlp.pipe(sents):
    vecs.append(doc._.vector_except_stopwords)
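Since the stated goal is to train a classifier on these vectors, the collected list can then be stacked into a feature matrix; a minimal sketch, where LogisticRegression and the labels y are placeholders rather than part of the original question:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.vstack(vecs)   # shape: (n_sentences, 300)
y = [0, 1, 0]         # hypothetical labels, for illustration only
clf = LogisticRegression().fit(X, y)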

Regarding "python - Extracting sentence embedding features with Pandas and spaCy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62676136/
