
python - How to efficiently apply pos_tag_sents() to a pandas DataFrame

Reposted. Author: 太空狗. Updated: 2023-10-29 18:03:34

When you want to POS-tag a column of text stored in a pandas DataFrame, with one sentence per row, most implementations on SO use the apply method:

from nltk import pos_tag, word_tokenize
# One pos_tag call per row
dfData['POSTags'] = dfData['SourceText'].apply(
    lambda row: pos_tag(word_tokenize(row)))

The NLTK documentation recommends using pos_tag_sents() to tag more than one sentence efficiently.

Does that apply to this example? If so, is the change as simple as replacing pos_tag with pos_tag_sents, or does NLTK mean text sources made up of paragraphs?

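As mentioned in the comments, pos_tag_sents() aims to avoid loading the perceptron tagger on every call. The question is how to do that and still end up with a column in a pandas DataFrame.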
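To make the cost the question is asking about concrete, here is a minimal sketch that runs without NLTK. FakeTagger, tag_one and tag_many are hypothetical stand-ins invented for this illustration, not NLTK names; the sketch only counts how often a "model" would be loaded under each calling style.

```python
# Illustration only: FakeTagger is a hypothetical stand-in, not an NLTK class.
class FakeTagger:
    loads = 0  # how many times the "model" has been loaded

    def __init__(self):
        # In a real tagger this step would be an expensive load from disk.
        FakeTagger.loads += 1

    def tag(self, tokens):
        # Toy rule: capitalized tokens get "NNP", everything else "NN".
        return [(tok, "NNP" if tok[:1].isupper() else "NN") for tok in tokens]


def tag_one(tokens):
    """Per-sentence style: builds a tagger on every call."""
    return FakeTagger().tag(tokens)


def tag_many(sentences):
    """Batch style: builds the tagger once and reuses it for every sentence."""
    tagger = FakeTagger()
    return [tagger.tag(tokens) for tokens in sentences]


sentences = [s.split() for s in ["active married Mr. Chang",
                                 "healthy single Mrs. Green"]]

FakeTagger.loads = 0
per_call = [tag_one(tokens) for tokens in sentences]
loads_per_call = FakeTagger.loads  # one load per sentence

FakeTagger.loads = 0
batched = tag_many(sentences)
loads_batched = FakeTagger.loads   # a single load for the whole batch
```

Both styles produce the same tags; only the number of loads differs, which is why a batch call is attractive for a 20k-row column.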

Link to Sample Dataset 20kRows

Best answer

Input

$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat

TL;DR

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0 cozily married practical athletics Mr. Brown flat
1 active married expensive soccer Mr. Chang flat
2 healthy single expensive badminton Mrs. Green ...
3 cozily married practical soccer Mr. Brown hier...
4 cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response
1   2           New Credit  no response
2   3  Collect Information     response
3   4  Collect Information     response
4   5  Collect Information     response

                                                Text  \
0  cozily married practical athletics Mr. Brown flat
1     active married expensive soccer Mr. Chang flat
2  healthy single expensive badminton Mrs. Green ...
3  cozily married practical soccer Mr. Brown hier...
4   cozily single practical badminton Mr. Brown flat

                                                 POS
0  [(cozily, RB), (married, JJ), (practical, JJ),...
1  [(active, JJ), (married, VBD), (expensive, JJ)...
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...
3  [(cozily, RB), (married, JJ), (practical, JJ),...
4  [(cozily, RB), (single, JJ), (practical, JJ), ...

In long:

First, you can extract the Text column into a list of strings:

texts = df['Text'].tolist()

Then you can apply the word_tokenize function to each string:

map(word_tokenize, texts)

Note that @Boud's suggestion is almost the same, using df.apply:

df['Text'].apply(word_tokenize)

Then dump the tokenized text into a list of lists of strings:

df['Text'].apply(word_tokenize).tolist()
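One practical difference between the two tokenization forms above: in Python 3, map(...) returns a lazy iterator that can be consumed only once, whereas .tolist() gives a real list of token lists. A small sketch of that behavior, using str.split as a stand-in for word_tokenize so it runs without NLTK data (the two sample strings are shortened from the input above):

```python
texts = ["cozily married practical athletics",
         "active married expensive soccer"]

# str.split stands in for word_tokenize here (an assumption for this demo).
lazy_tokens = map(str.split, texts)

first_pass = list(lazy_tokens)   # consumes the iterator
second_pass = list(lazy_tokens)  # nothing is left on a second pass
```

Passing the map object straight into a single batch call is fine, since it is read once. If the tokens are needed again later, materialize them into a list first.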

Then you can pass the result to pos_tag_sents:

pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )

Then add the result back to the DataFrame as a new column:

df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )
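The steps above could be wrapped in a small helper, sketched below. add_pos_column and its parameters are hypothetical names for this illustration; the tokenizer and the batch tagger are passed in as arguments so that the sketch itself depends only on pandas. Treating missing cells as empty text is an assumption added here, not something the answer covers.

```python
import pandas as pd


def add_pos_column(df, text_col, pos_col, tokenize, tag_sents):
    """Return a copy of df with pos_col holding one tagged-token list per row.

    tokenize:  callable, str -> list of str (for example nltk.word_tokenize)
    tag_sents: callable, list of token lists -> list of tagged lists
               (for example nltk.pos_tag_sents)
    """
    out = df.copy()
    # Assumption: missing cells are treated as empty text.
    texts = out[text_col].fillna("").astype(str).tolist()
    tokenized = [tokenize(text) for text in texts]
    tagged = tag_sents(tokenized)  # one batch call for the whole column
    out[pos_col] = tagged          # assigned positionally, one list per row
    return out


# Demo with stand-ins so it runs without NLTK data.
def toy_tag_sents(sentences):
    return [[(tok, "X") for tok in sent] for sent in sentences]


demo_df = pd.DataFrame({"Text": ["active married Mr. Chang", None]})
demo = add_pos_column(demo_df, "Text", "POS",
                      tokenize=str.split, tag_sents=toy_tag_sents)
```

With NLTK and its tokenizer and tagger data installed, the call would presumably look like add_pos_column(df, 'Text', 'POS', word_tokenize, pos_tag_sents).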

Regarding python - How to efficiently apply pos_tag_sents() to a pandas DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41674573/
