
python - Pandas + CountVectorizer : how to filter rows quickly

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:30:20

I have a text column in a Pandas DataFrame:

df['TEXT_COL']

Then I apply CountVectorizer to it:

vectorizer = CountVectorizer()
v = vectorizer.fit_transform(df['TEXT_COL'])

and get the set of words/features:

ft = vectorizer.get_feature_names()

and the TDM (term-document matrix):

m = vectorizer.transform(df['TEXT_COL'])

What I need: a slice of df that contains only the rows in which a specific feature from the feature set ft occurs.

How can I get it?

Pandas setup:

import pandas as pd

data = [('Word'), ('Word Sea Ocean'), ('Tree'), ('Forest Tree')]

df = pd.DataFrame(data)
df.columns = ['TEXT_COL']

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
v = vectorizer.fit_transform(df['TEXT_COL'])

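# get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use list(vectorizer.get_feature_names_out()) instead.
ft = vectorizer.get_feature_names()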
m = vectorizer.transform(df['TEXT_COL'])


for f in ft:

???

Best answer

Here is a small demo:

# execute your setup script ...

In [48]: vectorizer.vocabulary_
Out[48]: {'forest': 0, 'ocean': 1, 'sea': 2, 'tree': 3, 'word': 4}

m is a sparse matrix:

In [49]: m
Out[49]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>

We can convert it to a regular NumPy array:

In [50]: m.toarray()
Out[50]:
array([[0, 0, 0, 0, 1],
       [0, 1, 1, 0, 1],
       [0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0]], dtype=int64)
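
To make the columns of that array easier to read, one option is to wrap it in a DataFrame labelled with the vocabulary. The following is a minimal sketch, assuming the same four-row setup as in the question; the names `features` and `dense` are illustrative and not part of the original answer. Converting to a dense array is only reasonable for small matrices like this one.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'TEXT_COL': ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']})

vectorizer = CountVectorizer()
m = vectorizer.fit_transform(df['TEXT_COL'])

# Order the feature names by their column index in the matrix.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Dense, labelled view: one row per document, one column per feature.
dense = pd.DataFrame(m.toarray(), index=df.index, columns=features)
print(dense)
```

Building the labels from `vocabulary_` avoids depending on `get_feature_names()`, which is not available in every scikit-learn version.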

How to select the column for a specific feature:

In [51]: m[:, vectorizer.vocabulary_['sea']].toarray()
Out[51]:
array([[0],
       [1],
       [0],
       [0]], dtype=int64)

or using ft:

In [57]: m[:, ft.index('sea')].toarray()
Out[57]:
array([[0],
       [1],
       [0],
       [0]], dtype=int64)
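
Both lookups above expect the term exactly as it is stored in the vocabulary, and CountVectorizer lowercases text by default, so 'Sea' would not be found. A possible way to handle that is sketched below, assuming the default vectorizer settings; `feature_column` is a hypothetical helper, not something from the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(docs)


def feature_column(term):
    """Return the column index for `term`, or None if it is not a known feature.

    Hypothetical helper: the term is passed through the vectorizer's own
    analyzer, so 'Sea' and 'sea' resolve to the same feature.
    """
    tokens = vectorizer.build_analyzer()(term)
    if len(tokens) != 1:
        return None  # empty or multi-token input is not a single feature
    return vectorizer.vocabulary_.get(tokens[0])


col = feature_column('Sea')
print(col)
print(m[:, col].toarray().ravel())
```

The dictionary lookup in `vocabulary_` also avoids the linear scan that `ft.index(...)` performs on a list, which may matter for large vocabularies.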

In [52]: df
Out[52]:
         TEXT_COL
0            Word
1  Word Sea Ocean
2            Tree
3     Forest Tree

Let's show all rows that contain the feature 'tree':

In [71]: idx = m[:, ft.index('tree')] == 1

In [72]: df[idx.toarray()]
Out[72]:
      TEXT_COL
2         Tree
3  Forest Tree

Or like this:

In [77]: df[m[:, ft.index('tree')].astype(bool).toarray()]
Out[77]:
      TEXT_COL
2         Tree
3  Forest Tree
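
Two details of the filters above are worth keeping in mind: comparing with `== 1` only matches rows where the word occurs exactly once, and `.toarray()` yields a two-dimensional (n, 1) mask. The sketch below is one way to cover both and to complete the `for f in ft:` loop from the question. It assumes the setup from the question plus one extra row ('Tree tree') added here purely for illustration; `rows_with` and `slices` are hypothetical names.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# The last row is an assumption added for illustration: 'tree' occurs twice in it.
df = pd.DataFrame({'TEXT_COL': ['Word', 'Word Sea Ocean', 'Tree',
                                'Forest Tree', 'Tree tree']})

vectorizer = CountVectorizer()
m = vectorizer.fit_transform(df['TEXT_COL'])


def rows_with(feature):
    """Hypothetical helper: rows of df whose text contains `feature` at least once."""
    col = vectorizer.vocabulary_[feature]
    # (m[:, col] > 0) is still sparse; flatten the dense result to a 1-D boolean mask.
    mask = (m[:, col] > 0).toarray().ravel()
    return df[mask]


# One slice per feature, mirroring the `for f in ft:` loop from the question.
slices = {f: rows_with(f) for f in vectorizer.vocabulary_}

print(slices['tree'])
```

Using `> 0` keeps the 'Tree tree' row, which an `== 1` comparison would drop, and the flattened mask has the one-dimensional shape that pandas boolean indexing expects.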

For "python - Pandas + CountVectorizer : how to filter rows quickly", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42931920/
