
python - Implementing bag of words in scikit-learn

Reposted. Author: 行者123. Last updated: 2023-12-04 10:52:01

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
headers = ['label', 'sms_message']
df = pd.read_csv ('spam.csv', names = headers)
df ['label'] = df['label'].map({'ham': 0, 'spam': 1})
print (df.head(7))
print (df.shape)
count_vector = CountVectorizer()
#count_vector.fit(df)
y = count_vector.fit_transform(df)
count_vector.get_feature_names()
doc_array = y.toarray()
print (doc_array)
frequency_matrix = pd.DataFrame(doc_array, columns = count_vector.get_feature_names())
frequency_matrix

Sample data and output:
   label                                        sms_message
0 0 Go until jurong point, crazy.. Available only ...
1 0 Ok lar... Joking wif u oni...
2 1 Free entry in 2 a wkly comp to win FA Cup fina...
3 0 U dun say so early hor... U c already then say...

(5573, 2)
[[1 0]
[0 1]]

label sms_message
0 1 0
1 0 1

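My question:

My CSV file is basically many rows of SMS messages.

I don't understand why the output contains only the column labels rather than the full SMS text of each row.

Thanks for any help.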

Best answer

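Iterating over a DataFrame yields its column labels, so fit_transform(df) treats 'label' and 'sms_message' as the only two documents. Pass only the sms_message column to the count vectorizer, as shown below.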

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Tea is an aromatic beverage..',
'After water, it is the most widely consumed drink in the world',
'There are many different types of tea.',
'Tea has a stimulating effect in humans.',
'Tea originated in Southwest China during the Shang dynasty']

df = pd.DataFrame({'sms_message': docs, 'label': np.random.choice([0, 1], size=5)})

cv = CountVectorizer()
counts = cv.fit_transform(df['sms_message'])

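# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later
df_counts = pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out())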
df_counts['label'] = df['label']

Output:
df_counts

Out[26]:
after an are aromatic beverage ... types water widely world label
0 0 1 0 1 1 ... 0 0 0 0 1
1 1 0 0 0 0 ... 0 1 1 1 0
2 0 0 1 0 0 ... 1 0 0 0 1
3 0 0 0 0 0 ... 0 0 0 0 1
4 0 0 0 0 0 ... 0 0 0 0 0

[5 rows x 32 columns]
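
To see why the question's code produced a 2x2 matrix with the columns label and sms_message, the sketch below compares fitting on a whole DataFrame with fitting on the text column. The two-row DataFrame is a hypothetical stand-in for spam.csv, not the asker's real data, and the variable names are chosen only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-row stand-in for spam.csv (not the asker's real data).
df = pd.DataFrame({
    "label": [0, 1],
    "sms_message": ["Ok lar... Joking wif u oni...",
                    "Free entry in 2 a wkly comp"],
})

# Iterating a DataFrame yields its column labels, not its rows.
docs_seen = list(df)

# Fitting on the whole frame therefore treats each column name as a document.
whole_vocab = sorted(CountVectorizer().fit(df).vocabulary_)

# Fitting on the text column builds the vocabulary from the messages.
col_vocab = sorted(CountVectorizer().fit(df["sms_message"]).vocabulary_)

print(docs_seen)    # the column labels
print(whole_vocab)  # vocabulary learned from the column labels
print(col_vocab)    # vocabulary learned from the SMS text
```

With the default tokenizer, single-character tokens such as "u", "2", and "a" are dropped, so they do not appear in the vocabulary learned from the messages.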

Regarding python - Implementing bag of words in scikit-learn, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59435472/
