
python - Read a SQL file and get word occurrences using CountVectorizer

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:41:29

I want to read a SQL file and use CountVectorizer to get the word occurrences.

So far I have the following code:

import re
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer




df = pd.read_sql(q, dlconn)
print(df)

count_vect = CountVectorizer()
X_train_counts= count_vect.fit_transform(df)

print(X_train_counts.shape)
print(count_vect.vocabulary_)

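This gives the output 'cat': 1, 'dog': 0

It seems to take only the name of the column, animal, and count from there.

How do I get it to access the full column and produce a chart showing each word in the column and its frequency?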

Best answer

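According to the CountVectorizer docs, the fit_transform() method expects an iterable of strings. It cannot handle a DataFrame directly.

Iterating over a DataFrame returns the column labels, not the values, which is why only the column name was counted. I suggest you try df.itertuples() instead.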
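To see the difference, here is a minimal sketch. The DataFrame is a hypothetical stand-in for the result of pd.read_sql(): the column name animal comes from the question, while the row values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the pd.read_sql() result; the values are assumed.
df = pd.DataFrame({"animal": ["my big dog", "my lazy cat"]})

# Iterating over the DataFrame itself yields its column labels ...
labels = list(df)

# ... whereas iterating over a single column yields the cell values.
values = list(df["animal"])

print(labels)
print(values)
```

Passing df to fit_transform() therefore feeds the vectorizer the labels only.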

Try something like this:

# Take the first field of every row as the text to vectorize
value_list = [
    row[0]
    for row in df.itertuples(index=False, name=None)]
print(value_list)
print(type(value_list))
print(type(value_list[0]))

X_train_counts = count_vect.fit_transform(value_list)

Each value in value_list should be of type str. Let us know if this helps.
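If the column contains missing or non-string values, the str requirement will not hold. One possible way around that, sketched here under the assumption that the text lives in a column named animal (the sample rows, including the None, are invented), is to select the column, drop missing entries, and cast to str before vectorizing:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data with a missing value, standing in for the SQL result.
df = pd.DataFrame({"animal": ["my big dog", None, "my lazy cat"]})

# Drop missing entries and make sure every remaining value is a str.
texts = df["animal"].dropna().astype(str)

# A pandas Series is an iterable of its values, so it can be passed directly.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(texts)

print(X_train_counts.shape)
print(count_vect.vocabulary_)
```

This is an alternative to the itertuples() approach above, not a replacement for it; both hand the vectorizer one string per row.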


Here is a small example:

>>> import pandas as pd
>>> df = pd.DataFrame(['my big dog', 'my lazy cat'])
>>> df
             0
0   my big dog
1  my lazy cat

>>> value_list = [row[0] for row in df.itertuples(index=False, name=None)]
>>> value_list
['my big dog', 'my lazy cat']

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer()
>>> x_train = cv.fit_transform(value_list)
>>> x_train
<2x5 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> x_train.toarray()
array([[1, 0, 1, 0, 1],
       [0, 1, 0, 1, 1]], dtype=int64)
>>> cv.vocabulary_
{'my': 4, 'big': 0, 'dog': 2, 'lazy': 3, 'cat': 1}

Now you can display the word counts for each row (that is, for each input string separately):

>>> for word, col in cv.vocabulary_.items():
...     for row in range(x_train.shape[0]):
...         print('word:{:10s} | row:{:2d} | count:{:2d}'.format(word, row, x_train[row,col]))
word:my         | row: 0 | count: 1
word:my         | row: 1 | count: 1
word:big        | row: 0 | count: 1
word:big        | row: 1 | count: 0
word:dog        | row: 0 | count: 1
word:dog        | row: 1 | count: 0
word:lazy       | row: 0 | count: 0
word:lazy       | row: 1 | count: 1
word:cat        | row: 0 | count: 0
word:cat        | row: 1 | count: 1

You can also display the total count per word (summed over all rows):

>>> x_train_sum = x_train.sum(axis=0)
>>> x_train_sum
matrix([[1, 1, 1, 1, 2]], dtype=int64)
>>> for word, col in cv.vocabulary_.items():
...     print('word:{:10s} | count:{:2d}'.format(word, x_train_sum[0, col]))
word:my         | count: 2
word:big        | count: 1
word:dog        | count: 1
word:lazy       | count: 1
word:cat        | count: 1

Or write the totals to a semicolon-separated file:

>>> with open('my-file.csv', 'w') as f:
...     for word, col in cv.vocabulary_.items():
...         f.write('{};{}\n'.format(word, x_train_sum[0, col]))
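For the frequency overview the question asks about, the totals could also be collected into a pandas Series sorted by count. This is only a sketch built on the same two example strings; the names totals, words, and freq are introduced here for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

value_list = ["my big dog", "my lazy cat"]
cv = CountVectorizer()
x_train = cv.fit_transform(value_list)

# Sum each column over all rows and flatten the result to a 1-D array.
totals = np.asarray(x_train.sum(axis=0)).ravel()

# Order the words by their column index so they line up with the totals.
words = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

# One entry per word, most frequent first.
freq = pd.Series(totals, index=words).sort_values(ascending=False)
print(freq)
```

If matplotlib is installed, calling freq.plot.bar() on the result should draw the counts as a bar chart.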

This should clarify how you can use the tools you have.

Regarding python - Read a SQL file and get word occurrences using CountVectorizer, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52337046/
