
python - scikit-learn: ValueError: np.nan is an invalid document in TfidfVectorizer

Reposted. Author: IT老高. Updated: 2023-10-28 21:52:57

I am using scikit-learn's TfidfVectorizer to extract features from text data. I have a CSV file with a score (either +1 or -1) and a review (text). I load this data into a DataFrame so that I can run the vectorizer.

Here is my code:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("train_new.csv",
                 names=['Score', 'Review'], sep=',')

# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()

v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])

Here is the traceback of the error I get:

Traceback (most recent call last):
  File "/home/PycharmProjects/Review/src/feature_extraction.py", line 16, in <module>
    x = v.fit_transform(df['Review'])
  File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.

I checked the CSV file and the DataFrame for anything being read as NaN, but I could not find anything. There are 18000 rows, and none of them returns True for isnan.

This is what df['Review'].head() looks like:

0    This book is such a life saver.  It has been s...
1    I bought this a few times for my older son and...
2    This is great for basics, but I wish the space...
3    This book is perfect!  I'm a first time new mo...
4    During your postpartum stay at the hospital th...
Name: Review, dtype: object

Best answer

You need to convert the dtype object to unicode strings, as the traceback clearly states.

x = v.fit_transform(df['Review'].values.astype('U'))  ## Even astype(str) would work
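As a rough illustration of what that cast does, the sketch below uses a small made-up Series in place of df['Review'] (the sample reviews and variable names are assumptions, not data from the question). It suggests that astype('U') turns a missing value into the literal string "nan", which the vectorizer then treats as an ordinary token. Filling missing values with an empty string is shown as an alternative; that variant is not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for df['Review']; the second entry is missing.
reviews = pd.Series(
    ["This book is perfect", np.nan, "great for basics"], dtype=object
)

# Approach from the answer: cast to unicode. NaN becomes the string "nan".
as_unicode = reviews.values.astype("U")

# Alternative (an assumption, not from the answer): replace NaN with "".
filled = reviews.fillna("")

v1 = TfidfVectorizer()
x1 = v1.fit_transform(as_unicode)

v2 = TfidfVectorizer()
x2 = v2.fit_transform(filled)

# Check whether "nan" ended up as a vocabulary term in each case.
nan_in_vocab_cast = "nan" in v1.vocabulary_
nan_in_vocab_filled = "nan" in v2.vocabulary_
print(nan_in_vocab_cast, nan_in_vocab_filled)
```

Which variant is preferable probably depends on whether a spurious "nan" feature matters for the downstream model; dropping the affected rows with dropna() is another option.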

From the documentation page for TfidfVectorizer:

fit_transform(raw_documents, y=None)

Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
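Since every document must be a string, it can help to find the entries that are not. The question reports that no row tested as NaN. One possible explanation, assuming the check resembled the commented-out `df['Review'] == np.nan` line, is that equality comparisons never match NaN. The sketch below uses a small made-up DataFrame (hypothetical data) to contrast that comparison with `isna()`. Note that pandas.read_csv reads empty fields as NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame read from the CSV file.
df = pd.DataFrame({
    "Score": [1, -1, 1],
    "Review": pd.Series(["good", np.nan, "bad"], dtype=object),
})

# NaN is not equal to anything, including itself, so this finds nothing.
found_by_equality = int((df["Review"] == np.nan).sum())

# isna() is the reliable way to flag missing values.
missing_mask = df["Review"].isna()
missing_rows = df.index[missing_mask].tolist()

print(found_by_equality)  # number of rows matched by the equality test
print(missing_rows)       # index labels of rows whose Review is missing
```

Inspecting the rows listed in missing_rows against the raw CSV may show empty review fields that were parsed as NaN.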

Regarding python - scikit-learn: ValueError: np.nan is an invalid document in TfidfVectorizer, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39303912/
