
python-3.x - ValueError: X has 1709 features per sample; expecting 2444


I am using this code:

import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
import re

Vectorizing with TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
tv=TfidfVectorizer(max_df=0.5,min_df=2,stop_words='english')

Loading the data files:

df=pd.read_json('train.json',orient='columns')
test_df=pd.read_json('test.json',orient='columns')

df['seperated_ingredients'] = df['ingredients'].apply(','.join)
test_df['seperated_ingredients'] = test_df['ingredients'].apply(','.join)

df['seperated_ingredients']=df['seperated_ingredients'].str.lower()
test_df['seperated_ingredients']=test_df['seperated_ingredients'].str.lower()

cuisines={'thai':0,'vietnamese':1,'spanish':2,'southern_us':3,'russian':4,'moroccan':5,'mexican':6,'korean':7,'japanese':8,'jamaican':9,'italian':10,'irish':11,'indian':12,'greek':13,'french':14,'filipino':15,'chinese':16,'cajun_creole':17,'british':18,'brazilian':19 }
df.cuisine= [cuisines[item] for item in df.cuisine]

Preprocessing:

ho=df['seperated_ingredients']
ho=ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho=ho.replace('\'"', '', regex=True)

ho=tv.fit_transform(ho)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(ho,df['cuisine'],random_state=0)


from sklearn.linear_model import LogisticRegression
clf= LogisticRegression(penalty='l1')
clf.fit(X_train, y_train)
clf.score(X_test,y_test)

from sklearn.linear_model import LogisticRegression
clf1= LogisticRegression(penalty='l1')
clf1.fit(ho,df['cuisine'])

hs=test_df['seperated_ingredients']

hs=hs.replace(r'#([^\s]+)', r'\1', regex=True)
hs=hs.replace('\'"', '', regex=True)
hs=tv.fit_transform(hs)

ss=clf1.predict(hs) # this line is giving error.

The error above occurs at prediction time. Does anyone know what I am doing wrong?

Best Answer

You should not refit the tfidf-vectorizer; instead, encode the test data with the same vectorizer, i.e. with the vocabulary (and feature shape) it has already learned. Both methods are described in the docs:

fit_transform(raw_documents, y=None)
Learn vocabulary and idf, return term-document matrix.
This is equivalent to fit followed by transform, but more efficiently implemented.

transform(raw_documents, copy=True)
Transform documents to document-term matrix.
Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).

You get ValueError: X has 1709 features per sample; expecting 2444 because the vectorizer was refitted on the test data and built a new vocabulary, so the test data was encoded into an array of a different shape. Just check the vocabulary size with print(len(tv.vocabulary_)) before and after the second fit_transform. Moreover, the tf-idf vocabulary may also be reordered when the vectorizer is refitted.
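To see this concretely, here is a minimal, self-contained sketch with made-up toy corpora (not the asker's data) that reproduces the mismatch and the fix:

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["garlic onion tomato", "soy sauce ginger", "basil tomato olive oil"]
test_docs = ["ginger garlic", "lime cilantro"]

vec = TfidfVectorizer()                      # separate toy vectorizer, default settings
X_train = vec.fit_transform(train_docs)
print(len(vec.vocabulary_), X_train.shape)   # vocabulary learned from the training corpus

X_wrong = vec.fit_transform(test_docs)       # refit: a new, smaller vocabulary
print(len(vec.vocabulary_), X_wrong.shape)   # different column count, so a model trained on X_train rejects it

vec.fit(train_docs)                          # re-learn the training vocabulary
X_right = vec.transform(test_docs)           # encode the test data with the training vocabulary
print(X_right.shape)                         # same column count as X_train

Applied to the code in the question, keep fit_transform on the training data only: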

ho=df['seperated_ingredients']
ho=ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho=ho.replace('\'"', '', regex=True)
ho=tv.fit_transform(ho)  # learn the vocabulary from the training data

Then use the already fitted tf-idf vectorizer to encode the test data with its transform method:

hs=test_df['seperated_ingredients']
hs=hs.replace(r'#([^\s]+)', r'\1', regex=True)
hs=hs.replace('\'"', '', regex=True)
hs=tv.transform(hs)  # reuse the vocabulary learned on the training data

The transform uses the same vocabulary, so the output array has the correct shape.
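As a quick sanity check (same variable names as in the snippets above, assuming they ran in this order), the transformed test matrix now has exactly as many columns as the matrix the classifier was fitted on:

print(ho.shape[1], hs.shape[1], len(tv.vocabulary_))  # all three numbers should match
ss = clf1.predict(hs)                                 # no more feature-count mismatch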

Regarding python-3.x - ValueError: X has 1709 features per sample; expecting 2444, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52150800/
