python - sklearn : Text and Numeric features with ColumnTransformer has value error


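I'm trying to use SKLearn 0.20.2 to build a pipeline that uses the new ColumnTransformer feature. My problem is that when I run my classifier, clf.fit(x_train, y_train), I keep getting this error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

I have one column of text blocks called text. All of my other columns are numeric. I'm trying to use CountVectorizer in my pipeline, and I think that is where the trouble is. Any help would be much appreciated.

After I run the pipeline and inspect x_train/y_train, it looks like this, in case that helps (omitting the row numbers normally shown in the left column; the text column also runs taller than the image shows).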

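from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline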

# mapped to column names from dataframe
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# mapped to column names from dataframe
text_features = ['text']
text_transformer = Pipeline(steps=[
    ('vect', CountVectorizer())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('text', text_transformer, text_features),
    ]
)

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', MultinomialNB())
])

x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)
clf.fit(x_train,y_train)

Best answer

Vadim is correct. If you run this code

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = SimpleImputer(strategy='median')

num = numeric_transformer.fit_transform(df[numeric_features])

# num.shape
# (3, 4)

text_features = ['text']
text_transformer = CountVectorizer()

text = text_transformer.fit_transform(df[text_features])

# get_feature_names() matches SKLearn 0.20.2; on scikit-learn 1.0+ use get_feature_names_out()
print(text_transformer.get_feature_names())
print(text.toarray())

the output will look like this.

['text']
[[1]]

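This is a glitch I have run into more than once in text processing: CountVectorizer was handed a one-column DataFrame and treated its column label, 'text', as the only document.

If you define text_features as a string instead of a one-element list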
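The behaviour can be reproduced outside the pipeline. The following is a minimal sketch, assuming a hypothetical three-row DataFrame whose text values were made up to match the vocabulary printed in this answer; it reads the fitted vocabulary_ attribute instead of get_feature_names() so that it does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({"text": ["17569 8779", "16118 9480", "123 456"]})

# Iterating a DataFrame yields its column labels ...
print(list(df[["text"]]))
# ... while iterating a Series yields the cell values.
print(list(df["text"]))

# CountVectorizer simply iterates over its input, so the two selections
# produce very different vocabularies.
frame_vocab = sorted(CountVectorizer().fit(df[["text"]]).vocabulary_)
series_vocab = sorted(CountVectorizer().fit(df["text"]).vocabulary_)
print(frame_vocab)
print(series_vocab)
```

With the list selection the vectorizer is fitted on a single document, the column label itself, which is why the resulting matrix has one row and cannot be stacked next to the numeric block.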

text_features = 'text'
text_transformer = CountVectorizer()

text = text_transformer.fit_transform(df[text_features])

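# get_feature_names() matches SKLearn 0.20.2; on scikit-learn 1.0+ use get_feature_names_out()
print(text_transformer.get_feature_names())
print(text.toarray())

it becomes this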

['123', '16118', '17569', '456', '8779', '9480']
[[0 0 1 0 1 0]
[0 1 0 0 0 1]
[1 0 0 1 0 0]]

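which is what you want.

Passing the column name as a list selects a one-column DataFrame, and iterating over a DataFrame yields its column labels, so CountVectorizer sees only one item. Passing the name as a string selects a Series, whose elements are the actual texts. The same rule applies to the column argument of ColumnTransformer: a scalar string hands the transformer a 1-D column, while a list hands it a 2-D selection.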
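Applied to the pipeline from the question, the change amounts to giving ColumnTransformer the text column as the string 'text' rather than the list ['text']. The sketch below is one way to wire that up, not the asker's final code: the toy features and labels are hypothetical stand-ins for the real DataFrame, and the train/test split is left out to keep it short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the question's DataFrame.
features = pd.DataFrame({
    "hasDate": [1, 0, 1, 0],
    "iterationCount": [3, None, 1, 2],  # one missing value for the imputer
    "hasItemNumber": [0, 1, 1, 0],
    "isEpic": [0, 0, 1, 1],
    "text": ["fix login bug", "add export feature",
             "fix export bug", "add login feature"],
})
labels = [0, 1, 0, 1]

numeric_features = ["hasDate", "iterationCount", "hasItemNumber", "isEpic"]

preprocessor = ColumnTransformer(
    transformers=[
        # A list of names selects a 2-D block, which SimpleImputer expects.
        ("num", SimpleImputer(strategy="median"), numeric_features),
        # A scalar string selects a 1-D column, which CountVectorizer expects.
        ("text", CountVectorizer(), "text"),
    ]
)

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", MultinomialNB()),
])

clf.fit(features, labels)
predictions = clf.predict(features)

# 4 numeric columns plus 6 distinct tokens give 10 feature columns.
transformed = clf.named_steps["preprocessor"].transform(features)
print(transformed.shape)
```

Both blocks now have one row per sample, so they can be concatenated side by side and the ValueError about mismatched dimensions no longer occurs.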

Regarding python - sklearn : Text and Numeric features with ColumnTransformer has value error, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54541490/