
python - Building a ColumnTransformer with numeric, categorical, and text pipelines

Reposted. Author: 行者123. Updated: 2023-12-04 09:39:34

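I am trying to build a pipeline that handles numeric, categorical, and text variables. I want the preprocessed data written to a new DataFrame before I run a classifier. I get the following error: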

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2499 and the array at index 2 has size 1.



Note that 2499 is the size of my training data. If I remove the text_preprocessing part of the pipeline, my code works. Any ideas on how to get this working? Thanks!
# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
        ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
    ]
)

# Numeric pipeline
numeric_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='mean')),
        ('Scaling', StandardScaler())
    ]
)

# Text pipeline
text_preprocessing = Pipeline(
    [
        ('Text', TfidfVectorizer())
    ]
)

# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (numeric_features, numeric_preprocessing),
    (categorical_features, categorical_preprocessing),
    (text_features, text_preprocessing),
)

# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)

test = pipeline.fit_transform(x_train)

Best answer

I suspect you tried swapping the features and the pipelines in make_column_transformer and did not change them back when you posted the question.

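Assuming you have the order right (estimator, column(s)), this error occurs when a vectorizer in a ColumnTransformer is given a list of column names. All vectorizers in sklearn accept only 1D input (an iterable of strings such as a pd.Series), so they cannot be applied to several columns this way. With a list, ColumnTransformer passes a 2D DataFrame to the vectorizer. Iterating over a DataFrame yields its column labels rather than its rows, so the vectorizer sees one "document" per column and returns a single row. That is why the array at index 2 has size 1 in the error message.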
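The difference between 1D and 2D input can be checked on the vectorizer alone, outside any ColumnTransformer. The following is a minimal sketch that reuses the 'summary' column from the example data in this answer; the variable names (shape_1d, shape_2d, vocab_2d) are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'summary': ['Great performance',
                               'fantastic performance',
                               'Could have been better']})

# 1D input (a Series): the vectorizer iterates over the three strings,
# so the result has one row per document.
shape_1d = TfidfVectorizer().fit_transform(df['summary']).shape

# 2D input (a one-column DataFrame): iterating a DataFrame yields its
# column labels, so the only "document" seen is the string 'summary'.
vec_2d = TfidfVectorizer()
shape_2d = vec_2d.fit_transform(df[['summary']]).shape
vocab_2d = sorted(vec_2d.vocabulary_)

print(shape_1d[0])         # number of rows for the Series input
print(shape_2d, vocab_2d)  # shape and vocabulary for the DataFrame input
```

The row count of the second result equals the number of columns passed in, not the number of samples, which is what makes the later concatenation fail.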

Example:

import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

x_train = pd.DataFrame({'fruit': ['apple', 'orange', np.nan],
                        'score': [np.nan, 12, 98],
                        'summary': ['Great performance',
                                    'fantastic performance',
                                    'Could have been better']}
                       )

# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
        ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
    ]
)

# Numeric pipeline
numeric_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(strategy='mean')),
        ('Scaling', StandardScaler())
    ]
)

# Text pipeline
text_preprocessing = Pipeline(
    [
        ('Text', TfidfVectorizer())
    ]
)

# Creating preprocessing pipeline: (estimator, columns) pairs.
# The text column is a scalar string, so the vectorizer receives 1D data.
preprocessing = make_column_transformer(
    (numeric_preprocessing, ['score']),
    (categorical_preprocessing, ['fruit']),
    (text_preprocessing, 'summary'),
)

# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)

test = pipeline.fit_transform(x_train)

If I change
    (text_preprocessing, 'summary'),

to

    (text_preprocessing, ['summary']),

it throws a

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 3 and the array at index 2 has size 1
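
When there is more than one text column, one possible workaround is to register a separate vectorizer for each column, each with a scalar column name. The sketch below assumes a second text column named 'comment', which is hypothetical and does not appear in the original question. It also uses TfidfVectorizer directly rather than wrapping it in a Pipeline, purely to keep the example short.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

# 'comment' is a hypothetical second text column, added for illustration only.
x_train = pd.DataFrame({
    'summary': ['Great performance',
                'fantastic performance',
                'Could have been better'],
    'comment': ['would watch again', 'loved it', 'too long'],
})

text_columns = ['summary', 'comment']

# One (vectorizer, scalar column name) pair per text column, so that
# each vectorizer receives a 1D Series instead of a 2D DataFrame.
preprocessing = make_column_transformer(
    *[(TfidfVectorizer(), col) for col in text_columns]
)

features = preprocessing.fit_transform(x_train)
print(features.shape)  # rows should match the number of training samples
```

Each vectorizer learns its own vocabulary, and the ColumnTransformer stacks the resulting blocks side by side, so the row count matches the other transformers.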

Regarding "python - Building a ColumnTransformer with numeric, categorical, and text pipelines", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62391670/
