
python - How to run a model on new data that requires pd.get_dummies

Reposted. Author: 行者123. Updated: 2023-12-04 01:23:30

I have a model that runs the following:

import pandas as pd
import numpy as np

# initialize list of lists
data = [['tom', 10,1,'a'], ['tom', 15,5,'a'], ['tom', 14,1,'a'], ['tom', 15,4,'b'], ['tom', 18,1,'b'], ['tom', 15,6,'a'], ['tom', 17,3,'a']
, ['tom', 14,7,'b'], ['tom',16 ,6,'a'], ['tom', 22,2,'a'],['matt', 10,1,'c'], ['matt', 15,5,'b'], ['matt', 14,1,'b'], ['matt', 15,4,'a'], ['matt', 18,1,'a'], ['matt', 15,6,'a'], ['matt', 17,3,'a']
, ['matt', 14,7,'c'], ['matt',16 ,6,'b'], ['matt', 10,2,'b']]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category'])

print(df.head(2))
   Name  Attempts  Score Category
0   tom        10      1        a
1   tom        15      5        a

I then used the following code to create a dummy-encoded df to use in the model:
from sklearn.linear_model import LogisticRegression

df_dum = pd.get_dummies(df)
print(df_dum.head(2))
   Attempts  Score  Name_matt  Name_tom  Category_a  Category_b  Category_c
0        10      1          0         1           1           0           0
1        15      5          0         1           1           0           0

Then I created the following model:
#Model

X = df_dum.drop(('Score'),axis=1)
y = df_dum['Score'].values

#Training Size
train_size = int(X.shape[0]*.7)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]


#Fit Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)


#Send predictions back to dataframe
Z = model.predict(X_test)
zz = model.predict_proba(X_test)

df.loc[train_size:,'predictions']=Z
dfpredictions = df.dropna(subset=['predictions'])

print(dfpredictions)
    Name  Attempts  Score Category  predictions
14  matt        18      1        a          1.0
15  matt        15      6        a          1.0
16  matt        17      3        a          1.0
17  matt        14      7        c          1.0
18  matt        16      6        b          1.0
19  matt        10      2        b          1.0

Now I have new data that I want to predict on:
newdata = [['tom', 10,'a'], ['tom', 15,'a'], ['tom', 14,'a']]

newdf = pd.DataFrame(newdata, columns = ['Name', 'Attempts','Category'])

print(newdf)

  Name  Attempts Category
0  tom        10        a
1  tom        15        a
2  tom        14        a

Then I create the dummies and run the prediction:
newpredict = pd.get_dummies(newdf)

predict = model.predict(newpredict)

Output:
ValueError: X has 3 features per sample; expecting 6

This makes sense, because the new data contains no category b or c and no name matt.
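To see which dummy columns the new data is missing, the two column sets can be compared directly. This is only a small sketch on made-up frames (the names train, new and missing are illustrative and not part of the original question):

import pandas as pd

# Toy stand-ins for the training data and the new data (hypothetical values)
train = pd.DataFrame({"Name": ["tom", "matt"], "Attempts": [10, 15], "Category": ["a", "c"]})
new = pd.DataFrame({"Name": ["tom"], "Attempts": [12], "Category": ["a"]})

train_cols = pd.get_dummies(train).columns
new_cols = pd.get_dummies(new).columns

# Dummy columns the model was trained on that the new data never produced
missing = [c for c in train_cols if c not in new_cols]
print(missing)

Every column in that list is one the fitted model still expects, which is why the feature count no longer matches.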

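My question is: given that my new data will not always contain the full set of columns used in the original data, what is the best way to set up this model? I receive new data every day, so I am not sure what the most efficient and error-free approach is.

This is just sample data; my real dataset has 2000 columns after running pd.get_dummies. Thanks a lot!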

Best answer

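Let me explain Nicolas's and BlueSkyz's suggestions in more detail.
pd.get_dummies is useful when you are sure that a given categorical variable will never have new categories in the production/new dataset, for example gender, products, and so on, as guaranteed by your company's or database's internal data classification/consistency rules.

However, for most machine learning tasks you can expect categories to show up in the future that were not used in model training, and sklearn's OneHotEncoder should then be the standard choice. Its handle_unknown parameter can be set to 'ignore' to do exactly that: ignore new categories when the encoder is applied later. From the documentation: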

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None
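As a quick illustration of the behaviour quoted above, here is a minimal sketch on made-up data (the frames train and new are hypothetical). It calls .toarray() on the default sparse result so it does not depend on the sparse/sparse_output keyword, which differs between scikit-learn versions:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "d" never appears during fitting
train = pd.DataFrame({"Category": ["a", "b", "c"]})
new = pd.DataFrame({"Category": ["a", "d"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# The encoder returns a sparse matrix by default; convert it for display
encoded = enc.transform(new).toarray()
print(encoded)

The row for "a" has a single 1 in the Category_a position, while the row for the unseen "d" is all zeros, and the output keeps the three columns learned during fit.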



The full workflow for your example, using OneHotEncoder, looks like this:
# Create a boolean mask that marks the categorical (object-dtype) columns
categorical_feature_mask = df.dtypes == object
# Collect the categorical column names into a list for easy reference later,
# in case you have more than a couple of categorical columns
categorical_cols = df.columns[categorical_feature_mask].tolist()

# Instantiate the OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# scikit-learn >= 1.2 uses sparse_output=False; older releases used sparse=False
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Fit the encoder on the training data, then transform it
ohe.fit(df[categorical_cols])
cat_ohe = ohe.transform(df[categorical_cols])

# Create a pandas DataFrame from the one-hot encoded columns
# (get_feature_names_out replaces get_feature_names, removed in scikit-learn 1.2)
ohe_df = pd.DataFrame(cat_ohe, columns=ohe.get_feature_names_out(categorical_cols))
# Concatenate with the original data and drop the original categorical columns
df_ohe = pd.concat([df, ohe_df], axis=1).drop(columns=categorical_cols)

# The following code is for your newdf, after training and testing on the original df
# Apply the already-fitted encoder to newdf (transform only, do not refit)
cat_ohe_new = ohe.transform(newdf[categorical_cols])
# Create a pandas DataFrame from the one-hot encoded columns
ohe_df_new = pd.DataFrame(cat_ohe_new, columns=ohe.get_feature_names_out(categorical_cols))
# Concatenate with the new data and drop the original categorical columns
df_ohe_new = pd.concat([newdf, ohe_df_new], axis=1).drop(columns=categorical_cols)

# Predict on df_ohe_new
predict = model.predict(df_ohe_new)

Output (you can assign it back to newdf):
array([1, 1, 1])
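If new data arrives every day, one option is to bundle the encoder and the model into a single scikit-learn Pipeline, so that the same fitted encoding is always applied before predicting. This is not part of the original answer, just a sketch under assumptions: the toy data, the step names "onehot", "preprocess" and "model", and the variable names are all made up for illustration.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with the same column layout as the question
train = pd.DataFrame({
    "Name": ["tom", "tom", "matt", "matt", "tom", "matt"],
    "Attempts": [10, 15, 14, 18, 17, 16],
    "Category": ["a", "b", "c", "a", "b", "c"],
    "Score": [1, 5, 1, 5, 1, 5],
})
categorical_cols = ["Name", "Category"]

# One-hot encode the categorical columns, pass the numeric ones through unchanged
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocess), ("model", LogisticRegression(max_iter=1000))])
pipe.fit(train.drop(columns="Score"), train["Score"])

# New data may contain unseen values ("ann", "z"); they are encoded as all zeros
new = pd.DataFrame({"Name": ["tom", "ann"], "Attempts": [12, 13], "Category": ["a", "z"]})
pred = pipe.predict(new)
print(pred)

The fitted pipeline can be treated as one object, so the raw newdf-style frame goes straight into predict without any manual column bookkeeping.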

However, if you really want to use only pd.get_dummies, the following also works:
newpredict = newpredict.reindex(labels = df_dum.columns, axis = 1, fill_value = 0).drop(columns = ['Score'])
predict = model.predict(newpredict)

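The snippet above ensures that your new dummy df (newpredict) has the same columns as the original df_dum (missing columns are filled with 0), and then drops the 'Score' column. The output is the same as above. It also ensures that the dummy column for any categorical value that appears in the new dataset but not in the original training data is dropped, while the column order is kept the same as in the original df_dum.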
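Both effects of reindex described here can be seen on a small made-up frame. This is a sketch only: train_columns stands in for the training columns (df_dum.columns without 'Score'), and the value "ann" is a hypothetical unseen name.

import pandas as pd

# Stand-in for the dummy columns seen during training (hypothetical)
train_columns = ["Attempts", "Name_matt", "Name_tom", "Category_a", "Category_b"]

# New data: no "matt", no category "b", and an unseen name "ann"
new = pd.DataFrame({"Name": ["tom", "ann"], "Attempts": [10, 15], "Category": ["a", "a"]})
new_dum = pd.get_dummies(new)

# Add missing training columns as 0, drop columns unknown to the model,
# and put everything in the training order
aligned = new_dum.reindex(columns=train_columns, fill_value=0)
print(aligned)

Here Name_matt and Category_b are added as all-zero columns, while Name_ann is discarded, so the "ann" row ends up with zeros in every Name_ column.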

Edit:
One thing I forgot to add is that pd.get_dummies usually runs much faster than sklearn's OneHotEncoder.

Regarding "python - How to run a model on new data that requires pd.get_dummies", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62240050/
