python - MSE error when using a pipeline

Reposted. Author: 行者123. Updated: 2023-12-04 08:26:42

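I'm trying to predict some prices from a dataset I scraped. I have never used Python for this (I usually use tidyverse, but this time I wanted to explore pipelines).
So here is the code snippet: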

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/norhther/idealista/main/idealistaBCN.csv")
df.drop("info", axis = 1, inplace = True)
df["floor"] = df["floor"].fillna(1)  # assign back instead of chained inplace, which newer pandas versions warn about
df.drop("neigh", axis = 1, inplace = True)
df.dropna(inplace = True)
df = df[df["habs"] < 11]
X = df.drop("price", axis = 1)
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
ct = ColumnTransformer(
    [("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
     ("onehot", OneHotEncoder(), ["type"])],
    remainder="passthrough")

pipe = Pipeline(steps=[("Transformer", ct),
                       ("svr", SVR())])

param_grid = {
    "svr__kernel": ['linear', 'poly', 'rbf', 'sigmoid'],
    "svr__degree": range(3, 6),
    "svr__gamma": ['scale', 'auto'],
    "svr__coef0": np.linspace(0.01, 1, 2)
}

search = GridSearchCV(pipe, param_grid, scoring = ['neg_mean_squared_error'], refit='neg_mean_squared_error')

search.fit(X_train, y_train)
print(search.best_score_)

pipe = Pipeline(steps=[("Transformer", ct),
                       ("svr", SVR(coef0=search.best_params_["svr__coef0"],
                                   degree=search.best_params_["svr__degree"],
                                   kernel=search.best_params_["svr__kernel"]))])

from sklearn.metrics import mean_squared_error

pipe.fit(X_train, y_train)
preds = pipe.predict(X_train)
mean_squared_error(preds, y_train)
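search.best_score_ is -443829697806.1671, and the MSE is 608953977916.3896. I think I messed something up, maybe the transformer, but I'm not entirely sure. This looks like an exaggerated MSE to me. I did something very similar with tidymodels and got much better results.
So what I'd like to know is whether something is wrong with the transformer, or whether the model is simply this bad.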

Best answer

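The reason is that you did not include C in your parameter grid; you need to search over a whole range of C values to get a fit. If we fit with the default C = 1, you can see the problem: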

import matplotlib.pyplot as plt
o = pipe.named_steps["Transformer"].fit_transform(X_train)
mdl = SVR(C=1)
mdl.fit(o,y_train)
plt.scatter(mdl.predict(o),y_train)
[Figure: scatter plot of the SVR(C=1) predictions against y_train]
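Some of the price values are 10 times the average (1e7 compared with a median of 5e5). If you use MSE or r^2, these metrics will be driven largely by those extreme values. So we need the model to follow the data more closely, which is controlled by C, which you can read more about here. Let's try a range: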
ct = ColumnTransformer(
    [("standardScaler", StandardScaler(), ["habs", "m2", "floor"]),
     ("onehot", OneHotEncoder(), ["type"])],
    remainder="passthrough")

pipe = Pipeline(steps=[("Transformer", ct),
                       ("svr", SVR())])

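# only the rbf kernel is searched here; the other kernels from the question's grid are left out
param_grid = {
    "svr__kernel": ['rbf'],
    "svr__gamma": ['auto'],
    "svr__coef0": [1, 2],  # coef0 only matters for 'poly' and 'sigmoid', so it has no effect with rbf
    "svr__C": [1e-03, 1e-01, 1e1, 1e3, 1e5, 1e7]
}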

search = GridSearchCV(pipe, param_grid, scoring = ['neg_mean_squared_error'],
refit='neg_mean_squared_error')

search.fit(X_train, y_train)
print(search.best_score_)
-132061065775.25969
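To illustrate why C matters so much at this target scale, here is a minimal sketch on synthetic data (not the idealista dataset; the single feature, the price-like target, and the three C values are assumptions chosen for illustration). In epsilon-SVR every dual coefficient is bounded by C, and an RBF kernel never exceeds 1, so with n training points the prediction cannot move more than n * C away from the intercept. With targets in the hundreds of thousands, C = 1 therefore gives an almost flat prediction.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-2, 2, size=(n, 1))                   # one standardized-looking feature (synthetic)
y = 5e5 + 2e5 * X[:, 0] + rng.normal(0, 2e4, size=n)  # targets on a price-like scale (assumed)

spread = {}
for C in [1, 1e3, 1e6]:
    pred = SVR(kernel="rbf", gamma="auto", C=C).fit(X, y).predict(X)
    spread[C] = pred.max() - pred.min()               # how far the predictions can move at all
    print(f"C={C:g}: prediction range = {spread[C]:.1f}, "
          f"target range = {y.max() - y.min():.1f}")
```

With small C the prediction range stays tiny compared with the target range, which matches the nearly constant predictions seen with the default C = 1.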
Your y values are large, and the MSE will be on the scale of the variance of your y values, so if we check:
y_train.var()
545423126823.4545

132061065775.25969 / y_train.var()
0.24212590057261346
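Dividing the MSE by the variance of y is closely related to R^2: when both are computed on the same data with the population variance (ddof=0), MSE / var(y) equals 1 - R^2. The sketch below checks this on synthetic numbers (the values and noise level are assumptions, not the real data). The ratio above only approximates this, since it mixes a cross-validated MSE with the training-set variance, and pandas' `var()` uses ddof=1 by default.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)
y_true = rng.normal(5e5, 7e5, size=1000)            # synthetic "prices" (assumed scale)
y_pred = y_true + rng.normal(0, 3.5e5, size=1000)   # hypothetical imperfect predictions

mse = mean_squared_error(y_true, y_pred)
ratio = mse / np.var(y_true)                        # np.var uses ddof=0 (population variance)
r2 = r2_score(y_true, y_pred)
print(ratio, 1 - r2)                                # the two values agree
```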
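That is fine: you have reduced the MSE to about 25% of the variance. We can check this against the test data, and I guess in this case we are quite lucky that the C value works so well: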
from sklearn.metrics import mean_squared_error

o = pipe.named_steps["Transformer"].fit_transform(X_train)
mdl = SVR(C=10000000.0, coef0=1, gamma='auto')
mdl.fit(o,y_train)

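# reuse the transformer fitted on the training data; refitting it on X_test would rescale the test set with its own statistics
o_test = pipe.named_steps["Transformer"].transform(X_test)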

pred = mdl.predict(o_test)
print( mean_squared_error(pred,y_test) , mean_squared_error(pred,y_test)/y_test.var())
plt.scatter(mdl.predict(o_test),y_test)
[Figure: scatter plot of the predictions on the test set against y_test]
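The manual transform-then-predict steps above can also be left to the pipeline itself. When `GridSearchCV` refits (the default for a single scoring metric), it exposes `best_estimator_`, the whole pipeline refit on the training data with the best parameters, so calling `predict` on it applies the already-fitted transformer to new rows. The sketch below uses a small synthetic data frame; the column values, the category labels, and the reduced C grid are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the scraped data (hypothetical values and labels)
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "m2": rng.uniform(30, 200, n),
    "habs": rng.integers(1, 6, n),
    "type": rng.choice(["piso", "atico", "duplex"], n),
})
y = 3000 * df["m2"] + 20000 * df["habs"] + rng.normal(0, 20000, n)

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42)

ct = ColumnTransformer([
    ("scale", StandardScaler(), ["m2", "habs"]),
    # handle_unknown="ignore" avoids errors on categories unseen during fit
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["type"]),
])
pipe = Pipeline([("Transformer", ct), ("svr", SVR(kernel="rbf", gamma="auto"))])

search = GridSearchCV(pipe, {"svr__C": [1e1, 1e3, 1e5]},
                      scoring="neg_mean_squared_error", cv=3)
search.fit(X_train, y_train)

# best_estimator_ is the full pipeline, already refit on X_train
best = search.best_estimator_
test_mse = mean_squared_error(y_test, best.predict(X_test))
print(search.best_params_, test_mse, test_mse / y_test.var())
```

This keeps the scaler and encoder fitted on the training rows only, and it carries over every tuned parameter without rebuilding the SVR by hand.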

Regarding python - MSE error when using a pipeline, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65223473/
