
tensorflow - Deep learning regression: huge mean squared error and loss

Reposted. Author: 行者123. Updated: 2023-11-30 08:45:14

I am trying to train a model to predict car prices. The dataset is from Kaggle: https://www.kaggle.com/vfsousas/autos#autos.csv

I prepare the data with the following code:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# DataSet is assumed to be a base class from the asker's own project; it is not shown in the question.
class CarDataset(DataSet):

    def __init__(self, csv_file):
        # Drop columns that are not used as features.
        df = pd.read_csv(csv_file).drop(["dateCrawled", "name", "abtest", "dateCreated", "nrOfPictures", "postalCode", "lastSeen"], axis = 1)

        # Remove commercial sellers ("gewerblich") and wanted ads ("Gesuch"), then drop both columns.
        df = df.drop(df[df["seller"] == "gewerblich"].index).drop(["seller"], axis = 1)
        df = df.drop(df[df["offerType"] == "Gesuch"].index).drop(["offerType"], axis = 1)

        # Keep only rows where the categorical columns are filled in.
        df = df[df["vehicleType"].notnull()]
        df = df[df["notRepairedDamage"].notnull()]
        df = df[df["model"].notnull()]
        df = df[df["fuelType"].notnull()]

        # Filter out implausible values.
        df = df[(df["price"] > 100) & (df["price"] < 100000)]
        df = df[(df["monthOfRegistration"] > 0) & (df["monthOfRegistration"] < 13)]
        df = df[(df["yearOfRegistration"] < 2019) & (df["yearOfRegistration"] > 1950)]
        df = df[(df["powerPS"] > 20) & (df["powerPS"] < 550)]

        # Derived features. Note that "automatic" is 1 for a manual ("manuell") gearbox,
        # and "fuel" is 0 for petrol ("benzin") and 1 for everything else.
        df["hasDamage"] = np.where(df["notRepairedDamage"] == "ja", 1, 0)
        df["automatic"] = np.where(df["gearbox"] == "manuell", 1, 0)
        df["fuel"] = np.where(df["fuelType"] == "benzin", 0, 1)
        df["age"] = (2019 - df["yearOfRegistration"]) * 12 + df["monthOfRegistration"]

        df = df.drop(["notRepairedDamage", "gearbox", "fuelType", "yearOfRegistration", "monthOfRegistration"], axis = 1)

        # One-hot encode the remaining categorical columns.
        df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"])

        self.df = df
        self.Y = self.df["price"].values
        self.X = self.df.drop(["price"], axis = 1).values

        # Standardize all feature columns.
        scaler = StandardScaler()
        scaler.fit(self.X)

        self.X = scaler.transform(self.X)

        # Split off 25% as the test set ...
        self.x_train, self.x_test, self.y_train, self.y_test = train_test_split(self.X,
                                                                                self.Y,
                                                                                test_size = 0.25,
                                                                                random_state = 0)

        # ... and 25% of the remainder as the validation set.
        self.x_train, self.x_valid, self.y_train, self.y_valid = train_test_split(self.x_train,
                                                                                  self.y_train,
                                                                                  test_size = 0.25,
                                                                                  random_state = 0)

    def get_input_shape(self):
        return (len(self.df.columns)-1, ) # (303, )

This results in the following prepared dataset:

    price  powerPS  kilometer  hasDamage  automatic  fuel  age  vehicleType_andere  vehicleType_bus  vehicleType_cabrio  vehicleType_coupe  ...  brand_rover  brand_saab  brand_seat  brand_skoda  brand_smart  brand_subaru  brand_suzuki  brand_toyota  brand_trabant  brand_volkswagen  brand_volvo
 3   1500       75     150000          0          1     0  222                   0                0                   0                  0  ...            0           0           0            0            0             0             0             0              0                 1            0
 4   3600       69      90000          0          1     1  139                   0                0                   0                  0  ...            0           0           0            1            0             0             0             0              0                 0            0
 5    650      102     150000          1          1     0  298                   0                0                   0                  0  ...            0           0           0            0            0             0             0             0              0                 0            0
 6   2200      109     150000          0          1     0  188                   0                0                   1                  0  ...            0           0           0            0            0             0             0             0              0                 0            0
10   2000      105     150000          0          1     0  192                   0                0                   0                  0  ...            0           0           0            0            0             0             0             0              0                 0            0

[5 rows x 304 columns]

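hasDamage is a flag (0 or 1) indicating whether the car has unrepaired damage
automatic is a flag (0 or 1) for the gearbox type; per the code above it is 1 for a manual (manuell) gearbox and 0 otherwise
fuel is 0 for petrol (benzin) and 1 for any other fuel type, per the code above
age is the age of the car in months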
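
The derived columns can be reproduced for a single listing with a small standalone sketch. The helper name `derive_features` and the sample record are hypothetical and chosen only for illustration; the rules mirror the preparation code (reference year 2019, `manuell` mapped to 1, `benzin` mapped to 0).

```python
REFERENCE_YEAR = 2019  # same constant the preparation code uses


def derive_features(record):
    """Map one raw listing (a dict keyed by the dataset's column names) to the derived columns."""
    return {
        "hasDamage": 1 if record["notRepairedDamage"] == "ja" else 0,
        # Naming caveat: the flag is 1 for a manual ("manuell") gearbox.
        "automatic": 1 if record["gearbox"] == "manuell" else 0,
        "fuel": 0 if record["fuelType"] == "benzin" else 1,
        "age": (REFERENCE_YEAR - record["yearOfRegistration"]) * 12
               + record["monthOfRegistration"],
    }


# Hypothetical listing, not taken from the dataset.
sample = {
    "notRepairedDamage": "nein",
    "gearbox": "manuell",
    "fuelType": "benzin",
    "yearOfRegistration": 2001,
    "monthOfRegistration": 6,
}

features = derive_features(sample)
print(features)
```

Because the month is added rather than subtracted, "age" is only an approximate age in months; for ranking cars by age this probably matters little.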

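brand, model and vehicleType are one-hot encoded using df = pd.get_dummies(df, columns = ["vehicleType", "model", "brand"])

In addition, I use a StandardScaler to transform the X values.

The dataset now contains 303 columns for X, and Y is of course the "price" column.

With this dataset, a regular LinearRegression achieves a score of about 0.7 on both the training and the test set.

Now I have tried a deep learning approach with Keras, but whatever I do, the MSE and the loss are extremely large and the model does not seem to learn anything: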

from keras.callbacks import EarlyStopping, ReduceLROnPlateau
from keras.layers import Activation, Dense, Input
from keras.models import Model

# optimizer, learning_rate and self.verbose come from the asker's surrounding code, which is not shown.
input_tensor = model_stack = Input(dataset.get_input_shape()) # (303, )
model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_1")(model_stack)

model_stack = Dense(20)(model_stack)
model_stack = Activation("relu", name = "relu_2")(model_stack)

# Single linear output unit for the predicted price.
model_stack = Dense(1, name = "Output")(model_stack)

model = Model(inputs = [input_tensor], outputs = [model_stack])
model.compile(loss = "mse", optimizer = optimizer(lr = learning_rate), metrics = ['mse'])

model.summary()

callbacks = []
# Multiply the learning rate by 0.95 whenever val_loss stops improving for one epoch.
callbacks.append(ReduceLROnPlateau(monitor = "val_loss", factor = 0.95, verbose = self.verbose, patience = 1))
# Stop after 5 epochs without improvement and restore the best weights.
callbacks.append(EarlyStopping(monitor='val_loss', patience = 5, min_delta = 0.01, restore_best_weights = True, verbose = self.verbose))


model.fit(x = dataset.x_train,
          y = dataset.y_train,
          verbose = 1,
          batch_size = 128,
          epochs = 200,
          validation_data = (dataset.x_valid, dataset.y_valid), # tuple of (inputs, targets)
          callbacks = callbacks)

score = model.evaluate(dataset.x_test, dataset.y_test, verbose = 1)
print("Model score: {}".format(score))

The summary and training output look like this (learning rate 3e-4):

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 6) 0
_________________________________________________________________
dense_1 (Dense) (None, 20) 140
_________________________________________________________________
relu_1 (Activation) (None, 20) 0
_________________________________________________________________
dense_2 (Dense) (None, 20) 420
_________________________________________________________________
relu_2 (Activation) (None, 20) 0
_________________________________________________________________
Output (Dense) (None, 1) 21
=================================================================
Total params: 581
Trainable params: 581
Non-trainable params: 0
_________________________________________________________________
Train on 182557 samples, validate on 60853 samples
Epoch 1/200
182557/182557 [==============================] - 2s 13us/step - loss: 110046953.4602 - mean_squared_error: 110046953.4602 - acc: 0.0000e+00 - val_loss: 107416331.4062 - val_mean_squared_error: 107416331.4062 - val_acc: 0.0000e+00
Epoch 2/200
182557/182557 [==============================] - 2s 11us/step - loss: 97859920.3050 - mean_squared_error: 97859920.3050 - acc: 0.0000e+00 - val_loss: 85956634.8803 - val_mean_squared_error: 85956634.8803 - val_acc: 1.6433e-05
Epoch 3/200
182557/182557 [==============================] - 2s 12us/step - loss: 70531052.0493 - mean_squared_error: 70531052.0493 - acc: 2.1911e-05 - val_loss: 54933938.6787 - val_mean_squared_error: 54933938.6787 - val_acc: 3.2866e-05
Epoch 4/200
182557/182557 [==============================] - 2s 11us/step - loss: 42639802.3204 - mean_squared_error: 42639802.3204 - acc: 3.2866e-05 - val_loss: 32645940.6536 - val_mean_squared_error: 32645940.6536 - val_acc: 1.3146e-04
Epoch 5/200
182557/182557 [==============================] - 2s 11us/step - loss: 28282909.0699 - mean_squared_error: 28282909.0699 - acc: 1.4242e-04 - val_loss: 25315220.7446 - val_mean_squared_error: 25315220.7446 - val_acc: 9.8598e-05
Epoch 6/200
182557/182557 [==============================] - 2s 11us/step - loss: 24279169.5270 - mean_squared_error: 24279169.5270 - acc: 3.8344e-05 - val_loss: 23420569.2554 - val_mean_squared_error: 23420569.2554 - val_acc: 9.8598e-05
Epoch 7/200
182557/182557 [==============================] - 2s 11us/step - loss: 22874003.0459 - mean_squared_error: 22874003.0459 - acc: 9.8599e-05 - val_loss: 22380401.0622 - val_mean_squared_error: 22380401.0622 - val_acc: 1.6433e-05
...
Epoch 197/200
182557/182557 [==============================] - 2s 12us/step - loss: 13828827.1595 - mean_squared_error: 13828827.1595 - acc: 3.3414e-04 - val_loss: 14123447.1746 - val_mean_squared_error: 14123447.1746 - val_acc: 3.1223e-04

Epoch 00197: ReduceLROnPlateau reducing learning rate to 0.00020950120233464986.
Epoch 198/200
182557/182557 [==============================] - 2s 13us/step - loss: 13827193.5994 - mean_squared_error: 13827193.5994 - acc: 2.4102e-04 - val_loss: 14116898.8054 - val_mean_squared_error: 14116898.8054 - val_acc: 1.6433e-04

Epoch 00198: ReduceLROnPlateau reducing learning rate to 0.00019902614221791736.
Epoch 199/200
182557/182557 [==============================] - 2s 12us/step - loss: 13823582.4300 - mean_squared_error: 13823582.4300 - acc: 3.3962e-04 - val_loss: 14108715.5067 - val_mean_squared_error: 14108715.5067 - val_acc: 4.1083e-04
Epoch 200/200
182557/182557 [==============================] - 2s 11us/step - loss: 13820568.7721 - mean_squared_error: 13820568.7721 - acc: 3.1223e-04 - val_loss: 14106001.7681 - val_mean_squared_error: 14106001.7681 - val_acc: 2.3006e-04
60853/60853 [==============================] - 1s 18us/step
Model score: [14106001.790199332, 14106001.790199332, 0.00023006260989597883]
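
The learning rates printed by ReduceLROnPlateau near the end of the log can be checked against the configured factor of 0.95 and the stated initial learning rate of 3e-4. The sketch below is plain arithmetic; the helper name `lr_after` is hypothetical, and the small tolerance is an assumption to absorb the float32 rounding of the logged values.

```python
import math

initial_lr = 3e-4   # learning rate stated for this run
factor = 0.95       # ReduceLROnPlateau factor from the code above


def lr_after(n_reductions):
    """Learning rate after n multiplicative reductions by `factor`."""
    return initial_lr * factor ** n_reductions


# Values copied from the log lines for epochs 197 and 198.
logged_197 = 0.00020950120233464986
logged_198 = 0.00019902614221791736

# Solve initial_lr * factor**n = logged_197 for n.
reductions_at_197 = round(math.log(logged_197 / initial_lr) / math.log(factor))

print(reductions_at_197)
print(lr_after(reductions_at_197), lr_after(reductions_at_197 + 1))
```

Under these assumptions the logged values correspond to the seventh and eighth reduction of the run.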

I am still a beginner in machine learning. Is there any big or obvious mistake in my approach? What am I doing wrong?

Best answer

Solution

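So, after a while, I found the Kaggle link to the correct dataset. I was using https://www.kaggle.com/vfsousas/autos at first, but the same data is also available at https://www.kaggle.com/orgesleka/used-cars-database, together with 222 kernels to look at. Looking at https://www.kaggle.com/themanchanda/neural-network-approach showed that this person is also getting "big numbers" for the loss. That had been the main source of my confusion (so far I had only dealt with "smaller numbers" or "accuracy"), and it made me think again.

Then it became pretty clear to me:

  • The dataset was prepared correctly
  • The model was working correctly
  • I used the wrong metrics, and compared them against the score of sklearn's LinearRegression, which is a different metric and not comparable anyway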

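In short:

  • An MAE (mean absolute error) of around 2000 means that the predicted car price is off by 2000 euros on average (for example, the correct price is 10,000 euros and the model predicts something between 8,000 and 12,000 euros)
  • The MSE (mean squared error) is of course a much bigger number, which is to be expected; it is not the "garbage" or broken model output I initially took it for
  • The "accuracy" metric is meant for classification and is useless for regression
  • The default score function of sklearn's LinearRegression is the r2 score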
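
To make the scale difference concrete, here is a minimal pure-Python sketch. The prices are made up for illustration, and the helper names `mae` and `mse` are not taken from any library. It also takes the square root of the test MSE from the training log, which brings that number back to euros.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the error, in the target's unit."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared error, in the target's unit squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices in euros; every prediction is off by exactly 2000.
y_true = [10000, 5000, 20000, 3000]
y_pred = [8000, 7000, 18000, 5000]

print(mae(y_true, y_pred))             # 2000.0
print(mse(y_true, y_pred))             # 4000000.0
print(math.sqrt(mse(y_true, y_pred)))  # 2000.0 (RMSE, back in euros)

# Test MSE reported at the end of the training log in the question.
logged_test_mse = 14106001.790199332
logged_rmse = math.sqrt(logged_test_mse)
print(logged_rmse)
```

An error of 2000 euros already produces an MSE of four million, so an MSE of about 14 million corresponds to an RMSE of roughly 3756 euros rather than to a broken model.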

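So I changed the metrics to "mae" and a custom r2 implementation, so that I could compare the results with LinearRegression.
It turned out that, on the first try, after about 100 epochs I ended up with an MAE of 1900 and an r2 score of 0.69.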
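
The custom r2 implementation itself is not shown in the answer. As a minimal sketch of the standard definition, R² = 1 - SS_res / SS_tot, the pure-Python function below (hypothetical name `r2`, made-up data) also illustrates one caveat: a metric function in Keras is typically evaluated per batch and averaged, and for R² that average can differ noticeably from the value computed on the whole set.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices and predictions, for illustration only.
y_true = [1000, 3000, 10000, 30000]
y_pred = [2000, 2000, 12000, 28000]

# R² over the full set.
full = r2(y_true, y_pred)

# R² computed per batch of two samples, then averaged.
batches = [(y_true[:2], y_pred[:2]), (y_true[2:], y_pred[2:])]
batch_avg = sum(r2(t, p) for t, p in batches) / len(batches)

print(full, batch_avg)
```

For a comparison with LinearRegression it is therefore safer to compute R² once on the complete test set, for example from the model's predictions after training.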

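I then also computed the MAE of the LinearRegression for comparison; it came out at 2855.417 (with an r2 score of 0.67).

So the deep learning approach was in fact already better on both MAE and r2 score. Nothing was wrong, and I can now go on to tune and optimize the model :)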

Regarding "tensorflow - Deep learning regression: huge mean squared error and loss", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57266759/
