
python - Gradient descent with scikit-learn's SGDRegressor algorithm

Reposted. Author: 行者123. Updated: 2023-12-01 08:22:51

I am implementing gradient descent with scikit-learn's SGDRegressor algorithm on a rental dataset, to predict rent from area, but I am getting strange coefficients and intercept, and therefore strange rent predictions.

Rental dataset: rentals.csv (complete column listing):

area,bedrooms,furnished,cost
650,2,1,33000
750,3,0,55000
247,1,0,10500
1256,4,0,65000
900,3,0,37000
900,3,0,50000
550,2,0,30000
1700,4,0,72000
1300,4,0,45000
1600,4,2,57000
475,2,1,30000
800,3,0,45000
350,2,0,15000
247,1,0,11500
247,1,0,16500
247,1,0,15000
330,2,0,16000
450,2,2,25000
325,1,0,13500
1650,4,0,90000
650,2,0,31000
1650,4,0,65000
900,3,0,40000
635,2,0,30000
475,2,2,28000
1120,3,0,45000
1000,3,0,38000
900,3,2,50000
610,3,0,28000
400,2,0,17000

Python code, with alpha = .000001 and max_iter=1000:

import pandas
full_data = pandas.read_csv("./rentals.csv")
rentals = pandas.DataFrame({'area': full_data.area, 'cost': full_data.cost})

from sklearn.model_selection import train_test_split
train, test = train_test_split(rentals, test_size=0.2, random_state=11)

trainX = pandas.DataFrame({'area': train['area']})
trainY = pandas.DataFrame({'cost': train['cost']})
testX = pandas.DataFrame({'area': test['area']})
testY = pandas.DataFrame({'cost': test['cost']})

from sklearn.linear_model import SGDRegressor
reg = SGDRegressor(max_iter=1000, alpha=.000001, tol=.0001)

reg.fit(trainX, trainY)

from sklearn.metrics import mean_squared_error, r2_score

print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)

yhat = reg.predict(testX)

print('Mean squared error: \n', mean_squared_error(testY, yhat))
print('Variance score: \n', r2_score(testY, yhat))

print('yhat :: ', yhat)

Output:

Coefficients:
[-1.77569698e+12]
Intercept:
[2.20231032e+10]
Mean squared error:
2.7699546187784015e+30
Variance score:
-1.1843036374824519e+22
yhat :: [-4.38575131e+14 -2.30838405e+15 -9.76611316e+14 -1.77567496e+15
-2.23025338e+15 -1.42053556e+15]

With alpha = .00000001:

reg = SGDRegressor(max_iter=1000, alpha=.00000001, tol=.0001)

Output:

Coefficients:
[-1.35590231e+12]
Intercept:
[-9.70811558e+10]
Mean squared error:
1.6153367348228915e+30
Variance score:
-6.906427844848468e+21
yhat :: [-3.35004951e+14 -1.76277008e+15 -7.45843351e+14 -1.35599939e+15
-1.70311038e+15 -1.08481893e+15]

I have tried every value down to alpha = .00000000001:

reg = SGDRegressor(max_iter=1000, alpha=.00000000001, tol=.0001)

Output:

Coefficients:
[1.81827102e+12]
Intercept:
[8.5060188e+09]
Mean squared error:
2.9044685546452095e+30
Variance score:
-1.2418155340525837e+22
yhat :: [4.49121448e+14 2.36376083e+15 1.00005757e+15 1.81827952e+15
2.28375691e+15 1.45462532e+15]

Can you point out what is incorrect in my code? Why am I getting these wrong values?

Thanks in advance.

Best answer

There is nothing obviously wrong with the code. Interestingly, if we replace SGDRegressor with a plain LinearRegression, the result looks fine (coef ≈ 40, r2_score ≈ 0.7). So there must be something in the data that stochastic gradient descent does not like.
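As a quick check, here is a minimal sketch of that LinearRegression comparison. To keep it self-contained, the area/cost values from rentals.csv above are entered inline instead of read from the file:

```python
import pandas
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same area/cost values as rentals.csv above, entered inline.
rentals = pandas.DataFrame({
    'area': [650, 750, 247, 1256, 900, 900, 550, 1700, 1300, 1600,
             475, 800, 350, 247, 247, 247, 330, 450, 325, 1650,
             650, 1650, 900, 635, 475, 1120, 1000, 900, 610, 400],
    'cost': [33000, 55000, 10500, 65000, 37000, 50000, 30000, 72000, 45000, 57000,
             30000, 45000, 15000, 11500, 16500, 15000, 16000, 25000, 13500, 90000,
             31000, 65000, 40000, 30000, 28000, 45000, 38000, 50000, 28000, 17000],
})

train, test = train_test_split(rentals, test_size=0.2, random_state=11)

# Ordinary least squares has a closed-form solution, so the scale of the
# features does not affect convergence the way it does for SGD.
ols = LinearRegression()
ols.fit(train[['area']], train['cost'])

yhat = ols.predict(test[['area']])
print('Coefficient:', ols.coef_[0])  # roughly 40, per the comparison above
print('Variance score:', r2_score(test['cost'], yhat))
```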

What I believe is happening is that, because the data is on such a large scale, the gradients become too large and the algorithm diverges.
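To see why, consider a toy illustration: one manual SGD step on a single unscaled sample, using a fixed learning rate of 0.01 (the same order as SGDRegressor's default eta0). With area ≈ 900 and cost ≈ 37000, the gradient of the squared loss is already in the tens of millions, so the first update overshoots wildly and every subsequent step makes it worse:

```python
# One-feature SGD on a single unscaled sample: x = area, y = cost.
x, y = 900.0, 37000.0
w, lr = 0.0, 0.01  # fixed learning rate, roughly SGDRegressor's default eta0

for step in range(3):
    grad = -2 * x * (y - w * x)  # d/dw of the squared loss (y - w*x)**2
    w -= lr * grad
    print(f'step {step}: grad = {grad:.3e}, w = {w:.3e}')

# Instead of settling near the sensible slope (~40), |w| explodes:
# each step multiplies it by roughly 2 * lr * x**2 = 16200.
```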

We can verify this by setting a low learning rate, forcing the algorithm to take small steps even when the gradient is high:

reg = SGDRegressor(max_iter=1000, alpha=.000001, tol=.0001, learning_rate='constant', eta0=1e-7)

# Coefficients: [46.75739932]
# Intercept: [0.11470854]
# Mean squared error: 75520077.45401965
# Variance score: 0.6771113077975406

That looks much better, but it may not be an ideal solution, since with such a low learning rate training can take a very long time on a large dataset. Instead, let's do something about the scale of the data:

mu_x = trainX.mean().values
mu_y = trainY.mean().values
scale_x = trainX.std().values
scale_y = trainY.std().values

trainX = (trainX - mu_x) / scale_x
trainY = (trainY - mu_y) / scale_y

reg = SGDRegressor(max_iter=1000, alpha=.000001, tol=.0001)

reg.fit(trainX, trainY)

yhat = reg.predict((testX - mu_x) / scale_x) * scale_y + mu_y

# Coefficients: [0.89319654]
# Intercept: [0.00064678]
# Mean squared error: 59575772.471740596
# Variance score: 0.7452817328999215

Centering and rescaling the data helps a lot. scikit-learn also has StandardScaler, but I like showing the manual approach to illustrate what is going on.
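For completeness, a minimal sketch of the StandardScaler route (again with the rentals data entered inline to keep it self-contained; one scaler for the feature and one for the target, each fit on the training split only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score

# Same area/cost values as rentals.csv above.
area = np.array([650, 750, 247, 1256, 900, 900, 550, 1700, 1300, 1600,
                 475, 800, 350, 247, 247, 247, 330, 450, 325, 1650,
                 650, 1650, 900, 635, 475, 1120, 1000, 900, 610, 400], dtype=float)
cost = np.array([33000, 55000, 10500, 65000, 37000, 50000, 30000, 72000, 45000, 57000,
                 30000, 45000, 15000, 11500, 16500, 15000, 16000, 25000, 13500, 90000,
                 31000, 65000, 40000, 30000, 28000, 45000, 38000, 50000, 28000, 17000],
                dtype=float)

trainX, testX, trainY, testY = train_test_split(
    area.reshape(-1, 1), cost.reshape(-1, 1), test_size=0.2, random_state=11)

# Fit the scalers on the training split only, then reuse them on the test split.
x_scaler = StandardScaler().fit(trainX)
y_scaler = StandardScaler().fit(trainY)

reg = SGDRegressor(max_iter=1000, alpha=.000001, tol=.0001, random_state=0)
reg.fit(x_scaler.transform(trainX), y_scaler.transform(trainY).ravel())

# Undo the y-scaling to get predictions back in the original cost units.
yhat = y_scaler.inverse_transform(
    reg.predict(x_scaler.transform(testX)).reshape(-1, 1))
print('Variance score:', r2_score(testY, yhat))
```

scikit-learn's TransformedTargetRegressor can also wrap the target-scaling step, so the inverse transform happens automatically inside predict.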

On python - Gradient descent with scikit-learn's SGDRegressor algorithm, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54512622/
