
python - Manual split vs. scikit-learn grid search

Reposted. Author: 太空狗. Updated: 2023-10-29 21:23:56

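I'm puzzled by seemingly very different results when I split the data between a training set and a test set "by hand" versus using scikit-learn's grid search. In both runs I use the evaluation function from a Kaggle competition, and the grid search covers a single parameter value (the same value as in the manual split). The resulting Gini values are so different that there must be a mistake somewhere, but I can't see it. Am I overlooking something in the comparison?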

The first code block gives a Gini of only "Validation Sample Score: 0.0033997889 (normalized gini)."

The second block (using scikit-learn) produces much higher values:

Fitting 2 folds for each of 1 candidates, totalling 2 fits
0.334467621189
0.339421569449
[Parallel(n_jobs=-1)]: Done 3 out of 2 | elapsed: 9.9min remaining: -198.0s
[Parallel(n_jobs=-1)]: Done 2 out of 2 | elapsed: 9.9min finished
{'n_estimators': 1000}
0.336944643888
[mean: 0.33694, std: 0.00248, params: {'n_estimators': 1000}]

Evaluation function:

from sklearn import metrics

def gini(solution, submission):
    # Sort (actual, predicted) pairs by prediction, highest first.
    df = zip(solution, submission)
    df = sorted(df, key=lambda x: (x[1],x[0]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1,len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    # Scale by the Gini of a perfect ordering, so a perfect model scores 1.
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    print(normalized_gini)
    return normalized_gini


gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better = True)
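
As a quick sanity check of the metric itself, the sketch below restates the two functions above for Python 3 (without the print call) and evaluates them on a made-up four-element target. The toy values are assumptions for illustration only. The point is that a perfect ranking should score 1 and a fully reversed ranking -1, so a value near 0 suggests predictions whose order is essentially unrelated to the targets.

```python
def gini(solution, submission):
    # Sort (actual, predicted) pairs by prediction, highest first.
    pairs = sorted(zip(solution, submission), key=lambda p: (p[1], p[0]), reverse=True)
    n = len(pairs)
    total = float(sum(actual for actual, _ in pairs))
    cumulative, score = 0.0, 0.0
    for i, (actual, _) in enumerate(pairs):
        cumulative += actual
        # Lorenz-curve value minus the value expected under a random ordering.
        score += cumulative / total - (i + 1) / n
    return score


def normalized_gini(solution, submission):
    # Divide by the Gini of a perfect ordering.
    return gini(solution, submission) / gini(solution, solution)


actual = [1.0, 2.0, 3.0, 4.0]  # toy targets, assumed for illustration
perfect = normalized_gini(actual, actual)
reversed_order = normalized_gini(actual, actual[::-1])
print(perfect, reversed_order)
```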

block 1:

if __name__ == '__main__':

    dat=pd.read_table('train.csv',sep=",")

    y=dat[['Hazard']].values.ravel()
    dat=dat.drop(['Hazard','Id'],axis=1)

    #sample out 30% for validation
    folds=train_test_split(range(len(y)),test_size=0.3) #30% test
    train_X=dat.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=dat.iloc[folds[1],:]
    test_y=y[folds[1]]

    #assume no leakage by OH whole data
    dat_dict=train_X.T.to_dict().values()
    vectorizer = DV( sparse = False )
    vectorizer.fit( dat_dict )
    train_X = vectorizer.transform( dat_dict )

    del dat_dict

    dat_dict=test_X.T.to_dict().values()
    test_X = vectorizer.transform( dat_dict )

    del dat_dict

    rf=RandomForestRegressor(n_estimators=1000, n_jobs=-1)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    print("Validation Sample Score: %.10f (normalized gini)." % normalized_gini(test_y,y_submission))

block 2:

dat_dict=dat.T.to_dict().values()
vectorizer = DV( sparse = False )
vectorizer.fit( dat_dict )
X = vectorizer.transform( dat_dict )

parameters= {'n_estimators': [1000]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters,cv=2, verbose=1, scoring=gini_scorer,n_jobs=-1)
grid_search.fit(X,y)

print(grid_search.best_params_)
print(grid_search.best_score_)
print(grid_search.grid_scores_)

Edit

Here is a self-contained example in which I get the same kind of discrepancy.

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston




if __name__ == '__main__':

    b=load_boston()
    X = pd.DataFrame(b.data)
    y = b.target

    #sample out 50% for validation
    folds=train_test_split(range(len(y)),test_size=0.5) #50% test
    train_X=X.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=X.iloc[folds[1],:]
    test_y=y[folds[1]]

    rf=RandomForestRegressor(n_estimators=1000, n_jobs=-1)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)

    print("Validation Sample Score: %.10f (mean squared)." % mean_squared_error(test_y,y_submission))

    parameters= {'n_estimators': [1000]}
    grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters,cv=2, verbose=1, scoring='mean_squared_error',n_jobs=-1)
    grid_search.fit(X,y)

    print(grid_search.best_params_)
    print(grid_search.best_score_)
    print(grid_search.grid_scores_)

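Best answer

I'm not sure I can give you a complete solution, but here are a few pointers:

  1. Use the random_state parameter of scikit-learn objects when debugging this kind of problem, since it makes your results truly reproducible. The following will always return exactly the same number: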

    rf=RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=0)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    mean_squared_error(test_y,y_submission)

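It resets the random number generator so that you always get "the same randomness". You should use it for train_test_split and GridSearchCV as well (GridSearchCV has no random_state argument of its own, so pass it through the estimator's parameters or the CV splitter).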
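
A minimal sketch of that reproducibility check, under some assumptions that are not in the original: it uses synthetic data from make_regression instead of the Boston data, a smaller forest to keep it fast, and the sklearn.model_selection module of recent scikit-learn releases rather than sklearn.cross_validation. The helper name holdout_mse is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def holdout_mse(seed):
    """Fit on one half, score on the other, with every source of randomness seeded."""
    # Synthetic stand-in data (an assumption, not the data from the question).
    X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=seed)
    rf = RandomForestRegressor(n_estimators=50, random_state=seed)
    rf.fit(X_tr, y_tr)
    return mean_squared_error(y_te, rf.predict(X_te))


first = holdout_mse(seed=0)
second = holdout_mse(seed=0)
print(first, second, first == second)
```

Dropping either random_state argument should make the two numbers drift apart from run to run, which is the noise that makes this kind of comparison hard to debug.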

  2. The results you get in the standalone example are normal. Typically I get:

    Validation Sample Score: 9.8136434847 (mean squared).
    [mean: -22.38918, std: 11.56372, params: {'n_estimators': 1000}]

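First, note that the mean squared error returned by GridSearchCV is a negated mean squared error. I believe this is by design, to keep to the spirit of a score function (for a score, greater is better).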
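
The sign convention can be seen with a short sketch. It assumes a recent scikit-learn release, where the scorer string is "neg_mean_squared_error" (the 'mean_squared_error' string used in this thread is the older name), and it uses synthetic data and a small forest purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the data from the question).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# Scorers follow "greater is better", so the MSE comes back negated.
neg_mse = cross_val_score(model, X, y, cv=2, scoring="neg_mean_squared_error")

# Flip the sign to compare against mean_squared_error from a manual split.
mse = -neg_mse
print("per-fold MSE:", mse, "mean:", mse.mean(), "std:", mse.std())
```

Printing the per-fold values alongside the mean also shows how far apart two folds can be, which matters when the mean is compared with a single manual split.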

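That still leaves 9.81 versus 22.38. However, the standard deviation here is huge (11.56 across only two folds), which can explain why the scores look so different. If you want to check that GridSearchCV isn't doing anything fishy, you can force it to use a single split, the same one as your manual split: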

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

if __name__ == '__main__':
    b=load_boston()
    X = pd.DataFrame(b.data)
    y = b.target
    folds=train_test_split(range(len(y)),test_size=0.5, random_state=15) #50% test
    folds_split = np.ones_like(y)
    folds_split[folds[0]] = -1
    ps = PredefinedSplit(folds_split)

    for tr, te in ps:
        train_X=X.iloc[tr,:]
        train_y=y[tr]
        test_X=X.iloc[te,:]
        test_y=y[te]
        rf=RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
        rf.fit(train_X,train_y)
        y_submission=rf.predict(test_X)
        print("Validation Sample Score: {:.10f} (mean squared).".format(mean_squared_error(test_y, y_submission)))

    parameters= {'n_estimators': [1000], 'n_jobs': [1], 'random_state': [15]}
    grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters,cv=ps, verbose=2, scoring='mean_squared_error', n_jobs=1)
    grid_search.fit(X,y)

    print("best_params: ", grid_search.best_params_)
    print("best_score", grid_search.best_score_)
    print("grid_scores", grid_search.grid_scores_)
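
For readers on a recent scikit-learn release, the same single-split check might look roughly like the sketch below. It is not the answer's code: it assumes sklearn.model_selection (where PredefinedSplit exposes split() instead of being iterable), the "neg_mean_squared_error" scorer name, synthetic data in place of load_boston, and a smaller forest to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic stand-in data (an assumption, not the data from the question).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Mark the manual test half as fold 0; -1 means "always in the training set".
_, test_idx = train_test_split(np.arange(len(y)), test_size=0.5, random_state=15)
test_fold = np.full(len(y), -1)
test_fold[test_idx] = 0
ps = PredefinedSplit(test_fold)

# Manual evaluation on exactly the indices the splitter yields, in the same order.
tr, te = next(ps.split())
rf = RandomForestRegressor(n_estimators=50, random_state=15)
rf.fit(X[tr], y[tr])
manual_mse = mean_squared_error(y[te], rf.predict(X[te]))

# Grid search restricted to the same single split and the same seed.
grid = GridSearchCV(
    RandomForestRegressor(),
    param_grid={"n_estimators": [50], "random_state": [15]},
    cv=ps,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
grid_mse = -grid.best_score_

print(manual_mse, grid_mse)
```

Taking the training indices from the splitter rather than from train_test_split is deliberate: the forest's bootstrap samples depend on row order, so the manual fit only matches the grid search if both see the rows in the same order.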

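Hope this helps.

Sorry, I couldn't figure out what is going on with your Gini scorer. I would say that 0.0033xxx looks like a very low value for a normalized Gini score (hardly a model at all?).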
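
One possibility worth ruling out, offered only as an unconfirmed guess and not as something the thread establishes, is row misalignment in block 1. There the feature rows pass through a plain dict (`.T.to_dict().values()`) after a shuffled split, and on Python 2, which the question's print statements suggest, plain dicts do not guarantee iteration order, so the rows may no longer line up with train_y and test_y. A sketch of an order-preserving alternative, using pandas' `to_dict(orient="records")` and hypothetical toy columns:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy frame; the shuffled index mimics the rows left after a random split.
frame = pd.DataFrame(
    {"age": [30, 41, 25, 58], "region": ["n", "s", "s", "e"]},
    index=[3, 0, 2, 1],
)

# A list of per-row dicts, in the frame's row order (no dict keyed by row index).
records = frame.to_dict(orient="records")

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)

# The numeric column keeps the frame's row order after vectorizing.
age_col = vectorizer.vocabulary_["age"]
print(X[:, age_col])
```

If the manual-split score changes markedly once the rows are built this way, misalignment was the likely culprit. If it does not, the cause lies elsewhere.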

Regarding python - Manual split vs. scikit-learn grid search, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31387736/
