
python - Running a trained machine learning model on a different dataset


I am new to machine learning and am trying to run a simple classification model, which I trained and saved using pickle, on another dataset in the same format. I have the following Python code.
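For context, only the loading side appears in the code below; a minimal, self-contained sketch of the pickle save/load round trip such a model would go through might look like the following (the toy data, the RandomForestClassifier choice, and the local file name are illustrative assumptions, not the question's actual training code):

import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the real training features/labels (assumption)
X_toy = np.random.rand(20, 4)
y_toy = np.random.randint(0, 2, 20)

# Train a classifier and save it to disk with pickle
clf = RandomForestClassifier(random_state=42)
clf.fit(X_toy, y_toy)
with open('SOPmodel_RFC', 'wb') as f:
    pickle.dump(clf, f)

# Later, load the saved model back, as the question's code does
with open('SOPmodel_RFC', 'rb') as f:
    loaded_model = pickle.load(f)
print(loaded_model.score(X_toy, y_toy))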

Code

# Imports needed by this snippet (added for completeness; colored is
# assumed to come from the termcolor package)
import numpy as np
import pandas as pd
from sklearn import model_selection
from termcolor import colored

#Training set
features = pd.read_csv('../Data/Train_sop_Computed.csv')
#Testing set
testFeatures = pd.read_csv('../Data/Test_sop_Computed.csv')

print(colored('\nThe shape of our features is:','green'), features.shape)
print(colored('\nThe shape of our Test features is:','green'), testFeatures.shape)

features = pd.get_dummies(features)
testFeatures = pd.get_dummies(testFeatures)

features.iloc[:,5:].head(5)
testFeatures.iloc[:,5].head(5)

labels = np.array(features['Truth'])
testlabels = np.array(testFeatures['Truth'])

features= features.drop('Truth', axis = 1)
testFeatures = testFeatures.drop('Truth', axis = 1)

feature_list = list(features.columns)
testFeature_list = list(testFeatures.columns)

def add_missing_dummy_columns(d, columns):
    # Add, with value 0, any dummy column that exists in the reference
    # columns (the training features) but is missing from d
    missing_cols = set(columns) - set(d.columns)
    for c in missing_cols:
        d[c] = 0


def fix_columns(d, columns):
    add_missing_dummy_columns(d, columns)

    # make sure we have all the columns we need
    assert (set(columns) - set(d.columns) == set())

    extra_cols = set(d.columns) - set(columns)
    if extra_cols:
        print("extra columns:", extra_cols)

    # keep only the reference columns, in the same order as the training set
    d = d[columns]
    return d


testFeatures = fix_columns(testFeatures, features.columns)

features = np.array(features)
testFeatures = np.array(testFeatures)

train_samples = 100

X_train, X_test, y_train, y_test = model_selection.train_test_split(features, labels, test_size = 0.25, random_state = 42)
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)

print(colored('\n TRAINING SET','yellow'))
print(colored('\nTraining Features Shape:','magenta'), X_train.shape)
print(colored('Testing Features Shape:','magenta'), X_test.shape)
print(colored('Training Labels Shape:','magenta'), y_train.shape)
print(colored('Testing Labels Shape:','magenta'), y_test.shape)

print(colored('\n TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), testX_train.shape)
print(colored('Testing Features Shape:','magenta'), textX_test.shape)
print(colored('Training Labels Shape:','magenta'), testy_train.shape)
print(colored('Testing Labels Shape:','magenta'), testy_test.shape)

from sklearn.metrics import precision_recall_fscore_support

import pickle

loaded_model_RFC = pickle.load(open('../other/SOPmodel_RFC', 'rb'))
result_RFC = loaded_model_RFC.score(textX_test, testy_test)
print(colored('Random Forest Classifier: ','magenta'),result_RFC)

loaded_model_SVC = pickle.load(open('../other/SOPmodel_SVC', 'rb'))
result_SVC = loaded_model_SVC.score(textX_test, testy_test)
print(colored('Support Vector Classifier: ','magenta'),result_SVC)

loaded_model_GPC = pickle.load(open('../other/SOPmodel_Gaussian', 'rb'))
result_GPC = loaded_model_GPC.score(textX_test, testy_test)
print(colored('Gaussian Process Classifier: ','magenta'),result_GPC)

loaded_model_SGD = pickle.load(open('../other/SOPmodel_SGD', 'rb'))
result_SGD = loaded_model_SGD.score(textX_test, testy_test)
print(colored('Stochastic Gradient Descent: ','magenta'),result_SGD)

I am able to get results for the test set.

But the problem I am facing is that I need to run the model on the entire Test_sop_Computed.csv dataset, whereas it is only being run on the portion of that dataset that I split off. I would sincerely appreciate any suggestions on how I can run the loaded models on the entire dataset. I know that I'm going wrong with the following line of code.

testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)

Both the training and testing datasets have the features Subject, Predicate, Object, Computed and Truth, with Truth as the class to predict. The testing dataset has actual values for this Truth column; I drop it with testFeatures = testFeatures.drop('Truth', axis = 1) and intend to use the various loaded classifier models to predict the Truth of the entire dataset as 0 or 1, and then obtain the predictions as an array.

I have gotten this far, but I think I am also splitting my test dataset. Is there a way to pass in the entire test dataset, even though it is in a separate file?

This test dataset is in the same format as the training set. I checked the shapes of both and got the following.

Confirming the features and shapes

Shape of the Train features is: (1860, 5)
Shape of the Test features is: (1386, 5)

TRAINING SET

Training Features Shape: (1395, 1045)
Testing Features Shape: (465, 1045)
Training Labels Shape: (1395,)
Testing Labels Shape: (465,)

TEST SETS

Training Features Shape: (1039, 1045)
Testing Features Shape: (347, 1045)
Training Labels Shape: (1039,)
Testing Labels Shape: (347,)

Any suggestions on this would be highly appreciated.

Best Answer

Your question is a bit unclear, but as I understand it, you want to run the model on testX_train as well as on testX_test (which are just testFeatures split into two sub-datasets).

So you can run the model on testX_train the same way as for testX_test, e.g.:

result_RFC_train = loaded_model_RFC.score(testX_train, testy_train)

Or you can remove the following line:

testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)

so that you simply do not split the data at all and run the model on the full dataset:

result_RFC_full = loaded_model_RFC.score(testFeatures, testlabels)
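Since the question also asks for the predicted Truth values (0 or 1) of the whole file as an array, a minimal sketch along the same lines (reusing the question's variable and file names, and assuming testFeatures has already been aligned with fix_columns and converted to a NumPy array as in the question's code) could be:

import pickle

# Load the saved classifier (path taken from the question)
loaded_model_RFC = pickle.load(open('../other/SOPmodel_RFC', 'rb'))

# Accuracy on the entire test file, without any train_test_split
result_RFC_full = loaded_model_RFC.score(testFeatures, testlabels)
print('Random Forest accuracy on the full test file:', result_RFC_full)

# Predicted Truth values (0 or 1) for every row, returned as a NumPy array
predictions_RFC = loaded_model_RFC.predict(testFeatures)
print(predictions_RFC[:10])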

Regarding python - Running a trained machine learning model on a different dataset, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53740141/
