python - KeyError "not in index" during cross-validation

Reposted. Author: 太空狗. Updated: 2023-10-30 01:47:34

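I have applied an SVM to my dataset. The dataset is multi-label, meaning each observation can have more than one label.

When I run KFold cross-validation, it raises a KeyError: "not in index".

It reports that indices 601 through 6007 are not in the index (I have 6008 data samples, 1...6008).

Here is my code: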

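    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import KFold
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    df = pd.read_csv("finalupdatedothers.csv")
    categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']
    X = df[['sentences']]
    y = df[categories]
    # stop_words is not defined in the posted snippet; scikit-learn's
    # built-in English list is assumed here.
    stop_words = 'english'

    SVC_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words=stop_words)),
        ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ])

    kf = KFold(n_splits=10)
    for train_index, test_index in kf.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        # train_index/test_index are row positions, so select with .iloc
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Fit and score one binary classifier per label within each fold
        for category in categories:
            print('... Processing {}'.format(category))
            SVC_pipeline.fit(X_train['sentences'], y_train[category])
            prediction = SVC_pipeline.predict(X_test['sentences'])
            # Score against the true labels in y_test (X_test holds only the text)
            print('SVM Linear Test accuracy is {}'.format(accuracy_score(y_test[category], prediction)))
            print('SVM Linear f1 measurement is {}'.format(f1_score(y_test[category], prediction, average='weighted')))
            # Pair each test sentence with its predicted 0/1 value for this label
            print(list(zip(X_test['sentences'], prediction)))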

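Actually, I don't know how to apply KFold cross-validation so that I can get the F1 score and accuracy for each label separately. I looked at this and this, but neither showed me how to apply it to my case.

For reproducibility, here is a small sample of the data frame. The last seven columns are my labels: ADR, WD, ...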

,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1

Update

When I did what Vivek Kumar suggested, it raised this error:

ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]

in the classifier part. Do you know how to fix it?

Several Stack Overflow links say I need to reshape the training data. I tried that too, without success (link). Thanks :)
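A likely cause of this ValueError, sketched under assumptions: if the whole one-column DataFrame (rather than the `sentences` column) is passed to the vectorizer, iterating it yields column names, so the vectorizer sees a single document. That would be consistent with the `[1, 5408]` in the message: 1 sample on the X side against 5408 label rows, which is about nine tenths of 6008 in a 10-fold split. The three sentences below are taken from the sample data; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"sentences": [
    "I am detoxing from Lexapro now.",
    "I have no idea when this will end.",
    "I am now 10 days completely off and OMG is it rough.",
]})

# Iterating a DataFrame yields its column names, so the vectorizer
# treats the string "sentences" as the only document.
n_docs_from_frame = TfidfVectorizer().fit_transform(df[["sentences"]]).shape[0]

# Selecting the column gives a Series that iterates over the actual texts.
n_docs_from_series = TfidfVectorizer().fit_transform(df["sentences"]).shape[0]

print("documents seen from DataFrame:", n_docs_from_frame)
print("documents seen from Series:", n_docs_from_series)
```

Reshaping the input does not address this; selecting the text column by name before fitting does.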

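Best answer

train_index and test_index are integer positions based on the number of rows. Pandas indexing does not work that way, though: newer versions of pandas are stricter about how you slice or select data from a DataFrame.

You need to use .iloc to access the data. More information is available here.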
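To illustrate the difference, here is a minimal sketch on a made-up six-row frame (the values are placeholders, not the asker's data). Plain `[]` indexing with an integer array is interpreted as a column lookup and fails, while `.iloc` selects rows by position.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame standing in for the question's data (values are made up).
X = pd.DataFrame({"sentences": ["a", "b", "c", "d", "e", "f"]})

kf = KFold(n_splits=3)
train_index, test_index = next(kf.split(X))

# Positional selection: works regardless of what the index labels are.
X_train = X.iloc[train_index]
X_test = X.iloc[test_index]

# [] with an integer array looks the integers up as column labels,
# which do not exist here, so pandas raises KeyError.
try:
    X[train_index]
    raised = False
except KeyError:
    raised = True

print("rows in train/test:", len(X_train), len(X_test))
print("KeyError raised by X[train_index]:", raised)
```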

This is what you need:

    for train_index, test_index in kf.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        ...
        ...

        # TfidfVectorizer doesn't work with a DataFrame,
        # because iterating a DataFrame gives the column names, not the actual data.
        # So specify the column name explicitly to get the sentences.

        SVC_pipeline.fit(X_train['sentences'], y_train[category])

        prediction = SVC_pipeline.predict(X_test['sentences'])
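Putting the pieces together, the following is one possible sketch of collecting accuracy and F1 per label and per fold. It is not the answerer's code: the helper name `per_label_cv_scores`, the toy rows, and the two-label setup are assumptions made for illustration, and the toy data is arranged so that every training fold contains both classes (LinearSVC cannot fit a label that has only one class in the training rows).

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def per_label_cv_scores(df, text_col, label_cols, n_splits=3):
    """Hypothetical helper: one accuracy and F1 value per label per fold."""
    scores = {c: {"accuracy": [], "f1": []} for c in label_cols}
    kf = KFold(n_splits=n_splits)
    for train_index, test_index in kf.split(df):
        # Positional row selection, as the answer recommends
        train, test = df.iloc[train_index], df.iloc[test_index]
        for category in label_cols:
            # A fresh pipeline per fit keeps folds and labels independent
            model = Pipeline([
                ("tfidf", TfidfVectorizer()),
                ("clf", OneVsRestClassifier(LinearSVC())),
            ])
            model.fit(train[text_col], train[category])
            prediction = model.predict(test[text_col])
            scores[category]["accuracy"].append(
                accuracy_score(test[category], prediction))
            scores[category]["f1"].append(
                f1_score(test[category], prediction,
                         average="weighted", zero_division=0))
    return scores


# Made-up toy rows (not the asker's data); labels alternate so each
# training fold of the unshuffled 3-fold split sees both classes.
toy = pd.DataFrame({
    "sentences": [
        "hair loss and weight gain",
        "I stopped the medication",
        "memory loss after a week",
        "I have no idea",
        "severe weight gain",
        "the dose was reduced",
    ],
    "ADR": [1, 0, 1, 0, 1, 0],
    "others": [0, 1, 0, 1, 0, 1],
})

scores = per_label_cv_scores(toy, "sentences", ["ADR", "others"])
for label, s in scores.items():
    print(label,
          "mean accuracy:", sum(s["accuracy"]) / len(s["accuracy"]),
          "mean F1:", sum(s["f1"]) / len(s["f1"]))
```

On real data with rare labels, a shuffled or stratified split may be needed so that no training fold is missing a class; that choice is left open here.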

Regarding python - KeyError "not in index" during cross-validation, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/51852551/
