
python - ValueError: unknown is not supported in sklearn.RFECV

Reposted. Author: 太空狗. Updated: 2023-10-29 22:26:37

I am trying to use RFECV to narrow down the features that are actually relevant to my classifier. This is the code I wrote:

import sklearn
import pandas as p
import numpy as np
import scipy as sp
import pylab as pl
from sklearn import linear_model, cross_validation, metrics
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.metrics import zero_one_loss
from sklearn import preprocessing
#from sklearn.feature_extraction.text import CountVectorizer
#from sklearn.feature_selection import SelectKBest, chi2

modelType = "notext"

# ----------------------------------------------------------
# Prepare the Data
# ----------------------------------------------------------
training_data = np.array(p.read_table('F:/NYC/NYU/SM/3/SNLP/Project/Data/train.tsv'))
print ("Read Data\n")

# get the target variable and set it as Y so we can predict it
Y = training_data[:,-1]

print(Y)

# not all data is numerical, so we'll have to convert those fields
# fix "is_news":
training_data[:,17] = [0 if x == "?" else 1 for x in training_data[:,17]]

# fix -1 entries in hasDomainLink
training_data[:,14] = [0 if x =="-1" else x for x in training_data[:,14]]

# fix "news_front_page":
training_data[:,20] = [999 if x == "?" else x for x in training_data[:,20]]
training_data[:,20] = [1 if x == "1" else x for x in training_data[:,20]]
training_data[:,20] = [0 if x == "0" else x for x in training_data[:,20]]

# fix "alchemy category":
training_data[:,3] = [0 if x=="arts_entertainment" else x for x in training_data[:,3]]
training_data[:,3] = [1 if x=="business" else x for x in training_data[:,3]]
training_data[:,3] = [2 if x=="computer_internet" else x for x in training_data[:,3]]
training_data[:,3] = [3 if x=="culture_politics" else x for x in training_data[:,3]]
training_data[:,3] = [4 if x=="gaming" else x for x in training_data[:,3]]
training_data[:,3] = [5 if x=="health" else x for x in training_data[:,3]]
training_data[:,3] = [6 if x=="law_crime" else x for x in training_data[:,3]]
training_data[:,3] = [7 if x=="recreation" else x for x in training_data[:,3]]
training_data[:,3] = [8 if x=="religion" else x for x in training_data[:,3]]
training_data[:,3] = [9 if x=="science_technology" else x for x in training_data[:,3]]
training_data[:,3] = [10 if x=="sports" else x for x in training_data[:,3]]
training_data[:,3] = [11 if x=="unknown" else x for x in training_data[:,3]]
training_data[:,3] = [12 if x=="weather" else x for x in training_data[:,3]]
training_data[:,3] = [999 if x=="?" else x for x in training_data[:,3]]

print ("Corrected outliers data\n")

# ----------------------------------------------------------
# Models
# ----------------------------------------------------------
if modelType == "notext":
    print ("no text model\n")
    #ignore features which are useless
    X = training_data[:,list([3, 5, 6, 7, 8, 9, 10, 14, 15, 16, 17, 19, 20, 22, 25])]
    scaler = preprocessing.StandardScaler()
    print("initialized scaler \n")
    scaler.fit(X,Y)
    print("fitted train data and labels\n")
    X = scaler.transform(X)
    print("Transformed train data\n")
    svc = SVC(kernel = "linear")
    print("Initialized SVM\n")
    rfecv = RFECV(estimator = svc, cv = 5, loss_func = zero_one_loss, verbose = 1)
    print("Initialized RFECV\n")
    rfecv.fit(X,Y)
    print("Fitted train data and label\n")
    rfecv.support_
    print ("Optimal Number of features : %d" % rfecv.n_features_)
    # savetxt lives in numpy, which is imported above as np
    np.savetxt('rfecv.csv', rfecv.ranking_, delimiter=',', fmt='%f')

When calling "rfecv.fit(X,Y)", my code throws the error "ValueError: unknown is not supported" from the metrics.py file.

The error is raised in sklearn.metrics.metrics:

# No metrics support "multiclass-multioutput" format
if (y_type not in ["binary", "multiclass", "multilabel-indicator", "multilabel-sequences"]):
    raise ValueError("{0} is not supported".format(y_type))

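This is a classification problem and the target values are only 0 or 1. The dataset can be found at Kaggle Competition Data.

I would appreciate it if someone could point out where I am going wrong.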

Best answer

RFECV checks that the target/training data is one of the types binary, multiclass, multilabel-indicator, or multilabel-sequences:

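  • 'binary': y contains <= 2 discrete values and is 1d or a column vector.
  • 'multiclass': y contains more than two discrete values, is not a sequence of sequences, and is 1d or a column vector.
  • 'multiclass-multioutput': y is a 2d array that contains more than two discrete values, is not a sequence of sequences, and both dimensions are of size > 1.
  • 'multilabel-indicator': y is a label indicator matrix, an array of two dimensions with at least two columns and at most 2 unique values.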

Your Y, however, is unknown, i.e.:

  • 'unknown': y is array-like but none of the above, such as a 3d array, or an array of non-sequence objects.
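The categories listed above can be checked directly with `type_of_target`. The following is a minimal sketch, assuming a recent scikit-learn where the function is importable from `sklearn.utils.multiclass`; the small arrays are made up for illustration and are not taken from the question's dataset.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Illustrative targets, one per category described above (not from the dataset).
examples = {
    "binary": np.array([0, 1, 1, 0]),
    "multiclass": np.array([0, 1, 2, 1]),
    "multiclass-multioutput": np.array([[1, 2], [3, 1]]),
    "multilabel-indicator": np.array([[0, 1], [1, 1]]),
    "unknown": np.zeros((2, 2, 2), dtype=int),  # a 3d array
}

# Ask scikit-learn how it classifies each target.
detected = {name: type_of_target(y) for name, y in examples.items()}

for name, kind in detected.items():
    print("%-24s -> %s" % (name, kind))
```

Running a check like this on your own Y before fitting is a quick way to see which category scikit-learn assigns to it.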

The reason is that your target data are strings (of the form "0" and "1") and were loaded by read_table as objects:

>>> training_data[:, -1].dtype
dtype('O')
>>> type_of_target(training_data[:, -1])
'unknown'
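How the object dtype arises can be seen on a tiny table. This is a sketch under stated assumptions: the three-column TSV below is a hypothetical stand-in for train.tsv, not the real file. Because the table mixes text and numeric columns, wrapping it in `np.array` gives a single object array, so the label column sliced from it is an object array as well.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for train.tsv: a text column, a column with "?", a label.
tsv = (
    "url\tscore\tlabel\n"
    "http://a.example\t0.5\t0\n"
    "http://b.example\t?\t1\n"
)

table = pd.read_table(io.StringIO(tsv))
training_data = np.array(table)   # same pattern as in the question
Y = training_data[:, -1]          # last column, as in the question

print(table.dtypes)               # per-column dtypes inside pandas
print(training_data.dtype)        # dtype of the combined array
print(Y.dtype)                    # dtype of the sliced label column
```

In this toy table, taking the label from the DataFrame first (for example `table["label"].to_numpy()`) would keep its numeric dtype, which is another way to avoid the object array.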

To fix this, you can convert to int:

>>> Y = training_data[:, -1].astype(int)
>>> type_of_target(Y)
'binary'
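The conversion can be tried end to end on synthetic data. This is only a sketch, not the asker's pipeline: the data comes from `make_classification`, the labels are deliberately stored in an object array to mimic the question, and `scoring="accuracy"` is used on the assumption that a recent scikit-learn is installed, where (as far as I know) the `loss_func` argument from the question is no longer accepted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils.multiclass import type_of_target

# Synthetic binary problem standing in for the Kaggle data (assumption).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Mimic the question: labels end up in an object array.
y_obj = y.astype(object)

# The fix from the answer: convert the labels to int before fitting.
Y = y_obj.astype(int)
print(type_of_target(Y))

# Scale the features, then run recursive feature elimination with 5-fold CV.
X = StandardScaler().fit_transform(X)
rfecv = RFECV(estimator=SVC(kernel="linear"), cv=5, scoring="accuracy")
rfecv.fit(X, Y)

print("Optimal number of features: %d" % rfecv.n_features_)
print("Feature ranking:", rfecv.ranking_)
```

The selected features are those where `rfecv.support_` is True, and they are the ones with rank 1 in `rfecv.ranking_`.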

Regarding python - ValueError: unknown is not supported in sklearn.RFECV, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20234851/
