
python - Convert an (n_samples, n_features) ndarray into an (n_samples, 1) array of vectors for use as training labels for an sklearn SVM

Reposted. Author: 太空宇宙. Updated: 2023-11-03 21:42:02

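I am trying to compute the ROC curve and AUC for an SVM model I am building, following the code in this sklearn example. One requirement is that the output labels y be binarized. I did this by creating a MultiLabelBinarizer and encoding all of the labels, which works fine. However, that produces an (n_samples, n_features) ndarray, while classifier.fit(X, y) assumes y.shape = (n_samples,). I essentially want to "mash" the columns of y together so that y[0][0] returns the entire feature vector rather than just the first value of V.

Here is my code: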

    import pandas
    from sklearn import svm
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MultiLabelBinarizer

    # df has a "present" column (lists of user IDs) and a "member" column (one user ID)
    enc = MultiLabelBinarizer()
    print("Encoding data...")
    # Fit the encoder onto all possible data values
    print(pandas.DataFrame(enc.fit_transform(df["present"] + df["member"].apply(str).apply(lambda x: [x])),
                           columns=enc.classes_, index=df.index))
    X, y = enc.transform(df["present"]), list(df["member"].apply(str))
    print("Training svm...")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
    y_train = enc.transform([[x] for x in y_train])  # Strings to one-hot vectors, shape (n_samples, n_classes)
    svc = svm.SVC(C=1.1, kernel="linear", probability=True, class_weight='balanced')
    svc.fit(X_train, y_train)  # Raises ValueError: y_train should be 1D but isn't

The exception I get is:

    Traceback (most recent call last):
      File "C:/Users/SawyerPC/PycharmProjects/DiscordSocialGraph/encode_and_train.py", line 129, in <module>
        enc, clf, split_data = encode_and_train(df)
      File "C:/Users/SawyerPC/PycharmProjects/DiscordSocialGraph/encode_and_train.py", line 57, in encode_and_train
        svc.fit(X_train, y_train) # TODO y_train needs to be flattened to (n_samples,)
      File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\svm\base.py", line 149, in fit
        X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
      File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 547, in check_X_y
        y = column_or_1d(y, warn=True)
      File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in column_or_1d
        raise ValueError("bad input shape {0}".format(shape))
    ValueError: bad input shape (5000, 10)
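
The shape mismatch behind this error can be reproduced on a small scale. The sketch below uses made-up toy data (the arrays `X` and `members` are hypothetical, not the asker's DataFrame) and shows that a one-hot target is 2-D and is rejected by `SVC.fit`, while an integer-encoded target is 1-D and is accepted. The exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.svm import SVC

# Hypothetical toy data: 6 samples, 2 features, 3 user IDs as string labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]], dtype=float)
members = ["a", "b", "c", "a", "b", "c"]

# One-hot encoding yields a 2-D indicator matrix with one column per class
y_onehot = MultiLabelBinarizer().fit_transform([[m] for m in members])
print(y_onehot.shape)  # (6, 3)

# SVC.fit rejects a target with more than one column
try:
    SVC(kernel="linear").fit(X, y_onehot)
    fit_failed = False
except ValueError as exc:
    fit_failed = True
    print("fit failed:", exc)

# Integer class labels are 1-D, which is the shape SVC.fit expects
y_int = LabelEncoder().fit_transform(members)
print(y_int.shape)  # (6,)
clf = SVC(kernel="linear").fit(X, y_int)
```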

Best answer

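I ended up solving this with LabelEncoder. Thanks, @G. Anderson. flat_member_list simply collects every user ID encountered in the labels y and the vectors X; fitting the LabelEncoder on it reduces the list to the set of unique IDs.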

    import numpy as np
    import pandas
    from sklearn import svm
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

    # Encode "present" users as one-hot vectors
    mlb = MultiLabelBinarizer()
    print("Encoding data...")
    mlb.fit(df["present"] + df["member"].apply(str).apply(lambda x: [x]))

    # Encode user labels as ints, giving a 1-D target of shape (n_samples,)
    enc = LabelEncoder()
    # pandas.concat replaces Series.append, which was removed in pandas 2.0
    flat_member_list = pandas.concat([df["member"].apply(str),
                                      pandas.Series(np.concatenate(df["present"].tolist()))])
    enc.fit(flat_member_list)
    X, y = mlb.transform(df["present"]), enc.transform(df["member"].apply(str))
    print("Training svm...")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0, stratify=y)
    svc = svm.SVC(C=0.317, kernel="linear", probability=True)
    svc.fit(X_train, y_train)
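
Since the original goal was ROC and AUC, it may help to note that binarized labels are only needed at scoring time, not for `fit`. The following is a minimal sketch of one way to do that, not necessarily what the sklearn example referenced in the question does. It uses synthetic stand-in data from `make_classification` (an assumption; the real `X` and `y` would come from the encoders above) and computes a one-vs-rest AUC per class from `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Hypothetical stand-in data with 3 classes and 1-D integer labels
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Train on the 1-D labels, as SVC.fit requires
svc = SVC(kernel="linear", probability=True, random_state=0)
svc.fit(X_train, y_train)

# Binarize the test labels only for scoring: shape (n_samples, n_classes)
y_test_bin = label_binarize(y_test, classes=svc.classes_)
y_score = svc.predict_proba(X_test)  # columns follow the order of svc.classes_

# One-vs-rest AUC for each class, then an unweighted (macro) average
per_class_auc = [roc_auc_score(y_test_bin[:, i], y_score[:, i])
                 for i in range(len(svc.classes_))]
macro_auc = float(np.mean(per_class_auc))
print(per_class_auc, macro_auc)
```

Passing `classes=svc.classes_` to `label_binarize` keeps the indicator columns in the same order as the probability columns, so column `i` of both arrays refers to the same class.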

A similar question about "python - Convert an (n_samples, n_features) ndarray into an (n_samples, 1) array of vectors for use as training labels for an sklearn SVM" can be found on Stack Overflow: https://stackoverflow.com/questions/52787553/
