
python - Multiclass SVM does not work with the 20 newsgroups dataset

Reposted. Author: 太空宇宙. Updated: 2023-11-03 21:25:10

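I am trying to use the multiclass SVM code from Mblondel's Multiclass SVM. I read his paper, in which he uses the 20newsgroup dataset from sklearn, but when I try to use it, the code does not work properly.

I tried changing the code to fit the 20newsgroup dataset, but I am stuck on this error: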

Traceback (most recent call last):
  File "F:\env\chatbotstripped\CSSVM.py", line 157, in <module>
    clf.fit(X, y)
  File "F:\env\chatbotstripped\CSSVM.py", line 106, in fit
    v = self._violation(g, y, i)
  File "F:\env\chatbotstripped\CSSVM.py", line 50, in _violation
    elif k != y[i] and self.dual_coef_[k, i] >= 0:
IndexError: index 20 is out of bounds for axis 0 with size 20

Here is the main code:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
news_train = fetch_20newsgroups(subset='train')
X, y = news_train.data[:100], news_train.target[:100]

clf = MulticlassSVM(C=0.1, tol=0.01, max_iter=100, random_state=0, verbose=1)
X = TfidfVectorizer().fit_transform(X)
clf.fit(X, y)
print(clf.score(X, y))

Here is the fit code:

def fit(self, X, y):
    n_samples, n_features = X.shape

    self._label_encoder = LabelEncoder()
    y = self._label_encoder.fit_transform(y)

    n_classes = len(self._label_encoder.classes_)
    self.dual_coef_ = np.zeros((n_classes, n_samples), dtype=np.float64)
    self.coef_ = np.zeros((n_classes, n_features))

    norms = np.sqrt(np.sum(X.power(2), axis=1))  # i changed this code

    rs = check_random_state(self.random_state)
    ind = np.arange(n_samples)
    rs.shuffle(ind)

    # i added this sparse
    sparse = sp.isspmatrix(X)
    if sparse:
        X = np.asarray(X.data, dtype=np.float64, order='C')

    for it in range(self.max_iter):
        violation_sum = 0
        for ii in range(n_samples):
            i = ind[ii]

            if norms[i] == 0:
                continue

            g = self._partial_gradient(X, y, i)
            v = self._violation(g, y, i)
            violation_sum += v

            if v < 1e-12:
                continue

            delta = self._solve_subproblem(g, y, norms, i)
            self.coef_ += (delta * X[i][:, np.newaxis]).T
            self.dual_coef_[:, i] += delta

        if it == 0:
            violation_init = violation_sum

        vratio = violation_sum / violation_init

        if self.verbose >= 1:
            print("iter", it + 1, "violation", vratio)

        if vratio < self.tol:
            if self.verbose >= 1:
                print("Converged")
            break
    return self

And the _violation code:

def _violation(self, g, y, i):
    smallest = np.inf
    for k in range(g.shape[0]):
        if k == y[i] and self.dual_coef_[k, i] >= self.C:
            continue
        elif k != y[i] and self.dual_coef_[k, i] >= 0:
            continue

        smallest = min(smallest, g[k].all())  # and i added .all()
    return g.max() - smallest

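I know something is wrong with the indexing, but I don't know how to fix it, and I don't want to break the code because I don't really understand how it works.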
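One plausible reading of the traceback can be sketched in isolation. The snippet below is a hedged illustration, not the code from CSSVM.py: it assumes that `_partial_gradient` (whose body is not shown above) computes something like `np.dot(X[i], self.coef_.T) + 1`, and it uses small made-up sizes in place of the real data. Under that assumption, replacing `X` with the flat `X.data` array turns `X[i]` into a scalar, so the gradient becomes a 2-D array whose first axis has `n_features` entries rather than `n_classes`, and the loop over `range(g.shape[0])` runs past the 20 rows of `dual_coef_`.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical small sizes; only n_classes = 20 mirrors the question.
n_samples, n_features, n_classes = 5, 30, 20

# A small random sparse matrix standing in for the TF-IDF output.
rng = np.random.RandomState(0)
dense = rng.rand(n_samples, n_features)
dense[dense < 0.7] = 0.0
X_sparse = sp.csr_matrix(dense)

coef = np.zeros((n_classes, n_features))
dual_coef = np.zeros((n_classes, n_samples))

# What the modified fit() does: keep only the flat array of non-zero values.
X_flat = np.asarray(X_sparse.data, dtype=np.float64, order="C")
print(X_flat.ndim)  # 1

# ASSUMED form of _partial_gradient; X_flat[0] is a scalar, so the result is 2-D.
g_bad = np.dot(X_flat[0], coef.T) + 1
print(g_bad.shape)  # (30, 20)

# Looping over g_bad.shape[0] as _violation does walks past the 20 class rows.
try:
    for k in range(g_bad.shape[0]):
        dual_coef[k, 0]
except IndexError as exc:
    print(exc)  # index 20 is out of bounds for axis 0 with size 20

# With a dense 2-D array, X[i] is a full feature row and g has one entry per class.
X_dense = X_sparse.toarray()
g_ok = np.dot(X_dense[0], coef.T) + 1
print(g_ok.shape)  # (20,)
```

If this reading is right, it would also explain why `min(smallest, g[k])` seemed to need `.all()`: `g[k]` would be a whole row rather than a single number.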

Best answer

You have to convert the sparse matrix output of the TF-IDF vectorizer to a dense matrix and pass it in as a 2-D array. Try this!

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# MulticlassSVM is the class from the question and must be defined or imported first.
news_train = fetch_20newsgroups(subset='train')
text, y = news_train.data[:1000], news_train.target[:1000]

clf = MulticlassSVM(C=0.1, tol=0.01, max_iter=100, random_state=0, verbose=1)
vectorizer = TfidfVectorizer(min_df=20, stop_words='english')
# Convert the sparse TF-IDF matrix to a dense 2-D NumPy array before fitting.
X = np.asarray(vectorizer.fit_transform(text).todense())
clf.fit(X, y)
print(clf.score(X, y))

Output:

iter 1 violation 1.0
iter 2 violation 0.07075102408683964
iter 3 violation 0.018288133735158228
iter 4 violation 0.009149083942255389
Converged
0.953
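To make the dense conversion a little more concrete, here is a minimal sketch on a tiny hand-written matrix. The values are made up, and a SciPy `csr_matrix` stands in for the vectorizer output, so treat it as an illustration of the array shapes rather than of the real dataset. It shows that `todense()` yields a `numpy.matrix` whose rows stay 2-D, that wrapping it in `np.asarray` gives the same plain array as `toarray()`, and roughly how much memory the dense copy needs.

```python
import numpy as np
import scipy.sparse as sp

# Small CSR matrix standing in for the TF-IDF output (hypothetical values).
X_sparse = sp.csr_matrix(np.array([[0.0, 0.6, 0.8],
                                   [1.0, 0.0, 0.0]]))

X_matrix = X_sparse.todense()   # numpy.matrix for a csr_matrix input
X_array = X_sparse.toarray()    # plain numpy.ndarray

# Rows of a numpy.matrix stay 2-D; rows of an ndarray are 1-D feature vectors.
print(X_matrix[0].shape)  # (1, 3)
print(X_array[0].shape)   # (3,)

# np.asarray on the matrix gives the same values as toarray().
print(np.array_equal(np.asarray(X_matrix), X_array))  # True

# Row norms can be computed with plain NumPy once the data is dense.
norms = np.sqrt(np.sum(X_array ** 2, axis=1))
print(norms.shape)  # (2,)

# A dense float64 copy costs n_samples * n_features * 8 bytes.
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * X_array.itemsize
print(dense_bytes)  # 48
```

Two caveats follow from this. A NumPy array has no `power` method, so the `X.power(2)` line from the question would presumably have to go back to an element-wise form such as `X ** 2` once the input is dense. And because the dense copy grows with the vocabulary size, settings like `min_df=20` and `stop_words='english'` likely help keep it manageable.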

Regarding "python - Multiclass SVM does not work with the 20 newsgroups dataset", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53895434/
