python - ValueError when using a linear SVM with scikit-learn


Reposted by: 太空狗. Updated: 2023-10-29 21:31:45

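I am currently working on large-scale hierarchical text classification of ODP documents. The dataset I was given is in libSVM format. I am trying to run scikit-learn's linear-kernel SVM in Python to build a model. Here is some sample data from the training set: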

29 9454:1 11742:1 18884:14 26840:1 35147:1 52782:1 72083:1 73244:1 78945:1 79913:1 79986:1 86710:3 117286:1 139820:1 142458:1 146315:1 151005:2 161454:3 172237:1 1091130:1 1113562:1 1133451:1 1139046:1 1157534:1 1180618:2 1182024:1 1187711:1 1194345:3 

33 2474:1 8152:1 19529:2 35038:1 48104:1 59738:1 61854:3 67943:1 74093:1 78945:1 88558:1 90848:1 97087:1 113284:16 118917:1 122375:1 124939:1

Below is the code I use to build the linear SVM model:

from sklearn.datasets import load_svmlight_file
from sklearn import svm
X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
X_test, y_test = load_svmlight_file("/path-to-file/test.txt")
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
print clf.score(X_test,y_test)

When I run clf.score(), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b285fbfb3efe> in <module>()
      1 start_time = time.time()
----> 2 print clf.score(X_test,y_test)
      3 print time.time() - start_time, "seconds"

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
    292         """
    293         from .metrics import accuracy_score
--> 294         return accuracy_score(y, self.predict(X))
    295 
    296 

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    464             Class labels for samples in X.
    465         """
--> 466         y = super(BaseSVC, self).predict(X)
    467         return self.classes_.take(y.astype(np.int))
    468 

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    280         y_pred : array, shape (n_samples,)
    281         """
--> 282         X = self._validate_for_predict(X)
    283         predict = self._sparse_predict if self._sparse else self._dense_predict
    284         return predict(X)

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_for_predict(self, X)
    402             raise ValueError("X.shape[1] = %d should be equal to %d, "
    403                              "the number of features at training time" %
--> 404                              (n_features, self.shape_fit_[1]))
    405         return X
    406 

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

Can someone tell me what exactly is wrong with this code or with my data? Thanks in advance.

The values of X_train, y_train, X_test, and y_test are shown below:

X_train:

  (0, 9453)         1.0
  (0, 11741)    1.0
  (0, 18883)    14.0
  (0, 26839)    1.0
  (0, 35146)    1.0
  (0, 52781)    1.0
  (0, 72082)    1.0
  (0, 73243)    1.0
  (0, 78944)    1.0
  (0, 79912)    1.0
  (0, 79985)    1.0
  (0, 86709)    3.0
  (0, 117285)   1.0
  (0, 139819)   1.0
  (0, 142457)   1.0
  (0, 146314)   1.0
  (0, 151004)   2.0
  (0, 161453)   3.0
  (0, 172236)   1.0
  (0, 187531)   2.0
  (0, 202462)   1.0
  (0, 210417)   1.0
  (0, 250581)   1.0
  (0, 251689)   1.0
  (0, 296384)   2.0
  : :
  (4462, 735469)    1.0
  (4462, 737059)    15.0
  (4462, 740127)    1.0
  (4462, 743798)    1.0
  (4462, 766063)    1.0
  (4462, 778958)    2.0
  (4462, 784004)    4.0
  (4462, 837264)    2.0
  (4462, 839095)    22.0
  (4462, 844735)    6.0
  (4462, 859721)    2.0
  (4462, 875267)    1.0
  (4462, 910761)    1.0
  (4462, 931244)    1.0
  (4462, 945069)    6.0
  (4462, 948728)    1.0
  (4462, 948850)    2.0
  (4462, 957682)    1.0
  (4462, 975170)    1.0
  (4462, 989192)    1.0
  (4462, 1014294)   1.0
  (4462, 1042424)   1.0
  (4462, 1049027)   1.0
  (4462, 1072931)   1.0
  (4462, 1145790)   1.0

y_train:

[  2.90000000e+01   3.30000000e+01   3.30000000e+01 ...,   1.65475000e+05
   1.65518000e+05   1.65518000e+05]

X_test:

  (0, 18573)    1.0
  (0, 23501)    1.0
  (0, 29954)    1.0
  (0, 42112)    1.0
  (0, 46402)    1.0
  (0, 63041)    2.0
  (0, 67942)    2.0
  (0, 83522)    1.0
  (0, 88413)    2.0
  (0, 99454)    1.0
  (0, 126041)   1.0
  (0, 139819)   1.0
  (0, 142678)   1.0
  (0, 151004)   1.0
  (0, 166351)   2.0
  (0, 173794)   1.0
  (0, 192162)   3.0
  (0, 210417)   2.0
  (0, 254468)   1.0
  (0, 263895)   2.0
  (0, 277567)   1.0
  (0, 278419)   2.0
  (0, 279181)   2.0
  (0, 281319)   2.0
  (0, 298898)   1.0
  : :
  (1857, 1100504)   3.0
  (1857, 1103247)   1.0
  (1857, 1105578)   1.0
  (1857, 1108986)   2.0
  (1857, 1118486)   1.0
  (1857, 1120807)   9.0
  (1857, 1129243)   2.0
  (1857, 1131786)   1.0
  (1857, 1134029)   2.0
  (1857, 1134410)   5.0
  (1857, 1134494)   1.0
  (1857, 1139045)   25.0
  (1857, 1142239)   3.0
  (1857, 1142651)   1.0
  (1857, 1144787)   1.0
  (1857, 1151891)   1.0
  (1857, 1152094)   1.0
  (1857, 1157533)   1.0
  (1857, 1159376)   1.0
  (1857, 1178944)   1.0
  (1857, 1181310)   2.0
  (1857, 1182023)   1.0
  (1857, 1187098)   1.0
  (1857, 1194344)   2.0
  (1857, 1195819)   9.0

y_test:

[  2.90000000e+01   3.30000000e+01   1.56000000e+02 ...,   1.65434000e+05
   1.65475000e+05   1.65518000e+05]

Best answer

The error message

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

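is self-explanatory: the number of features in the test data differs from the number of features in the training data used to fit the model. That is, X_train.shape[1] is not equal to X_test.shape[1].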

You should check why they are not equal, because they should be.
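One way to run that check is sketched below. The two small matrices are hypothetical stand-ins for the real X_train and X_test (they are not the asker's data); the idea is to compare the widths and list which columns are used in the test matrix but fall outside the width seen at training time.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for the matrices returned by load_svmlight_file.
X_train = csr_matrix(np.array([[1, 0, 2, 0, 0],
                               [0, 3, 0, 0, 1]], dtype=float))
X_test = csr_matrix(np.array([[1, 0, 0, 0, 0, 0, 4],
                              [0, 2, 0, 0, 1, 0, 0]], dtype=float))

# The two widths are what the ValueError compares.
print("train width:", X_train.shape[1])
print("test width:", X_test.shape[1])

# Column indices holding at least one stored value in the test matrix
# that lie beyond the width the model was trained on.
used_test_cols = np.unique(X_test.indices)
unseen_cols = used_test_cols[used_test_cols >= X_train.shape[1]]
print("test-only columns:", unseen_cols)
```

If unseen_cols is non-empty, the test file references feature indices that never occur in the training file, which matches the situation described in the error.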

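One possibility is that both are loaded as sparse matrices, with the number of features inferred by load_svmlight_file. If the test data contains features that never appear in the training data, the resulting X_test may have more columns than X_train. To avoid this, you can specify the number of features explicitly by passing the n_features argument to load_svmlight_file.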
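A minimal sketch of that fix follows. The file contents are tiny made-up examples written to a temporary directory, not the asker's data. It shows two options: passing a common n_features to load_svmlight_file, and loading both files in one call with load_svmlight_files, which scikit-learn documents as constraining all returned matrices to the same number of features. Note that n_features has to be at least as large as the widest file, so here it is taken as the maximum of the two inferred widths.

```python
import os
import tempfile

from sklearn import svm
from sklearn.datasets import load_svmlight_file, load_svmlight_files

# Made-up data: the test file uses feature index 9, which never
# appears in the training file.
train_text = "1 1:1 3:2\n2 2:1 5:1\n1 1:2 4:1\n2 2:3 5:2\n"
test_text = "1 1:1 9:1\n2 2:1 5:1\n"

tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.txt")
test_path = os.path.join(tmp_dir, "test.txt")
with open(train_path, "w") as f:
    f.write(train_text)
with open(test_path, "w") as f:
    f.write(test_text)

# Loading separately: each width is inferred from that file alone,
# so the two matrices end up with different numbers of columns.
X_train_sep, _ = load_svmlight_file(train_path)
X_test_sep, _ = load_svmlight_file(test_path)
print(X_train_sep.shape[1], X_test_sep.shape[1])

# Option 1: pass the same n_features to both calls.
n = max(X_train_sep.shape[1], X_test_sep.shape[1])
X_train2, y_train2 = load_svmlight_file(train_path, n_features=n)
X_test2, y_test2 = load_svmlight_file(test_path, n_features=n)

# Option 2: load both files together so the widths agree.
X_train, y_train, X_test, y_test = load_svmlight_files([train_path, test_path])

# With matching widths, scoring no longer raises the ValueError.
clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(acc)
```

Either option gives the classifier training and test matrices of equal width. Features that occur only in the test file simply become all-zero columns in the training matrix.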

Regarding "python - ValueError when using a linear SVM with scikit-learn", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22167095/
