python - How to plot a decision boundary for SVM sklearn data in Python?

Reposted by: 太空宇宙 | Updated: 2023-11-04 05:08:10

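I am reading email data from a training set and building train_matrix, train_labels and test_labels. How can I now display the decision boundary in Python using matplotlib? I am using sklearn's SVM. There are online examples that use the bundled iris data set, but plotting fails on my custom data. Here is my code.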

Error:

Traceback (most recent call last):
  File "classifier-plot.py", line 115, in <module>
    Z = Z.reshape(xx.shape)
ValueError: cannot reshape array of size 260 into shape (150,1750)

Code:

import os
import numpy as np
from collections import Counter
from sklearn import svm
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score


def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir,f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                words = line.split()
                all_words += words
    dictionary = Counter(all_words)
    list_to_remove = dictionary.keys()

    for item in list_to_remove:
        if item.isalpha() == False:
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)

    return dictionary



def extract_features(mail_dir):
    files = [os.path.join(mail_dir,fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files),3000))
    train_labels = np.zeros(len(files))
    count = 0;
    docID = 0;
    for fil in files:
      with open(fil) as fi:
        for i,line in enumerate(fi):
          if i == 2:
            words = line.split()
            for word in words:
              wordID = 0
              for i,d in enumerate(dictionary):
                if d[0] == word:
                  wordID = i
                  features_matrix[docID,wordID] = words.count(word)
        train_labels[docID] = 0;
        filepathTokens = fil.split('/')
        lastToken = filepathTokens[len(filepathTokens) - 1]
        if lastToken.startswith("spmsg"):
            train_labels[docID] = 1;
            count = count + 1
        docID = docID + 1
    return features_matrix, train_labels



TRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"

dictionary = make_Dictionary(TRAIN_DIR)

print "reading and processing emails from file."
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)


model = svm.SVC(kernel="rbf", C=10000)

print "Training model."
features_matrix = features_matrix[:len(features_matrix)/10]
labels = labels[:len(labels)/10]
#train model
model.fit(features_matrix, labels)

predicted_labels = model.predict(test_feature_matrix)

print "FINISHED classifying. accuracy score : "
print accuracy_score(test_labels, predicted_labels)







##----------------

h = .02  # step size in the mesh

# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
X = features_matrix
y = labels
svc = model.fit(X, y)
#svm.SVC(kernel='linear', C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = y[:].min() - 1, y[:].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel']



Z = predicted_labels#svc.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title(titles[0])

plt.show()

Best answer

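In the tutorial you are following, Z is computed by applying the classifier to a set of generated feature vectors that form a regular NxM grid. That is what makes the plot smooth.

When you replace

Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])

with

Z = predicted_labels

you swap this regular grid for the predictions made on your data set. The next line then fails, because an array of size len(files) cannot be reshaped into an NxM matrix. There is no reason why len(files) should equal NxM.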
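The size mismatch can be reproduced without any classifier. The following is a minimal sketch with a tiny mesh and zero-filled stand-in arrays (the sizes 12 and 7 are arbitrary illustration values, not taken from the question): an array with one value per mesh node reshapes cleanly, while an array of unrelated length raises the same kind of ValueError as in the traceback.

```python
import numpy as np

# A small mesh: xx and yy both have shape (3, 4)
xx, yy = np.meshgrid(np.arange(0.0, 1.0, 0.25), np.arange(0.0, 1.5, 0.5))
grid_points = np.c_[xx.ravel(), yy.ravel()]  # one row per mesh node

# Stand-in for classifier output on the mesh: one value per mesh node
Z_grid = np.zeros(len(grid_points))
Z_ok = Z_grid.reshape(xx.shape)  # works, because 12 == 3 * 4

# Stand-in for predictions on a data set whose size is unrelated to the mesh
Z_data = np.zeros(7)
try:
    Z_data.reshape(xx.shape)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print(err)
```

Only predictions made on np.c_[xx.ravel(), yy.ravel()] are guaranteed to have exactly xx.size entries.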

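There is a reason why you cannot follow the tutorial directly. Your data has 3000 dimensions, so your decision boundary would be a 2999-dimensional hypersurface in a 3000-dimensional space. That is not easy to visualize.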

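In the tutorial the data has 4 dimensions, which are reduced to 2 for visualization. The best way to reduce the dimension of your data depends on the data. The tutorial simply keeps the first two components of each 4-dimensional vector.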
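For reference, the tutorial's reduction step amounts to a column slice. This is a minimal sketch using scikit-learn's bundled iris data; the variable names X_full, X_2d and clf are chosen here for illustration and are not taken from the tutorial.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
X_full = iris.data      # shape (150, 4): four measurements per flower
X_2d = X_full[:, :2]    # keep only the first two features
y = iris.target

# The classifier is trained on the 2D data, so it can later be
# evaluated on every node of a 2D mesh.
clf = svm.SVC(kernel="linear", C=1.0).fit(X_2d, y)
```

Dropping columns like this only works when the retained features are informative on their own, which is unlikely for 3000 word counts.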

Another approach that works well in many cases is to reduce the dimension of the data with principal component analysis (PCA).

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
# PCA is unsupervised, so the labels are not needed here;
# fit_transform fits the projection and applies it in one step
reduced_matrix = pca.fit_transform(features_matrix)
model.fit(reduced_matrix, labels)

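A model fitted this way can be used for a 2D visualization. You can follow the tutorial directly (with the mesh xx, yy built from the two columns of reduced_matrix rather than from the original features) and define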

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
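One consequence of training on the reduced data is that any other data, such as test_feature_matrix from the question, has to pass through the same fitted projection before it reaches model.predict. A minimal sketch, assuming random stand-in arrays (train_X, train_y and test_X are hypothetical placeholders for the email features and labels):

```python
import numpy as np
from sklearn import svm
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical stand-ins for features_matrix, labels and test_feature_matrix
train_X = rng.rand(60, 300)
train_y = np.arange(60) % 2
test_X = rng.rand(20, 300)

pca = PCA(n_components=2)
train_2d = pca.fit_transform(train_X)  # fit the projection on training data only
model = svm.SVC(kernel="rbf", C=1.0).fit(train_2d, train_y)

test_2d = pca.transform(test_X)        # reuse the fitted projection, do not refit
predicted = model.predict(test_2d)
```

Calling fit_transform again on the test data would produce a different projection, and the points would no longer line up with the plotted boundary.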

A complete but not very impressive example

We do not have access to your email data, so for illustration we can just use random data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.decomposition import PCA

# initialize algorithms and data with random values
model = svm.SVC(gamma=0.001, C=100.0)
pca = PCA(n_components=2)
rng = np.random.RandomState(0)
U = rng.rand(200, 2000)                 # 200 samples, 2000 features
v = (rng.rand(200) * 2).astype('int')   # random 0/1 labels
# PCA ignores the labels, so a single fit_transform on U is enough
U2 = pca.fit_transform(U)
model.fit(U2, v)

# generate grid for plotting
h = 0.2
x_min, x_max = U2[:, 0].min() - 1, U2[:, 0].max() + 1
y_min, y_max = U2[:, 1].min() - 1, U2[:, 1].max() + 1
xx, yy = np.meshgrid(
    np.arange(x_min, x_max, h),
    np.arange(y_min, y_max, h))

# create decision boundary plot: predict on every mesh node,
# then reshape the predictions back to the shape of the mesh
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(U2[:, 0], U2[:, 1], c=v)
plt.show()

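This produces a decision boundary that does not look very impressive.

In fact, the first two principal components together capture less than 2% of the variance in the data: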

>>> print(pca.explained_variance_ratio_) 
[ 0.00841935  0.00831764]
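The same check can be written so that it reports the combined share directly. This is a small sketch that regenerates the random matrix U from the example above with the same seed; exact digits may differ slightly between scikit-learn versions and SVD solvers, so no output is quoted here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
U = rng.rand(200, 2000)  # same random data as in the example above

pca = PCA(n_components=2).fit(U)
ratios = pca.explained_variance_ratio_  # one entry per retained component
total = ratios.sum()

print("per component:", ratios)
print("first two components together: {:.2%}".format(total))
```

A small total like this means the 2D picture discards almost all of the structure in the data, which is why the boundary looks uninformative.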

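If you now introduce just a little carefully disguised asymmetry, you can already see an effect.

Modify the data by shifting one randomly chosen coordinate of every sample labeled 1: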

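random_shifts = (rng.rand(2000)*200).astype('int')
# MM is not defined in the original; the number of samples (rows of U) is assumed here
MM = U.shape[0]
for i in range(MM):
    if v[i] == 1:
        U[i, random_shifts[i]] += 5.0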

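Apply PCA again and you get a somewhat more informative picture.

Note that here the first two principal components already explain about 5% of the variance, and the red part of the picture contains far more red points than the blue part does.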

Regarding "python - How to plot a decision boundary for SVM sklearn data in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43778380/
