
machine-learning - SVM: scoring function

Reposted. Author: 行者123. Updated: 2023-11-30 09:01:10

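I got the following output from Weka for an SVM classification. I want to plot the SVM classifier output as anomaly or normal. How can I get the SVM scoring function from this output?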

=== Run information ===

Scheme:       weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007"
Relation: KDDTrain
Instances: 125973
Attributes: 42
duration
protocol_type
service
flag
src_bytes
dst_bytes
land
wrong_fragment
urgent
hot
num_failed_logins
logged_in
num_compromised
root_shell
su_attempted
num_root
num_file_creations
num_shells
num_access_files
num_outbound_cmds
is_host_login
is_guest_login
count
srv_count
serror_rate
srv_serror_rate
rerror_rate
srv_rerror_rate
same_srv_rate
diff_srv_rate
srv_diff_host_rate
dst_host_count
dst_host_srv_count
dst_host_same_srv_rate
dst_host_diff_srv_rate
dst_host_same_src_port_rate
dst_host_srv_diff_host_rate
dst_host_serror_rate
dst_host_srv_serror_rate
dst_host_rerror_rate
dst_host_srv_rerror_rate
class
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

SMO

Kernel used:
Linear Kernel: K(x,y) = <x,y>

Classifier for classes: normal, anomaly

BinarySMO

Machine linear: showing attribute weights, not support vectors.

-0.0498 * (normalized) duration
+ 0.5131 * (normalized) protocol_type=tcp
+ -0.6236 * (normalized) protocol_type=udp
+ 0.1105 * (normalized) protocol_type=icmp
+ -1.1861 * (normalized) service=auth
+ 0 * (normalized) service=bgp
+ 0 * (normalized) service=courier
+ 1 * (normalized) service=csnet_ns
+ 1 * (normalized) service=ctf
+ 1 * (normalized) service=daytime
+ -0 * (normalized) service=discard
+ -1.2505 * (normalized) service=domain
+ -0.6878 * (normalized) service=domain_u
+ 0.9418 * (normalized) service=echo
+ 1.1964 * (normalized) service=eco_i
+ 0.9767 * (normalized) service=ecr_i
+ 0.0073 * (normalized) service=efs
+ 0.0595 * (normalized) service=exec
+ -1.4426 * (normalized) service=finger
+ -1.047 * (normalized) service=ftp
+ -1.4225 * (normalized) service=ftp_data
+ 2 * (normalized) service=gopher
+ 1 * (normalized) service=hostnames
+ -0.9961 * (normalized) service=http
+ 0.7255 * (normalized) service=http_443
+ 0.5128 * (normalized) service=imap4
+ -6.3664 * (normalized) service=IRC
+ 1 * (normalized) service=iso_tsap
+ -0 * (normalized) service=klogin
+ 0 * (normalized) service=kshell
+ 0.7422 * (normalized) service=ldap
+ 1 * (normalized) service=link
+ 0.5993 * (normalized) service=login
+ 1 * (normalized) service=mtp
+ 1 * (normalized) service=name
+ 0.2322 * (normalized) service=netbios_dgm
+ 0.213 * (normalized) service=netbios_ns
+ 0.1902 * (normalized) service=netbios_ssn
+ 1.1472 * (normalized) service=netstat
+ 0.0504 * (normalized) service=nnsp
+ 1.058 * (normalized) service=nntp
+ -1 * (normalized) service=ntp_u
+ -1.5344 * (normalized) service=other
+ 1.3595 * (normalized) service=pm_dump
+ 0.8355 * (normalized) service=pop_2
+ -2 * (normalized) service=pop_3
+ 0 * (normalized) service=printer
+ 1.051 * (normalized) service=private
+ -0.3082 * (normalized) service=red_i
+ 1.0034 * (normalized) service=remote_job
+ 1.0112 * (normalized) service=rje
+ -1.0454 * (normalized) service=shell
+ -1.6948 * (normalized) service=smtp
+ 0.1388 * (normalized) service=sql_net
+ -0.3438 * (normalized) service=ssh
+ 1 * (normalized) service=supdup
+ 0.8756 * (normalized) service=systat
+ -1.6856 * (normalized) service=telnet
+ -0 * (normalized) service=tim_i
+ -0.8579 * (normalized) service=time
+ -0.726 * (normalized) service=urh_i
+ -1.0285 * (normalized) service=urp_i
+ 1.0347 * (normalized) service=uucp
+ 0 * (normalized) service=uucp_path
+ 0 * (normalized) service=vmnet
+ 1 * (normalized) service=whois
+ -1.3388 * (normalized) service=X11
+ 0 * (normalized) service=Z39_50
+ 1.7882 * (normalized) flag=OTH
+ -3.0982 * (normalized) flag=REJ
+ -1.7279 * (normalized) flag=RSTO
+ 1 * (normalized) flag=RSTOS0
+ 2.4264 * (normalized) flag=RSTR
+ 1.5906 * (normalized) flag=S0
+ -1.952 * (normalized) flag=S1
+ -0.9628 * (normalized) flag=S2
+ -0.3455 * (normalized) flag=S3
+ 1.2757 * (normalized) flag=SF
+ 0.0054 * (normalized) flag=SH
+ 0.8742 * (normalized) src_bytes
+ 0.0542 * (normalized) dst_bytes
+ -1.2659 * (normalized) land=1
+ 2.7922 * (normalized) wrong_fragment
+ 0.0662 * (normalized) urgent
+ 8.1153 * (normalized) hot
+ 2.4822 * (normalized) num_failed_logins
+ 0.2242 * (normalized) logged_in=1
+ -0.0544 * (normalized) num_compromised
+ 0.9248 * (normalized) root_shell
+ -2.363 * (normalized) su_attempted
+ -0.2024 * (normalized) num_root
+ -1.2791 * (normalized) num_file_creations
+ -0.0314 * (normalized) num_shells
+ -1.4125 * (normalized) num_access_files
+ -0.0154 * (normalized) is_host_login=1
+ -2.3307 * (normalized) is_guest_login=1
+ 4.3191 * (normalized) count
+ -2.7484 * (normalized) srv_count
+ -0.6276 * (normalized) serror_rate
+ 2.843 * (normalized) srv_serror_rate
+ 0.6105 * (normalized) rerror_rate
+ 3.1388 * (normalized) srv_rerror_rate
+ -0.1262 * (normalized) same_srv_rate
+ -0.1825 * (normalized) diff_srv_rate
+ 0.2961 * (normalized) srv_diff_host_rate
+ 0.7812 * (normalized) dst_host_count
+ -1.0053 * (normalized) dst_host_srv_count
+ 0.0284 * (normalized) dst_host_same_srv_rate
+ 0.4419 * (normalized) dst_host_diff_srv_rate
+ 1.384 * (normalized) dst_host_same_src_port_rate
+ 0.8004 * (normalized) dst_host_srv_diff_host_rate
+ 0.2301 * (normalized) dst_host_serror_rate
+ 0.6401 * (normalized) dst_host_srv_serror_rate
+ 0.6422 * (normalized) dst_host_rerror_rate
+ 0.3692 * (normalized) dst_host_srv_rerror_rate
- 2.5266

Number of kernel evaluations: -1049600465

Output predictions - sample output

inst#     actual  predicted error prediction
1 1:normal 1:normal 1
2 1:normal 1:normal 1
3 2:anomaly 2:anomaly 1
4 1:normal 1:normal 1
5 1:normal 1:normal 1
6 2:anomaly 2:anomaly 1
7 2:anomaly 2:anomaly 1
8 2:anomaly 2:anomaly 1
9 2:anomaly 2:anomaly 1
10 2:anomaly 2:anomaly 1
11 2:anomaly 2:anomaly 1
12 2:anomaly 2:anomaly 1
13 1:normal 1:normal 1
14 2:anomaly 1:normal + 1
15 2:anomaly 2:anomaly 1
16 2:anomaly 2:anomaly 1
17 1:normal 1:normal 1
18 2:anomaly 2:anomaly 1
19 1:normal 1:normal 1
20 1:normal 1:normal 1
21 2:anomaly 2:anomaly 1
22 2:anomaly 2:anomaly 1
23 1:normal 1:normal 1
24 1:normal 1:normal 1
25 2:anomaly 2:anomaly 1
26 1:normal 1:normal 1
27 2:anomaly 2:anomaly 1
28 1:normal 1:normal 1
29 1:normal 1:normal 1
30 1:normal 1:normal 1
31 2:anomaly 2:anomaly 1
32 2:anomaly 2:anomaly 1
33 1:normal 1:normal 1
34 2:anomaly 2:anomaly 1
35 1:normal 1:normal 1
36 1:normal 1:normal 1
37 1:normal 1:normal 1
38 2:anomaly 2:anomaly 1
39 1:normal 1:normal 1
40 2:anomaly 2:anomaly 1
41 2:anomaly 2:anomaly 1
42 2:anomaly 2:anomaly 1
43 1:normal 1:normal 1
44 1:normal 1:normal 1
45 1:normal 1:normal 1
46 2:anomaly 2:anomaly 1
47 2:anomaly 2:anomaly 1
48 1:normal 1:normal 1
49 2:anomaly 1:normal + 1
50 2:anomaly 2:anomaly 1

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.986    0.039    0.967      0.986    0.976      0.948    0.973     0.960     normal
                 0.961    0.014    0.983      0.961    0.972      0.948    0.973     0.963     anomaly
Weighted Avg.    0.974    0.028    0.974      0.974    0.974      0.948    0.973     0.962

=== Confusion Matrix ===

     a     b   <-- classified as
 66389   954 |     a = normal
  2301 56329 |     b = anomaly

Best answer

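That output is your scoring function. Read each equals sign as a simple boolean operator that evaluates to 1 for true and 0 for false. Thus, of all the choices for a categorical attribute, only one coefficient affects the score.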
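Written out, the reading described above looks roughly like the following. The symbols are introduced here for illustration only and do not appear in the Weka output: $\tilde{x}_j$ is the normalized value of numeric attribute $j$, $w$ are the listed coefficients, and $b$ is the trailing constant.

```latex
f(x) \;=\; \sum_{j \in \text{numeric}} w_j \,\tilde{x}_j
      \;+\; \sum_{a \in \text{categorical}} \sum_{v} w_{a,v}\,\mathbf{1}[x_a = v]
      \;+\; b,
\qquad b = -2.5266
```

For a given instance, exactly one indicator $\mathbf{1}[x_a = v]$ per categorical attribute equals 1, which is why only one coefficient per categorical attribute contributes.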

For instance, consider only the first three attributes, with these normalized inputs and the resulting values:

duration       2.0     -0.0498 * 2.0 => -0.0996
protocol_type  icmp     0.1105
service        eco_i    1.1964

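Note that the other protocol_type and service terms, such as

-0.6236 * protocol_type=udp

have a comparison that evaluates to 0 (protocol_type=udp becomes 0), so those coefficients do not affect the sum.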

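From these three attributes, the score so far is the sum of the three terms: -0.0996 + 0.1105 + 1.1964 = 1.2073. Continue with the other 39 attributes, add the constant -2.5266 at the end, and that is the score for the vector.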
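The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not Weka code: the names `partial_score`, `NUMERIC_WEIGHTS` and `CATEGORICAL_WEIGHTS` are hypothetical, only a handful of coefficients are copied from the listing, and the inputs are assumed to be already normalized as in the example.

```python
# A few coefficients copied from the Weka listing (not the full model).
NUMERIC_WEIGHTS = {"duration": -0.0498}
CATEGORICAL_WEIGHTS = {
    ("protocol_type", "tcp"): 0.5131,
    ("protocol_type", "udp"): -0.6236,
    ("protocol_type", "icmp"): 0.1105,
    ("service", "eco_i"): 1.1964,
    ("service", "http"): -0.9961,
}


def partial_score(numeric, categorical):
    """Sum the weighted terms for the given (already normalized) inputs."""
    total = 0.0
    for name, value in numeric.items():
        total += NUMERIC_WEIGHTS[name] * value
    for (name, level), weight in CATEGORICAL_WEIGHTS.items():
        # "attribute=level" acts as an indicator: 1 if it matches, else 0.
        indicator = 1.0 if categorical.get(name) == level else 0.0
        total += weight * indicator
    return total


score = partial_score(
    {"duration": 2.0},
    {"protocol_type": "icmp", "service": "eco_i"},
)
print(round(score, 4))  # 1.2073
```

The udp, tcp and http terms are present in the table but contribute nothing, because their indicators are 0 for this instance.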

Does that explain it well enough?

<hr/>

The critical phrase in the blog you cited is:

if the output of the scoring function is negative then the input is classified as belonging to class y = -1. If the score is positive, the input is classified as belonging to class y = 1.

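Yes, it is that simple: implement that nice linear scoring function (42 variables, 116 terms) and plug in a vector. If the function is positive, the vector is normal; if it is negative, the vector is an anomaly. Which sign maps to which class depends on how the tool encodes the two labels, so confirm the direction against a few rows of the prediction output.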
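A minimal sketch of that decision rule follows. The helper names `full_score` and `classify` are hypothetical, and the default sign-to-label mapping simply follows the answer's statement; it is an assumption that should be checked against the tool's own predictions, which is why both labels are parameters.

```python
INTERCEPT = -2.5266  # trailing constant from the Weka listing


def full_score(weighted_terms, intercept=INTERCEPT):
    """Add the constant to the sum of all weighted attribute terms."""
    return sum(weighted_terms) + intercept


def classify(score, positive_label="normal", negative_label="anomaly"):
    """Map a score to a label by its sign.

    The default mapping is an assumption taken from the answer text.
    A score of exactly 0 is treated as negative here.
    """
    return positive_label if score > 0 else negative_label


# Illustration only: these are just the three terms from the worked example,
# not the full set of terms for a real instance.
example_score = full_score([-0.0996, 0.1105, 1.1964])
print(round(example_score, 4), classify(example_score))
```

If a check against known predictions shows the opposite direction, swapping the two label arguments is enough; the scoring function itself does not change.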

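Yes, your model is quite a bit longer than the blog's example. That example is based on two continuous features; you have 42 features, three of which are categorical (hence the extra 73 terms). The example has 3 support vectors; yours will have 43 (N dimensions require N+1 support vectors). However, even this 42-dimensional model works on the same principle: positive = normal, negative = anomaly.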

<hr/>

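As for your wish to map this to a two-dimensional display ... it is possible ... but I don't know what you would find meaningful in this instance. Mapping 42 variables down to 3 leaves our space very crowded. I have seen some nice tricks here and there, especially in gradient fields, where the force vector has the same spatial interpretation as the data points. A weather map manages to represent the x, y, z coordinates of a measurement and add wind velocity (3D), cloud cover, and perhaps a couple of other metrics to the display. That could be 10 symbolic dimensions.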

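In your case, we could perhaps drop the dimensions whose coefficients are smaller than 0.07 as insignificant; that saves 6 features. We could perhaps represent the three categorical features with color, dashed/dotted/solid symbols, and a tiny text overlay on the O or X (normal/anomaly data). That is down 9 without using Cartesian position (x, y, z coordinates, assuming the plot makes sense in 3D).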
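The thresholding idea can be sketched as below. This is only an illustration under stated assumptions: `split_by_magnitude` is a hypothetical helper, the dictionary holds a subset of the numeric coefficients from the listing, and comparing magnitudes across attributes is reasonable here only because the inputs are normalized to a common scale.

```python
# Subset of numeric coefficients copied from the Weka listing.
WEIGHTS = {
    "duration": -0.0498,
    "src_bytes": 0.8742,
    "dst_bytes": 0.0542,
    "wrong_fragment": 2.7922,
    "urgent": 0.0662,
    "hot": 8.1153,
    "num_compromised": -0.0544,
    "num_shells": -0.0314,
    "count": 4.3191,
}


def split_by_magnitude(weights, threshold=0.07):
    """Separate attributes whose |coefficient| falls below the threshold."""
    kept = {k: w for k, w in weights.items() if abs(w) >= threshold}
    dropped = {k: w for k, w in weights.items() if abs(w) < threshold}
    return kept, dropped


kept, dropped = split_by_magnitude(WEIGHTS)
print("dropped:", sorted(dropped))
print("kept:", sorted(kept))
```

A small coefficient only says the attribute barely moves this particular linear score; whether it is safe to leave out of a plot is still a judgment call.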

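However, I don't know your data well enough to suggest how we might stuff the remaining 33 features into 2 or 3 dimensions. Can you combine any of these inputs somehow? Does a linear combination of several features give a result that is still meaningful for the prediction?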

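If not, then we are left with the canonical approach: pick interesting combinations of features (usually pairs) and draw one graph for each, ignoring the other features entirely. If none of these makes visual sense ... then we have our answer: no, we cannot plot the data well. Sorry, but reality often does this to us in complex environments, and we handle the data with tables, correlations, and other methods that our 3D minds can manage.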
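One way to shortlist pairs for such plots is sketched below. The answer does not say how to choose "interesting" features; ranking by absolute coefficient is an assumption made here for illustration, and `candidate_pairs` is a hypothetical helper. Each resulting pair would then be drawn as its own scatter plot with whatever plotting tool is at hand.

```python
from itertools import combinations

# A few of the larger numeric coefficients from the Weka listing.
WEIGHTS = {
    "hot": 8.1153,
    "count": 4.3191,
    "srv_rerror_rate": 3.1388,
    "srv_serror_rate": 2.843,
}


def candidate_pairs(weights, top_n=3):
    """Return all pairs among the top_n attributes ranked by |coefficient|."""
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
    return list(combinations(ranked[:top_n], 2))


pairs = candidate_pairs(WEIGHTS)
for x_name, y_name in pairs:
    print(x_name, "vs", y_name)
```

With `top_n` attributes this yields `top_n * (top_n - 1) / 2` plots, so the shortlist needs to stay small to remain readable.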

For "machine-learning - SVM: scoring function", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35456621/
