machine-learning - sklearn oneclass svm KeyError

Reposted by: 行者123. Last updated: 2023-11-30 09:39:17

My dataset is a set of system calls from malware and benign programs. I have preprocessed it, and it now looks like this:

NtQueryPerformanceCounter
NtProtectVirtualMemory
NtProtectVirtualMemory
NtQuerySystemInformation
NtQueryVirtualMemory
NtQueryVirtualMemory
NtProtectVirtualMemory
NtOpenKey
NtOpenKey
NtOpenKey
NtQuerySecurityAttributesToken
NtQuerySecurityAttributesToken
NtQuerySystemInformation
NtQuerySystemInformation
NtAllocateVirtualMemory
NtFreeVirtualMemory

Now I extract features with TF-IDF, using n-grams to build sequences of the calls:

from __future__ import print_function

import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import shuffle
from sklearn.svm import OneClassSVM

nGRAM1 = 8
nGRAM2 = 10
weight = 4

main_corpus_MAL = []
main_corpus_target_MAL = []
main_corpus_BEN = []
main_corpus_target_BEN = []

my_categories = ['benign', 'malware']

# feeding corpus the testing data

print("Loading system call database for categories:")
print(my_categories if my_categories else "all")

import glob
import os

malCOUNT = 0
benCOUNT = 0
for filename in glob.glob(os.path.join('C:\\Users\\alika\\Documents\\testingSVM\\sysMAL', '*.txt')):
    fMAL = open(filename, "r")
    aggregate = ""
    for line in fMAL:
        linea = line[:(len(line)-1)]
        aggregate += " " + linea
    main_corpus_MAL.append(aggregate)
    main_corpus_target_MAL.append(1)
    malCOUNT += 1

for filename in glob.glob(os.path.join('C:\\Users\\alika\\Documents\\testingSVM\\sysBEN', '*.txt')):
    fBEN = open(filename, "r")
    aggregate = ""
    for line in fBEN:
        linea = line[:(len(line) - 1)]
        aggregate += " " + linea
    main_corpus_BEN.append(aggregate)
    main_corpus_target_BEN.append(0)
    benCOUNT += 1

# weight as determined in the top of the code
train_corpus = main_corpus_BEN[:(weight*len(main_corpus_BEN)//(weight+1))]
train_corpus_target = main_corpus_target_BEN[:(weight*len(main_corpus_BEN)//(weight+1))]
test_corpus = main_corpus_MAL[(len(main_corpus_MAL)-(len(main_corpus_MAL)//(weight+1))):]
test_corpus_target = main_corpus_target_MAL[(len(main_corpus_MAL)-len(main_corpus_MAL)//(weight+1)):]

def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

# size of datasets
train_corpus_size_mb = size_mb(train_corpus)
test_corpus_size_mb = size_mb(test_corpus)

print("%d documents - %0.3fMB (training set)" % (
    len(train_corpus_target), train_corpus_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(test_corpus_target), test_corpus_size_mb))
print("%d categories" % len(my_categories))
print()
print("Benign Traces: "+str(benCOUNT)+" traces")
print("Malicious Traces: "+str(malCOUNT)+" traces")
print()

print("Extracting features from the training data using a sparse vectorizer...")
t0 = time()

vectorizer = TfidfVectorizer(ngram_range=(nGRAM1, nGRAM2), min_df=1, use_idf=True, smooth_idf=True) ##############

analyze = vectorizer.build_analyzer()

X_train = vectorizer.fit_transform(train_corpus)

duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, train_corpus_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test data using the same vectorizer...")
t0 = time()
X_test = vectorizer.transform(test_corpus)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, test_corpus_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()

The output is:

Loading system call database for categories:
['benign', 'malware']
177 documents - 45.926MB (training set)
44 documents - 12.982MB (test set)
2 categories

Benign Traces: 72 traces
Malicious Traces: 150 traces

Extracting features from the training data using a sparse vectorizer...
done in 7.831695s at 5.864MB/s
n_samples: 177, n_features: 603170

Extracting features from the test data using the same vectorizer...
done in 1.624100s at 7.993MB/s
n_samples: 44, n_features: 603170

Now, for the learning part, I am trying to use sklearn's OneClassSVM:

print("==================\n")
print("Training: ")
classifier = OneClassSVM(kernel='linear', gamma='auto')
classifier.fit(X_test)

fraud_pred = classifier.predict(X_test)

unique, counts = np.unique(fraud_pred, return_counts=True)
print (np.asarray((unique, counts)).T)

fraud_pred = pd.DataFrame(fraud_pred)
fraud_pred= fraud_pred.rename(columns={0: 'prediction'})
main_corpus_target = pd.DataFrame(main_corpus_target)
main_corpus_target= main_corpus_target.rename(columns={0: 'Category'})

This is the output of fraud_pred and main_corpus_target:

prediction
0   1
1  -1
2   1
3   1
4   1
5  -1
6   1
7  -1
...
30 rows * 1 column
====================
Category
0   1
1   1
2   1
3   1
4   1
...
217 0
218 0
219 0
220 0
221 0
222 rows * 1 column

But when I try to compute TP, TN, FP and FN:

##Performance check of the model

TP = FN = FP = TN = 0
for j in range(len(main_corpus_target)):
    if main_corpus_target['Category'][j]== 0 and fraud_pred['prediction'][j] == 1:
        TP = TP+1
    elif main_corpus_target['Category'][j]== 0 and fraud_pred['prediction'][j] == -1:
        FN = FN+1
    elif main_corpus_target['Category'][j]== 1 and fraud_pred['prediction'][j] == 1:
        FP = FP+1
    else:
        TN = TN +1
print (TP,  FN,  FP,  TN)

I get this error:

KeyError                                  Traceback (most recent call last)
<ipython-input-32-1046cc75ba83> in <module>
      7     elif main_corpus_target['Category'][j]== 0 and fraud_pred['prediction'][j] == -1:
      8         FN = FN+1
----> 9     elif main_corpus_target['Category'][j]== 1 and fraud_pred['prediction'][j] == 1:
     10         FP = FP+1
     11     else:

c:\users\alika\appdata\local\programs\python\python36\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
   1069         key = com.apply_if_callable(key, self)
   1070         try:
-> 1071             result = self.index.get_value(self, key)
   1072 
   1073             if not is_scalar(result):

c:\users\alika\appdata\local\programs\python\python36\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
   4728         k = self._convert_scalar_indexer(k, kind="getitem")
   4729         try:
-> 4730             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4731         except KeyError as e1:
   4732             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 30

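1) I know the error occurs because the code tries to access a key that does not exist, but I can't just insert some numbers into fraud_pred to work around it. Any suggestions?
2) What am I doing wrong? Do the two not match?
3) I want to compare the results with other one-class classification algorithms. Given my approach, which ones would be the best to use?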

Best answer

Edit: before computing the metrics:

You can replace the separate fit and predict calls with:

fraud_pred = classifier.fit_predict(X_test)
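
As a minimal sketch of that change, the snippet below runs fit_predict on a small random matrix. X_demo is a hypothetical stand-in for X_test, not the data from the question; the point is only that the call returns one label per input row, +1 for inliers and -1 for outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical dense stand-in for X_test: 20 samples, 5 features.
X_demo = np.random.RandomState(0).rand(20, 5)

classifier = OneClassSVM(kernel="linear")

# fit_predict fits the model and labels the same samples in one call:
# +1 for inliers, -1 for outliers.
fraud_pred = classifier.fit_predict(X_demo)

print(fraud_pred.shape)
```

The length of fraud_pred always equals the number of rows passed in, which is why the labels used for evaluation must come from the same subset.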

Also, your main_corpus_target and X_test should have the same length. Could you post the code where main_corpus_target is created?

It is created right after the benCOUNT += 1 line:

main_corpus_target = main_corpus_target_MAL
main_corpus_target.extend(main_corpus_target_BEN)

This means you are building a main_corpus_target that contains both MAL and BEN, and the error you get is:

ValueError: Found input variables with inconsistent numbers of samples: [30, 222]

fraud_pred has 30 samples, so you should evaluate it against an array of 30 labels. main_corpus_target contains 222.
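
The KeyError from the question can be reproduced with a small sketch. The frames below are hypothetical placeholders that only mimic the sizes reported in the question (30 predictions, 222 labels); the values themselves are made up.

```python
import pandas as pd

# Hypothetical stand-ins: 30 predictions and 222 labels (sizes from the question).
fraud_pred = pd.DataFrame({"prediction": [1, -1] * 15})
main_corpus_target = pd.DataFrame({"Category": [1] * 150 + [0] * 72})

failed_at = None
try:
    # Index labels 0..29 exist in both frames; label 30 exists only in the targets.
    for j in range(len(main_corpus_target)):
        _ = fraud_pred["prediction"][j]
except KeyError as exc:
    failed_at = exc.args[0]

print(len(fraud_pred), len(main_corpus_target), failed_at)
```

The loop runs over the longer frame, so the first lookup past the end of fraud_pred raises the error. Padding fraud_pred would hide the problem rather than fix it; the labels have to be restricted to the same 30 samples.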

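Looking at your code, I see that you want to evaluate X_test, which is derived from test_corpus (X_test = vectorizer.transform(test_corpus)). It is better to compare the results with test_corpus_target, which is the target variable for that dataset and also has length 30. These two lines of yours should produce outputs of the same length: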

test_corpus = main_corpus_MAL[(len(main_corpus_MAL)-(len(main_corpus_MAL)//(weight+1))):]
test_corpus_target = main_corpus_target_MAL[(len(main_corpus_MAL)-len(main_corpus_MAL)//(weight+1)):]
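
One way to keep the two slices in step is to compute the split point once and reuse it. The sketch below assumes a placeholder corpus of 150 malware traces (the count printed in the question); the trace strings are hypothetical.

```python
weight = 4  # same value as in the question

# Hypothetical placeholder corpus: 150 malware traces, all labelled 1.
main_corpus_MAL = ["trace %d" % i for i in range(150)]
main_corpus_target_MAL = [1] * len(main_corpus_MAL)

# Compute the split point once so the two slices cannot drift apart.
n_test = len(main_corpus_MAL) // (weight + 1)
split = len(main_corpus_MAL) - n_test

test_corpus = main_corpus_MAL[split:]
test_corpus_target = main_corpus_target_MAL[split:]

print(len(test_corpus), len(test_corpus_target))
```

With 150 traces and weight = 4, both slices hold the last 30 items, matching the 30 predictions reported in the question.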

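May I ask why you are computing TP, TN, ... yourself?

There is a faster option:

Convert the fraud_pred series, replacing -1 with 0.
Use the confusion matrix function that sklearn offers.
Use ravel to extract the values of the confusion matrix.

Example, after converting -1 to 0: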

from sklearn.metrics import confusion_matrix
# confusion_matrix expects y_true first, then y_pred; both must have the same length
tn, fp, fn, tp = confusion_matrix(main_corpus_target['Category'].values, fraud_pred).ravel()

Also, if you are using a recent pandas version (0.24 or later, where Series.to_numpy() is available):

from sklearn.metrics import confusion_matrix
# confusion_matrix expects y_true first, then y_pred; both must have the same length
tn, fp, fn, tp = confusion_matrix(main_corpus_target['Category'].to_numpy(), fraud_pred).ravel()
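
Putting the three steps together, a self-contained sketch could look like this. The arrays are hypothetical values standing in for classifier.fit_predict(X_test) and test_corpus_target, and the sketch assumes the question's coding (1 = malware) with +1 from the model treated as the positive class. Passing labels=[0, 1] is an addition here, not part of the original answer; it keeps the matrix 2x2 even when the true labels contain only one class, as they do for an all-malware test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical values standing in for classifier.fit_predict(X_test)
# and test_corpus_target (all malware, labelled 1).
fraud_pred = np.array([1, -1, 1, 1, 1, -1, 1, -1])
test_corpus_target = np.array([1] * 8)

# Step 1: map the outlier label -1 to 0 so both arrays use the {0, 1} coding.
pred01 = np.where(fraud_pred == -1, 0, fraud_pred)

# Steps 2 and 3: y_true first, then y_pred; labels=[0, 1] forces a 2x2 matrix
# even when one class is missing from y_true, so ravel() always yields 4 values.
tn, fp, fn, tp = confusion_matrix(test_corpus_target, pred01, labels=[0, 1]).ravel()

print(tn, fp, fn, tp)
```

Without the labels argument, a single-class input can produce a 1x1 matrix, and unpacking into four names would fail.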

Regarding machine-learning - sklearn oneclass svm KeyError, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59966570/
