machine-learning - One-class SVM algorithm takes too long

Reposted by: 行者123 · Updated: 2023-12-03 14:53:56

The data below shows part of my dataset, which I use for anomaly detection.

    describe_file   data_numbers    index
0   gkivdotqvj      7309.0          0
1   hpwgzodlky      2731.0          1
2   dgaecubawx      0.0             2
3   NaN             0.0             3
4   lnpeyxsrrc      0.0             4

I used a one-class SVM algorithm to detect the anomalies:

import numpy as np
from scipy import stats
from pyod.models.ocsvm import OCSVM

random_state = np.random.RandomState(42)
outliers_fraction = 0.05
classifiers = {
    'One Classify SVM (SVM)': OCSVM(
        kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, nu=0.5,
        shrinking=True, cache_size=200, verbose=False, max_iter=-1,
        contamination=outliers_fraction,
    )
}

X = data['data_numbers'].values.reshape(-1,1)   

for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1

    # prediction of a datapoint category outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)

    # explicit copy of the dataframe, so the new column is not written to a view of `data`
    dfx = data[['index', 'data_numbers']].copy()
    dfx['outlier'] = y_pred.tolist()
    IX1 =  np.array(dfx['data_numbers'][dfx['outlier'] == 0]).reshape(-1,1)
    OX1 =  dfx['data_numbers'][dfx['outlier'] == 1].values.reshape(-1,1)         
    print('OUTLIERS : ',n_outliers,'INLIERS : ',n_inliers, clf_name)    
    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred,100 * outliers_fraction) 

tOut = stats.scoreatpercentile(dfx[dfx['outlier'] == 1]['data_numbers'], np.abs(threshold))

y = dfx['outlier'].values.reshape(-1,1)
def severity_validation():
    tOUT10 = tOut+(tOut*0.10)    
    tOUT23 = tOut+(tOut*0.23)
    tOUT45 = tOut+(tOut*0.45)
    dfx['test_severity'] = "None"
    for i, row in dfx.iterrows():
        if row['outlier'] == 1:
            # use .loc instead of chained indexing so the assignment reaches dfx itself
            if row['data_numbers'] <= tOUT10:
                dfx.loc[i, 'test_severity'] = "Low Severity"
            elif row['data_numbers'] <= tOUT23:
                dfx.loc[i, 'test_severity'] = "Medium Severity"
            elif row['data_numbers'] <= tOUT45:
                dfx.loc[i, 'test_severity'] = "High Severity"
            else:
                dfx.loc[i, 'test_severity'] = "Ultra High Severity"

severity_validation()

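from shutil import rmtree
from tempfile import mkdtemp

import matplotlib.pyplot as plt
import sklearn as skl
from joblib import Memory
from sklearn import metrics, preprocessing, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
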
X_train, X_test, Y_train, Y_test = train_test_split(dfx[['index','data_numbers']], dfx.outlier, test_size=0.25, 
                                                    stratify=dfx.outlier, random_state=30)

#Instantiate Classifier
normer = preprocessing.Normalizer()
svm1 = svm.SVC(probability=True, class_weight={1: 10})

cached = mkdtemp()
memory = Memory(location=cached, verbose=3)  # `cachedir` was replaced by `location` in joblib
pipe_1 = Pipeline(steps=[('normalization', normer), ('svm', svm1)], memory=memory)

cv = skl.model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

param_grid = [ {"svm__kernel": ["linear"], "svm__C": [0.5]}, {"svm__kernel": ["rbf"], "svm__C": [0.5], "svm__gamma": [5]} ]
grd = GridSearchCV(pipe_1, param_grid, scoring='roc_auc', cv=cv)

#Training
y_pred = grd.fit(X_train, Y_train).predict(X_test)
rmtree(cached)

#Evaluation
confmatrix = skl.metrics.confusion_matrix(Y_test, y_pred)
print(confmatrix)
Y_pred = grd.predict_proba(X_test)[:, 1]  # reuse the grid search fitted above instead of refitting
def plot_roc(y_test, y_pred):
    fpr, tpr, thresholds = skl.metrics.roc_curve(y_test, y_pred, pos_label=1)
    roc_auc = skl.metrics.auc(fpr, tpr)
    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area ={0:.2f})'.format(roc_auc))
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

plot_roc(Y_test, Y_pred)

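My dataset is large, with millions of rows. As a result, I can only run the code on a few hundred thousand rows.
The code works fine, but it takes too long, so I would appreciate any suggestions for optimizing it to run faster.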

Best answer

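SVM training time scales poorly with the number of samples, typically O(n^2) or worse. It is therefore not suitable for datasets with millions of samples. Some example code for exploring this can be found here.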
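
To get a feel for this scaling on your own machine, a rough timing sketch along these lines may help. It uses scikit-learn's OneClassSVM on synthetic single-feature data; the helper name time_ocsvm_fit, the sample sizes, and the nu value are assumptions made for illustration, not taken from the question. Absolute timings depend on hardware and on the data, so only the growth trend is informative.

```python
import time

import numpy as np
from sklearn.svm import OneClassSVM


def time_ocsvm_fit(n_samples, seed=0):
    """Return the seconds needed to fit a one-class SVM on n_samples synthetic rows."""
    rng = np.random.RandomState(seed)
    # Synthetic single-feature data standing in for the 'data_numbers' column.
    X = rng.normal(size=(n_samples, 1))
    model = OneClassSVM(kernel="rbf", gamma="auto", nu=0.5)
    start = time.perf_counter()
    model.fit(X)
    return time.perf_counter() - start


# Doubling the sample count each time makes super-linear growth easy to spot.
sizes = [500, 1000, 2000, 4000]
timings = {n: time_ocsvm_fit(n) for n in sizes}
for n, seconds in timings.items():
    print(f"n={n:>5}: {seconds:.3f} s")
```

If the fit time roughly quadruples each time the sample count doubles, you are seeing the quadratic behaviour described above. Going from a few thousand rows to millions then multiplies the cost by several orders of magnitude.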

I suggest trying IsolationForest instead; it is fast and performs well.
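
A minimal sketch of that swap, assuming scikit-learn's IsolationForest and synthetic data in place of the real 'data_numbers' column (the value distributions below are made up for illustration). The contamination value mirrors the outliers_fraction used in the question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Synthetic stand-in for 'data_numbers': mostly small values plus a few large ones.
values = np.concatenate([
    rng.normal(100.0, 10.0, 9500),
    rng.normal(5000.0, 500.0, 500),
])
X = values.reshape(-1, 1)

forest = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # same role as outliers_fraction in the question
    random_state=42,
    n_jobs=-1,           # build the trees on all available cores
)
forest.fit(X)

# scikit-learn returns -1 for outliers and 1 for inliers;
# convert to the PyOD convention used in the question (1 = outlier, 0 = inlier).
is_outlier = (forest.predict(X) == -1).astype(int)
# score_samples is higher for normal points, so negate it to get an anomaly score.
scores = -forest.score_samples(X)

print("OUTLIERS:", int(is_outlier.sum()), "INLIERS:", int((is_outlier == 0).sum()))
```

By default each tree is built on a small random subsample of the rows, which is a large part of why the fit stays cheap as the dataset grows.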

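If you want to keep using an SVM, subsample your dataset down to 10-100k samples. A linear kernel also trains considerably faster than RBF, but it still scales poorly with a large number of samples.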
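
The subsampling idea could look roughly like this: fit on a random subset, then label the full dataset in chunks. This is a sketch under assumptions. It uses scikit-learn's OneClassSVM rather than the PyOD wrapper, synthetic data, nu=0.05 chosen to match the question's outlier fraction, and a hypothetical helper predict_in_chunks. The sizes are kept small so the example runs quickly.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
# Synthetic stand-in for the full dataset (one numeric feature).
X_full = rng.normal(size=(100_000, 1))

# Fit on a random subsample only; training cost depends on this size, not on len(X_full).
sample_size = 5_000
idx = rng.choice(len(X_full), size=sample_size, replace=False)
model = OneClassSVM(kernel="rbf", gamma="auto", nu=0.05)
model.fit(X_full[idx])


def predict_in_chunks(fitted_model, X, chunk_size=25_000):
    """Label all rows chunk by chunk to keep memory use bounded."""
    parts = [
        fitted_model.predict(X[start:start + chunk_size])
        for start in range(0, len(X), chunk_size)
    ]
    return np.concatenate(parts)


labels = predict_in_chunks(model, X_full)
# scikit-learn uses -1 for outliers; convert to 1 = outlier, 0 = inlier.
is_outlier = (labels == -1).astype(int)
print("outlier fraction:", is_outlier.mean())
```

Prediction cost grows with the number of support vectors, and nu is a lower bound on the fraction of training points that become support vectors. The nu=0.5 in the question therefore makes scoring millions of rows expensive as well as training.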

Regarding machine-learning - One-class SVM algorithm takes too long, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60724226/
