
python - Inconsistent results between predict() and predict_proba() when using scikit-learn for multiclass text classification


I am working on a multiclass text classification problem where the model must return the top 5 matches rather than just the single best match. "Success" is therefore defined as at least one of the top 5 matches being the correct classification, and under this definition the algorithm must achieve a success rate of at least 95%. Naturally, we train the model on a subset of the data and test on the remaining subset to validate its success.

I have been using scikit-learn's predict_proba() in Python to select the top 5 matches, and I compute the success rate with a custom script (below) that seems to work fine on my sample data. However, I noticed that on my own data the top-5 success rate comes out lower than the top-1 success rate obtained with .predict(), which is mathematically impossible: the top-ranked result is automatically included in the top 5 results, so the top-5 success rate must be at least equal to the top-1 rate, if not higher. To troubleshoot, I compare the top-1 success rates from predict() and predict_proba() to make sure they are equal, and I check that the top-5 success rate is greater than or equal to the top-1 rate.

I have set up the script below to walk you through my logic, so you can see whether I have made a wrong assumption somewhere or whether there is a problem with my data that needs fixing. I am testing many classifiers and feature sets, but for simplicity you will see that I just use count vectors as features and logistic regression as the classifier, since (as far as I can tell) they are not part of the problem. I would greatly appreciate any insight into why I am seeing this discrepancy.
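(Side note: on scikit-learn 0.24 or newer, metrics.top_k_accuracy_score computes this same top-k success definition directly; a minimal sketch, assuming an already-fitted classifier clf and the count-vector features built below:)

from sklearn.metrics import top_k_accuracy_score

# fraction of rows whose true label is among the 5 highest-probability classes
probas = clf.predict_proba(X_test_counts)
print(top_k_accuracy_score(valid_y, probas, k=5, labels=clf.classes_))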

Code:

# Set up environment
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
import numpy as np

#Read in data and do just a bit of preprocessing

# User's Location of git repository
Git_Location = 'C:/Documents'

# Set Data Location:
data = Git_Location + 'Data.csv'

# load the data
df = pd.read_csv(data,low_memory=False,thousands=',', encoding='latin-1')
df = df[['CODE','Description']] #select only these columns
df = df.rename(index=float, columns={"CODE": "label", "Description": "text"})

#Convert label to float so you don't need to encode for processing later on
df['label'] = df['label'].str.replace('-', '', regex=True).str.strip()
df['label'] = df['label'].astype('float64')  # astype raises on bad values by default (raise_on_error was removed from pandas)

# drop any labels with count LT 500 to build a strong model and make our testing run faster -- we will get more data later
df = df.groupby('label').filter(lambda x : len(x)>500)

#split data into testing and training
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df.text, df.label,test_size=0.33, random_state=6,stratify=df.label)

# Other examples online use the following data types... we will do the same to remain consistent
train_y_npar = pd.Series(train_y).values
train_x_list = pd.Series.tolist(train_x)
valid_x_list = pd.Series.tolist(valid_x)

# cast validation datasets to dataframes to allow merging later on
valid_x_df = pd.DataFrame(valid_x)
valid_y_df = pd.DataFrame(valid_y)


# Extracting features from data
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_x_list)
X_test_counts = count_vect.transform(valid_x_list)

# Define the model training and validation function
def TV_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):

# fit the training dataset on the classifier
classifier.fit(feature_vector_train, label)

# predict the top n labels on validation dataset
n = 5
#classifier.probability = True
probas = classifier.predict_proba(feature_vector_valid)
predictions = classifier.predict(feature_vector_valid)

#Identify the indexes of the top predictions
top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]

#then find the associated SOC code for each prediction
top_class = classifier.classes_[top_n_predictions]

#cast to a new dataframe
top_class_df = pd.DataFrame(data=top_class)

#merge it up with the validation labels and descriptions
results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
results = pd.merge(results, top_class_df, left_index=True, right_index=True)


top5_conditions = [
(results.iloc[:,0] == results[0]),
(results.iloc[:,0] == results[1]),
(results.iloc[:,0] == results[2]),
(results.iloc[:,0] == results[3]),
(results.iloc[:,0] == results[4])]
top5_choices = [1, 1, 1, 1, 1]

#Top 1 Result
#top1_conditions = [(results['0_x'] == results[4])]
top1_conditions = [(results.iloc[:,0] == results[4])]
top1_choices = [1]

# Create the success columns
results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)

print("Are Top 5 Results greater than Top 1 Result?: ", (sum(results['Top 5 Successes'])/results.shape[0])>(metrics.accuracy_score(valid_y, predictions)))
print("Are Top 1 Results equal from predict() and predict_proba()?: ", (sum(results['Top 1 Successes'])/results.shape[0])==(metrics.accuracy_score(valid_y, predictions)))

print(" ")
print("Details: ")
print("Top 5 Accuracy Rate (predict_proba)= ", sum(results['Top 5 Successes'])/results.shape[0])
print("Top 1 Accuracy Rate (predict_proba)= ", sum(results['Top 1 Successes'])/results.shape[0])
print("Top 1 Accuracy Rate = (predict)=", metrics.accuracy_score(valid_y, predictions))

Example output using the twenty newsgroups dataset built into scikit-learn (this is my target). Note: I ran this exact code on another dataset and was able to produce these results, which tells me the function and its dependencies work, so the problem must somehow be in the data.

Are Top 5 Results greater than Top 1 Result?:  True 
Are Top 1 Results equal from predict() and predict_proba()?: True

Details:

Top 5 Accuracy Rate (predict_proba)=  0.9583112055231015 
Top 1 Accuracy Rate (predict_proba)= 0.8069569835369091
Top 1 Accuracy Rate (predict)= 0.8069569835369091
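(For anyone reproducing this: a minimal sketch, with assumed column names matching the script above, of loading the built-in twenty newsgroups data into the same df shape:)

from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# load the training split as (label, text) pairs, mirroring the CSV layout above
news = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({'label': news.target, 'text': news.data})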

Now running it on my data:

TV_model(LogisticRegression(), X_train_counts, train_y_npar, X_test_counts, valid_y_df, valid_x_df)

Output:

Are Top 5 Results greater than Top 1 Result?:  False 
Are Top 1 Results equal from predict() and predict_proba()?: False

Details:

Top 5 Accuracy Rate (predict_proba)= 0.6581632653061225
Top 1 Accuracy Rate (predict_proba)= 0.2010204081632653
Top 1 Accuracy Rate (predict)= 0.8091187478734263

Best Answer

Update: found the solution! It turns out the indexes had gotten out of sync: train_test_split keeps the original row indexes on the validation Series, while the top-class DataFrame built inside the function gets a fresh 0-based index, so the index-based merges paired rows incorrectly. All I needed to do was reset the validation dataset indexes right after the train/test split.
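(A minimal, self-contained sketch with made-up data of why the index-based merge silently misaligns rows, and why reset_index(drop=True) fixes it:)

import pandas as pd

# after train_test_split, the validation Series keeps its original row labels...
valid_y = pd.Series(['A', 'B', 'C'], index=[7, 2, 5], name='label')
# ...while the top-predictions DataFrame gets a fresh 0-based index
top_class_df = pd.DataFrame({0: ['A', 'B', 'C']})  # row order matches valid_y

# merging on index keeps only the overlapping index 2, paired with the wrong prediction
bad = pd.merge(valid_y.to_frame(), top_class_df, left_index=True, right_index=True)
print(bad)  # one misaligned row: label 'B' vs prediction 'C'

# resetting the index restores positional alignment
good = pd.merge(valid_y.reset_index(drop=True).to_frame(), top_class_df, left_index=True, right_index=True)
print(good)  # three rows, correctly paired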

Updated code:

# Set up environment
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
import numpy as np

#Read in data and do just a bit of preprocessing

# User's Location of git repository
Git_Location = 'C:/Documents'

# Set Data Location:
data = Git_Location + 'Data.csv'

# load the data
df = pd.read_csv(data,low_memory=False,thousands=',', encoding='latin-1')
df = df[['CODE','Description']] #select only these columns
df = df.rename(index=float, columns={"CODE": "label", "Description": "text"})

#Convert label to float so you don't need to encode for processing later on
df['label'] = df['label'].str.replace('-', '', regex=True).str.strip()
df['label'] = df['label'].astype('float64')  # astype raises on bad values by default (raise_on_error was removed from pandas)

# drop any labels with count LT 500 to build a strong model and make our testing run faster -- we will get more data later
df = df.groupby('label').filter(lambda x : len(x)>500)

#split data into testing and training
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df.text, df.label,test_size=0.33, random_state=6,stratify=df.label)

#reset the index so the validation rows align positionally with the predictions dataframe
valid_y = valid_y.reset_index(drop=True)
valid_x = valid_x.reset_index(drop=True)

# Other examples online use the following data types... we will do the same to remain consistent
train_y_npar = pd.Series(train_y).values
train_x_list = pd.Series.tolist(train_x)
valid_x_list = pd.Series.tolist(valid_x)

# cast validation datasets to dataframes to allow merging later on
valid_x_df = pd.DataFrame(valid_x)
valid_y_df = pd.DataFrame(valid_y)


# Extracting features from data
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_x_list)
X_test_counts = count_vect.transform(valid_x_list)

# Define the model training and validation function
def TV_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):

    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the top n labels on the validation dataset
    n = 5
    #classifier.probability = True
    probas = classifier.predict_proba(feature_vector_valid)
    predictions = classifier.predict(feature_vector_valid)

    # Identify the indexes of the top predictions (argsort is ascending, so the last column holds the best class)
    top_n_predictions = np.argsort(probas, axis=1)[:, -n:]

    # then find the associated SOC code for each prediction
    top_class = classifier.classes_[top_n_predictions]

    # cast to a new dataframe
    top_class_df = pd.DataFrame(data=top_class)

    # merge it up with the validation labels and descriptions
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_class_df, left_index=True, right_index=True)

    top5_conditions = [
        (results.iloc[:, 0] == results[0]),
        (results.iloc[:, 0] == results[1]),
        (results.iloc[:, 0] == results[2]),
        (results.iloc[:, 0] == results[3]),
        (results.iloc[:, 0] == results[4])]
    top5_choices = [1, 1, 1, 1, 1]

    # Top 1 Result (column 4 holds the highest-probability class)
    #top1_conditions = [(results['0_x'] == results[4])]
    top1_conditions = [(results.iloc[:, 0] == results[4])]
    top1_choices = [1]

    # Create the success columns
    results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
    results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)

    print("Are Top 5 Results greater than Top 1 Result?: ", (sum(results['Top 5 Successes'])/results.shape[0]) > (metrics.accuracy_score(valid_y, predictions)))
    print("Are Top 1 Results equal from predict() and predict_proba()?: ", (sum(results['Top 1 Successes'])/results.shape[0]) == (metrics.accuracy_score(valid_y, predictions)))

    print(" ")
    print("Details: ")
    print("Top 5 Accuracy Rate (predict_proba)= ", sum(results['Top 5 Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate (predict_proba)= ", sum(results['Top 1 Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate (predict)= ", metrics.accuracy_score(valid_y, predictions))

Regarding "python - Inconsistent results between predict() and predict_proba() when using scikit-learn for multiclass text classification", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54972802/
