
python - Inconsistent results between predict() and predict_proba() when using scikit-learn for multiclass text classification


I am working on a multiclass text classification problem where the model must return the top 5 matches rather than just the single best match. "Success" is therefore defined as at least one of the top 5 matches being the correct classification, and under this definition the algorithm must achieve a success rate of at least 95%. Naturally, we train the model on a subset of the data and test on the remaining subset to validate its success.

I have been using scikit-learn's predict_proba() in Python to select the top 5 matches, and I compute the success rate with a custom script (below) that seems to work fine on my sample data. However, I noticed that on my own data the top-5 success rate comes out lower than the top-1 success rate obtained with .predict(), which is mathematically impossible: the top-ranked result is automatically included in the top 5 results, so the top-5 success rate must be at least equal to the top-1 rate, if not higher. To troubleshoot, I compare the top-1 success rates from predict() and predict_proba() to make sure they are equal, and I check that the top-5 success rate is greater than or equal to the top-1 rate.

I have set up the script below to walk you through my logic, so you can see whether I have made a wrong assumption somewhere or whether there is a problem with my data that needs fixing. I am testing many classifiers and feature sets, but for simplicity you will see that I just use count vectors as features and logistic regression as the classifier, since (as far as I can tell) they are not part of the problem. I would greatly appreciate any insight into why I am seeing this discrepancy.
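(Side note: on scikit-learn 0.24 or newer, metrics.top_k_accuracy_score computes this same top-k success definition directly; a minimal sketch, assuming an already-fitted classifier clf and the count-vector features built below:)

from sklearn.metrics import top_k_accuracy_score

# fraction of rows whose true label is among the 5 highest-probability classes
probas = clf.predict_proba(X_test_counts)
print(top_k_accuracy_score(valid_y, probas, k=5, labels=clf.classes_))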

Code:

# Set up environment
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
import numpy as np

#Read in data and do just a bit of preprocessing

# User's Location of git repository
Git_Location = 'C:/Documents'

# Set Data Location:
data = Git_Location + 'Data.csv'

# load the data
df = pd.read_csv(data,low_memory=False,thousands=',', encoding='latin-1')
df = df[['CODE','Description']] #select only these columns
df = df.rename(index=float, columns={"CODE": "label", "Description": "text"})

#Convert label to float so you don't need to encode for processing later on
df['label'] = df['label'].str.replace('-', '', regex=True).str.strip()
df['label'] = df['label'].astype('float64')  # astype raises on bad values by default (raise_on_error was removed from pandas)

# drop any labels with count LT 500 to build a strong model and make our testing run faster -- we will get more data later
df = df.groupby('label').filter(lambda x : len(x)>500)

#split data into testing and training
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df.text, df.label,test_size=0.33, random_state=6,stratify=df.label)

# Other examples online use the following data types... we will do the same to remain consistent
train_y_npar = pd.Series(train_y).values
train_x_list = pd.Series.tolist(train_x)
valid_x_list = pd.Series.tolist(valid_x)

# cast validation datasets to dataframes to allow merging later on
valid_x_df = pd.DataFrame(valid_x)
valid_y_df = pd.DataFrame(valid_y)


# Extracting features from data
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_x_list)
X_test_counts = count_vect.transform(valid_x_list)

# Define the model training and validation function
def TV_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):

# fit the training dataset on the classifier
classifier.fit(feature_vector_train, label)

# predict the top n labels on validation dataset
n = 5
#classifier.probability = True
probas = classifier.predict_proba(feature_vector_valid)
predictions = classifier.predict(feature_vector_valid)

#Identify the indexes of the top predictions
top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]

#then find the associated SOC code for each prediction
top_class = classifier.classes_[top_n_predictions]

#cast to a new dataframe
top_class_df = pd.DataFrame(data=top_class)

#merge it up with the validation labels and descriptions
results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
results = pd.merge(results, top_class_df, left_index=True, right_index=True)


top5_conditions = [
(results.iloc[:,0] == results[0]),
(results.iloc[:,0] == results[1]),
(results.iloc[:,0] == results[2]),
(results.iloc[:,0] == results[3]),
(results.iloc[:,0] == results[4])]
top5_choices = [1, 1, 1, 1, 1]

#Top 1 Result
#top1_conditions = [(results['0_x'] == results[4])]
top1_conditions = [(results.iloc[:,0] == results[4])]
top1_choices = [1]

# Create the success columns
results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)

print("Are Top 5 Results greater than Top 1 Result?: ", (sum(results['Top 5 Successes'])/results.shape[0])>(metrics.accuracy_score(valid_y, predictions)))
print("Are Top 1 Results equal from predict() and predict_proba()?: ", (sum(results['Top 1 Successes'])/results.shape[0])==(metrics.accuracy_score(valid_y, predictions)))

print(" ")
print("Details: ")
print("Top 5 Accuracy Rate (predict_proba)= ", sum(results['Top 5 Successes'])/results.shape[0])
print("Top 1 Accuracy Rate (predict_proba)= ", sum(results['Top 1 Successes'])/results.shape[0])
print("Top 1 Accuracy Rate = (predict)=", metrics.accuracy_score(valid_y, predictions))

Example output using the twenty newsgroups dataset built into scikit-learn (this is my target). Note: I ran this exact code on another dataset and was able to produce these results, which tells me the function and its dependencies work, so the problem must somehow be in the data.

Are Top 5 Results greater than Top 1 Result?:  True 
Are Top 1 Results equal from predict() and predict_proba()?: True

Details:

Top 5 Accuracy Rate (predict_proba)=  0.9583112055231015 
Top 1 Accuracy Rate (predict_proba)= 0.8069569835369091
Top 1 Accuracy Rate (predict)= 0.8069569835369091
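(For anyone reproducing this: a minimal sketch, with assumed column names matching the script above, of loading the built-in twenty newsgroups data into the same df shape:)

from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# load the training split as (label, text) pairs, mirroring the CSV layout above
news = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
df = pd.DataFrame({'label': news.target, 'text': news.data})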

Now running it on my data:

TV_model(LogisticRegression(), X_train_counts, train_y_npar, X_test_counts, valid_y_df, valid_x_df)

Output:

Are Top 5 Results greater than Top 1 Result?:  False 
Are Top 1 Results equal from predict() and predict_proba()?: False

Details:

Top 5 Accuracy Rate (predict_proba)= 0.6581632653061225
Top 1 Accuracy Rate (predict_proba)= 0.2010204081632653
Top 1 Accuracy Rate (predict)= 0.8091187478734263

Best Answer

Update: found the solution! It turns out the indexes had gotten out of sync: train_test_split keeps the original row indexes on the validation Series, while the top-class DataFrame built inside the function gets a fresh 0-based index, so the index-based merges paired rows incorrectly. All I needed to do was reset the validation dataset indexes right after the train/test split.
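(A minimal, self-contained sketch with made-up data of why the index-based merge silently misaligns rows, and why reset_index(drop=True) fixes it:)

import pandas as pd

# after train_test_split, the validation Series keeps its original row labels...
valid_y = pd.Series(['A', 'B', 'C'], index=[7, 2, 5], name='label')
# ...while the top-predictions DataFrame gets a fresh 0-based index
top_class_df = pd.DataFrame({0: ['A', 'B', 'C']})  # row order matches valid_y

# merging on index keeps only the overlapping index 2, paired with the wrong prediction
bad = pd.merge(valid_y.to_frame(), top_class_df, left_index=True, right_index=True)
print(bad)  # one misaligned row: label 'B' vs prediction 'C'

# resetting the index restores positional alignment
good = pd.merge(valid_y.reset_index(drop=True).to_frame(), top_class_df, left_index=True, right_index=True)
print(good)  # three rows, correctly paired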

Updated code:

# Set up environment
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, model_selection
from sklearn.feature_extraction.text import CountVectorizer

import pandas as pd
import numpy as np

#Read in data and do just a bit of preprocessing

# User's Location of git repository
Git_Location = 'C:/Documents'

# Set Data Location:
data = Git_Location + 'Data.csv'

# load the data
df = pd.read_csv(data,low_memory=False,thousands=',', encoding='latin-1')
df = df[['CODE','Description']] #select only these columns
df = df.rename(index=float, columns={"CODE": "label", "Description": "text"})

#Convert label to float so you don't need to encode for processing later on
df['label'] = df['label'].str.replace('-', '', regex=True).str.strip()
df['label'] = df['label'].astype('float64')  # astype raises on bad values by default (raise_on_error was removed from pandas)

# drop any labels with count LT 500 to build a strong model and make our testing run faster -- we will get more data later
df = df.groupby('label').filter(lambda x : len(x)>500)

#split data into testing and training
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df.text, df.label,test_size=0.33, random_state=6,stratify=df.label)

#reset the index so the validation rows align positionally with the predictions dataframe
valid_y = valid_y.reset_index(drop=True)
valid_x = valid_x.reset_index(drop=True)

# Other examples online use the following data types... we will do the same to remain consistent
train_y_npar = pd.Series(train_y).values
train_x_list = pd.Series.tolist(train_x)
valid_x_list = pd.Series.tolist(valid_x)

# cast validation datasets to dataframes to allow merging later on
valid_x_df = pd.DataFrame(valid_x)
valid_y_df = pd.DataFrame(valid_y)


# Extracting features from data
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_x_list)
X_test_counts = count_vect.transform(valid_x_list)

# Define the model training and validation function
def TV_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):

    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the top n labels on the validation dataset
    n = 5
    #classifier.probability = True
    probas = classifier.predict_proba(feature_vector_valid)
    predictions = classifier.predict(feature_vector_valid)

    # Identify the indexes of the top predictions (argsort is ascending, so the last column holds the best class)
    top_n_predictions = np.argsort(probas, axis=1)[:, -n:]

    # then find the associated SOC code for each prediction
    top_class = classifier.classes_[top_n_predictions]

    # cast to a new dataframe
    top_class_df = pd.DataFrame(data=top_class)

    # merge it up with the validation labels and descriptions
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_class_df, left_index=True, right_index=True)

    top5_conditions = [
        (results.iloc[:, 0] == results[0]),
        (results.iloc[:, 0] == results[1]),
        (results.iloc[:, 0] == results[2]),
        (results.iloc[:, 0] == results[3]),
        (results.iloc[:, 0] == results[4])]
    top5_choices = [1, 1, 1, 1, 1]

    # Top 1 Result (column 4 holds the highest-probability class)
    #top1_conditions = [(results['0_x'] == results[4])]
    top1_conditions = [(results.iloc[:, 0] == results[4])]
    top1_choices = [1]

    # Create the success columns
    results['Top 5 Successes'] = np.select(top5_conditions, top5_choices, default=0)
    results['Top 1 Successes'] = np.select(top1_conditions, top1_choices, default=0)

    print("Are Top 5 Results greater than Top 1 Result?: ", (sum(results['Top 5 Successes'])/results.shape[0]) > (metrics.accuracy_score(valid_y, predictions)))
    print("Are Top 1 Results equal from predict() and predict_proba()?: ", (sum(results['Top 1 Successes'])/results.shape[0]) == (metrics.accuracy_score(valid_y, predictions)))

    print(" ")
    print("Details: ")
    print("Top 5 Accuracy Rate (predict_proba)= ", sum(results['Top 5 Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate (predict_proba)= ", sum(results['Top 1 Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate (predict)= ", metrics.accuracy_score(valid_y, predictions))

Regarding "python - Inconsistent results between predict() and predict_proba() when using scikit-learn for multiclass text classification", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54972802/
