
python - Document classification with scikit-learn: most efficient way to get the words (tokens) that impacted the classification most

Reposted · Author: 行者123 · Updated: 2023-11-30 09:24:44

I built a binary document classifier using a tf-idf representation of the training set of documents, with logistic regression applied on top:

lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])

lr_tfidf.fit(X_train, y_train)

I saved the model in pickle format and use it to classify new documents, obtaining the probability that a document belongs to class A and the probability that it belongs to class B:

text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba([new_document])  # predict_proba expects an iterable of documents

What is the best way to get the words (or, more generally, the tokens) that had the greatest impact on the classification? I would like to obtain:

  • the N tokens contained in the document that have the highest coefficients as features in the logistic regression model
  • the N tokens contained in the document that have the lowest coefficients as features in the logistic regression model

I am using sklearn v0.19.

Best Answer

There is a solution on GitHub that prints the most informative features obtained from a classifier inside a pipeline:

https://gist.github.com/bbengfort/044682e76def583a12e6c09209c664a1

You want the show_most_informative_features function from that script. I have used it and it works well.

Here is a copy of the code from that GitHub gist:

from operator import itemgetter  # needed by the sort below


def show_most_informative_features(model, text=None, n=20):
    """
    Accepts a Pipeline with a classifier and a TfidfVectorizer and computes
    the n most informative features of the model. If text is given, then will
    compute the most informative features for classifying that text.

    Note that this function will only work on linear models with coef_
    """
    # Extract the vectorizer and the classifier from the pipeline
    vectorizer = model.named_steps['vectorizer']
    classifier = model.named_steps['classifier']

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {} model.".format(
                classifier.__class__.__name__
            )
        )

    if text is not None:
        # Compute the tf-idf vector for the text
        tvec = model.transform([text]).toarray()
    else:
        # Otherwise simply use the model coefficients
        tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], vectorizer.get_feature_names()),
        key=itemgetter(0), reverse=True
    )

    # Pair the n most positive features with the n most negative ones
    topn = zip(coefs[:n], coefs[:-(n + 1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append("Classified as: {}".format(model.predict([text])))
        output.append("")

    # Create two columns with most positive and most negative features
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15}    {:0.4f}{: >15}".format(cp, fnp, cn, fnn)
        )

    return "\n".join(output)

A similar question about getting the words (tokens) that impacted a scikit-learn document classification most can be found on Stack Overflow: https://stackoverflow.com/questions/48401148/
