
python - Document classification with scikit-learn: most efficient way to get the words (tokens) that impacted the classification most

Reposted · Author: 行者123 · Updated: 2023-11-30 09:24:44

I built a binary document classifier using a tf-idf representation of the training set of documents, with logistic regression applied on top:

lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])

lr_tfidf.fit(X_train, y_train)

I saved the model in pickle format and use it to classify new documents, obtaining the probability that a document belongs to class A and the probability that it belongs to class B:

text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba([new_document])  # predict_proba expects an iterable of documents

What is the best way to get the words (or, more generally, the tokens) that had the greatest impact on the classification? I would like to obtain:

  • the N tokens contained in the document that have the highest coefficients as features in the logistic regression model
  • the N tokens contained in the document that have the lowest coefficients as features in the logistic regression model

I am using sklearn v0.19.

Best Answer

There is a solution on GitHub that prints the most informative features obtained from a classifier inside a pipeline:

https://gist.github.com/bbengfort/044682e76def583a12e6c09209c664a1

You want the show_most_informative_features function from that script. I have used it and it works well.

Here is a copy of the code from that GitHub gist:

from operator import itemgetter  # needed by the sort below


def show_most_informative_features(model, text=None, n=20):
    """
    Accepts a Pipeline with a classifier and a TfidfVectorizer and computes
    the n most informative features of the model. If text is given, then will
    compute the most informative features for classifying that text.

    Note that this function will only work on linear models with coef_
    """
    # Extract the vectorizer and the classifier from the pipeline
    vectorizer = model.named_steps['vectorizer']
    classifier = model.named_steps['classifier']

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {} model.".format(
                classifier.__class__.__name__
            )
        )

    if text is not None:
        # Compute the tf-idf vector for the text
        tvec = model.transform([text]).toarray()
    else:
        # Otherwise simply use the model coefficients
        tvec = classifier.coef_

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], vectorizer.get_feature_names()),
        key=itemgetter(0), reverse=True
    )

    # Pair the n most positive features with the n most negative ones
    topn = zip(coefs[:n], coefs[:-(n + 1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append("Classified as: {}".format(model.predict([text])))
        output.append("")

    # Create two columns with most positive and most negative features
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15}    {:0.4f}{: >15}".format(cp, fnp, cn, fnn)
        )

    return "\n".join(output)

A similar question about getting the words (tokens) that impacted a scikit-learn document classification most can be found on Stack Overflow: https://stackoverflow.com/questions/48401148/
