Python Pandas NLTK: extracting common phrases (ngrams) from a text field in a DataFrame, 'join() argument' error

Reposted by: 行者123 | Updated: 2023-11-28 17:02:23


I have the following sample DataFrame:

No  category    problem_definition_stopwords
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

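The 'problem_definition_stopwords' field has already been tokenized, with stop words removed.

I want to create n-grams from the 'problem_definition_stopwords' field. Specifically, I want to extract n-grams from my data and find the ones with the highest pointwise mutual information (PMI).

Essentially, I want to find words that co-occur far more often than I would expect by chance.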
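To make the target concrete, PMI for a bigram can be sketched in plain Python as log2 of the observed pair count times the total token count, divided by the product of the two word counts. This is only an illustration on the sample rows above, assuming probabilities are estimated from raw counts and bigrams are counted within a row only; the names `docs`, `unigrams`, `bigrams`, and `pmi` are made up for this sketch.

```python
import math
from collections import Counter

# The four token lists from the sample DataFrame column.
docs = [
    ["coffee", "maker", "brewing", "properly", "2", "420", "420", "420"],
    ["galley", "work", "table", "stuck"],
    ["cloth", "stuck"],
    ["stuck", "coffee"],
]

# Word counts and adjacent-pair counts, never pairing across rows.
unigrams = Counter(tok for doc in docs for tok in doc)
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
total = sum(unigrams.values())  # 16 tokens in the sample


def pmi(w1, w2):
    """PMI from raw counts: log2(count(w1, w2) * N / (count(w1) * count(w2)))."""
    return math.log2(bigrams[(w1, w2)] * total / (unigrams[w1] * unigrams[w2]))


print(pmi("coffee", "maker"))      # 3.0
print(pmi("brewing", "properly"))  # 4.0
```

A pair built from rare words scores higher than one built from frequent words, which is why a frequency filter is usually applied before ranking by PMI.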

I tried the following code:

import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# errored out here 
finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words(df['problem_definition_stopwords']))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3) 

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)

The error I get is on the third block of code: TypeError: join() argument must be str or bytes, not 'list'

Edit: the DataFrame in a more portable format:

>>> df.columns
Index(['No', 'category', 'problem_definition_stopwords'], dtype='object')
>>> df.to_dict()
{'No': {0: 175, 1: 211, 2: 912, 3: 572}, 'category': {0: 2521, 1: 1438, 2: 2698, 3: 2521}, 'problem_definition_stopwords': {0: ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'], 1: ['galley', 'work', 'table', 'stuck'], 2: ['cloth', 'stuck'], 3: ['stuck', 'coffee']}}

Best answer

It doesn't look like you are calling from_words correctly. Look at help(nltk.corpus.genesis.words):

Help on method words in module nltk.corpus.reader.plaintext:

words(fileids=None) method of nltk.corpus.reader.plaintext.PlaintextCorpusReader instance
    :return: the given file(s) as a list of words
        and punctuation symbols.
    :rtype: list(str)
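The help text shows that words() takes file ids, not tokens. The error message itself is the kind os.path.join raises when handed a list, so presumably the corpus reader tries to join its root directory with each "file id" it receives. A minimal reproduction using only the standard library (the names `corpus_root` and `not_a_fileid` are hypothetical stand-ins, and the exact wording of the message varies by Python version):

```python
import os

# Hypothetical stand-ins: a corpus root directory and a "file id" that is
# really a list of tokens, like one row of the DataFrame column.
corpus_root = "corpora/genesis"
not_a_fileid = ["cloth", "stuck"]

error_message = None
try:
    os.path.join(corpus_root, not_a_fileid)
except TypeError as exc:
    # os.path.join only accepts str, bytes, or path-like arguments.
    error_message = str(exc)

print(error_message)
```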

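Is the following what you are looking for? Since you have already represented your documents as lists of strings, which plays nicely with NLTK in my experience, I think you can use the from_documents method: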

finder = BigramCollocationFinder.from_documents(
    df['problem_definition_stopwords']
)

# keep bigrams that appear at least once; the question used 3+,
# but I lowered this to 1 since the corpus you provided
# is very small and it'll be tough to find repeat ngrams
finder.apply_freq_filter(1)

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10) 

[('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing'), ('properly', '2'), ('work', 'table'), ('coffee', 'maker'), ('2', '420'), ('cloth', 'stuck'), ('table', 'stuck'), ('420', '420')]
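For anyone who wants to run this without the DataFrame, here is a self-contained sketch using plain lists (with pandas, df['problem_definition_stopwords'].tolist() would give the same structure). It also contrasts from_documents with from_words on a flattened token list, assuming you care whether bigrams may span row boundaries; the names `docs`, `per_doc`, `flat`, and `top` are made up for this example.

```python
from itertools import chain

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Same token lists as the sample DataFrame column.
docs = [
    ["coffee", "maker", "brewing", "properly", "2", "420", "420", "420"],
    ["galley", "work", "table", "stuck"],
    ["cloth", "stuck"],
    ["stuck", "coffee"],
]

bigram_measures = BigramAssocMeasures()

# from_documents treats each row as its own document.
per_doc = BigramCollocationFinder.from_documents(docs)

# from_words wants one flat sequence of tokens, so the rows are chained.
flat = BigramCollocationFinder.from_words(list(chain.from_iterable(docs)))

# ('420', 'galley') straddles rows 1 and 2: only the flattened finder counts it.
print(per_doc.ngram_fd[("420", "galley")])  # 0
print(flat.ngram_fd[("420", "galley")])     # 1

# Top three bigrams by PMI, no frequency filter applied.
top = per_doc.nbest(bigram_measures.pmi, 3)
print(top)
# [('brewing', 'properly'), ('galley', 'work'), ('maker', 'brewing')]
```

If cross-row bigrams are noise for your data, from_documents is the safer choice; flattening is only reasonable when the rows really form one continuous text.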

For more on this topic, see the similar question on Stack Overflow: https://stackoverflow.com/questions/53560439/
