python - Counter() and most_common

Reposted by: 行者123 Updated: 2023-12-04 08:04:17

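I am using Counter() to count the words in an Excel file.
My goal is to get the most frequent words in the document.
The problem is that Counter() does not work properly with my file.
Here is the code: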

#1. Building a Counter with bag-of-words

import pandas as pd
df = pd.read_excel('combined_file.xlsx', index_col=None)
import nltk

from nltk.tokenize import word_tokenize

# Tokenize the article: tokens
df['tokens'] = df['body'].apply(nltk.word_tokenize)

# Convert the tokens column into a list of token lists
df_tokens_list = df.tokens.tolist()

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [[string.lower() for string in sublist] for sublist in df_tokens_list]

# Import Counter

from collections import Counter

# Create a Counter with the lowercase tokens: bow_simple

bow_simple = Counter(x for xs in lower_tokens for x in set(xs))

# Print the 10 most common tokens
print(bow_simple.most_common(10))

#2. Text preprocessing practice

# Import WordNetLemmatizer

from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in bow_simple if t.isalpha()]

# Remove all stop words: no_stops 
from nltk.corpus import stopwords

no_stops = [t for t in alpha_only if t not in stopwords.words("english")]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)
print(bow)
# Print the 10 most common tokens
print(bow.most_common(10))

The most common words after preprocessing are:

[('dry', 3), ('try', 3), ('clean', 3), ('love', 2), ('one', 2), ('serum', 2), ('eye', 2), ('boot', 2), ('woman', 2), ('cream', 2)]

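That is not what you get when you count these words by hand in Excel.
Do you know what might be wrong with my code? I would appreciate any help with this.
The link to the file is here: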
https://www.dropbox.com/scl/fi/43nu0yf45obbyzprzc86n/combined_file.xlsx?dl=0&rlkey=7j959kz0urjxflf6r536brppt

Best answer

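The problem is that bow_simple is a Counter, and you go on to process it as if it were a list of tokens. Iterating over a Counter yields each distinct word only once, so the final result merely counts how many variants of each word in the Counter collapse to the same form after lowercasing and nltk processing. The solution is to create a flattened word list and feed that into alpha_only: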

# Create a Counter with the lowercase tokens: bow_simple
wordlist = [item for sublist in lower_tokens for item in sublist] #flatten list of lists
bow_simple = Counter(wordlist)

Then use wordlist in alpha_only:

alpha_only = [t for t in wordlist if t.isalpha()]

Output:

[('eye', 3617), ('product', 2567), ('cream', 2278), ('skin', 1791), ('good', 1081), ('use', 1006), ('really', 984), ('using', 928), ('feel', 798), ('work', 785)]
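
To see why the original counts were so small, here is a minimal sketch on made-up data. The token lists and the crude_lemma helper are hypothetical: the helper merely strips a trailing "s" as a stand-in for WordNetLemmatizer so that the example runs without nltk. It compares feeding the Counter itself into the later steps against feeding in the flat word list:

```python
from collections import Counter

# Hypothetical toy data standing in for lower_tokens (not from the Excel file)
lower_tokens = [
    ["eye", "cream", "eye"],
    ["eye", "creams", "dry"],
]

def crude_lemma(token):
    # Hypothetical stand-in for WordNetLemmatizer.lemmatize: drops a trailing "s"
    return token[:-1] if token.endswith("s") else token

# Flatten the list of lists and count every occurrence
wordlist = [item for sublist in lower_tokens for item in sublist]
bow_simple = Counter(wordlist)

# Iterating over the Counter yields each distinct key once, so the final
# counts only reflect how many variants collapse to the same lemma
from_counter = Counter(crude_lemma(t) for t in bow_simple if t.isalpha())

# Iterating over the flat list keeps every occurrence
from_wordlist = Counter(crude_lemma(t) for t in wordlist if t.isalpha())

print(from_counter)   # "eye" is counted once here
print(from_wordlist)  # "eye" is counted three times here
```

In the first result "cream" outranks "eye" only because two spellings ("cream" and "creams") map to the same lemma, which is the same effect that produced counts like ('dry', 3) in the question.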

Regarding python - Counter() and most_common, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66304912/
