
python-3.x - Count of the most popular words in a pandas DataFrame


I am working with a csv file that contains movie data. In this dataset there is a column named plot_keywords. I want to find the 10 or 20 most popular keywords, how many times each one shows up, and plot them in a bar chart. To be more specific, here are 2 instances copied as they appear when I print the DataFrame:

9     blood|book|love|potion|professor

18    black mustache|captain|pirate|revenge|soldier

I open the csv file as a pandas DataFrame. This is the code I have so far:

import pandas as pd
data = pd.read_csv('data.csv')
pd.Series(' '.join(data['plot_keywords']).lower().split()).value_counts()[:10]

So far no other post has helped me.
Thanks in advance.

https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset/kernels

Best Answer

Here is an NLTK solution that ignores English stopwords (e.g. in, on, of, the, etc.):

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import nltk

top_N = 10

df = pd.read_csv(r'/path/to/imdb-5000-movie-dataset.zip',
                 usecols=['movie_title', 'plot_keywords'])

# replace the '|' separators with spaces and join all rows into one string
txt = df.plot_keywords.str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)

print('All frequencies, including STOPWORDS:')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)

rslt = pd.DataFrame(words_except_stop_dist.most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')

matplotlib.style.use('ggplot')

rslt.plot.bar(rot=0)
plt.show()
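
Note: word_tokenize and the stopword list rely on NLTK data packages that are not installed by default. If they are missing, a one-time download is needed (a minimal sketch; 'punkt' and 'stopwords' are the standard NLTK corpus identifiers):

import nltk

# one-time download of the tokenizer model and the English stopword list
nltk.download('punkt')
nltk.download('stopwords')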

Output:
All frequencies, including STOPWORDS:
============================================================
      Word  Frequency
0       in        339
1   female        301
2    title        289
3   nudity        259
4     love        248
5       on        240
6   school        238
7   friend        228
8       of        222
9      the        212
============================================================

[bar chart of the top-10 keyword frequencies, stopwords excluded]

Pandas solution, which uses the stopword list from the NLTK module:
from collections import Counter
import pandas as pd
import nltk

top_N = 10

df = pd.read_csv(r'/path/to/imdb-5000-movie-dataset.zip',
                 usecols=['movie_title', 'plot_keywords'])

stopwords = nltk.corpus.stopwords.words('english')
# RegEx for stopwords
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# replace '|' --> ' ' and drop all stopwords
words = (df.plot_keywords
           .str.lower()
           .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
           .str.cat(sep=' ')
           .split()
)

# generate DF out of Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(rslt)

# plot
rslt.plot.bar(rot=0, figsize=(16, 10), width=0.8)
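
When this is run as a plain script rather than in a notebook, the figure also needs an explicit call to be displayed; a minimal addition, assuming matplotlib is available:

import matplotlib.pyplot as plt

# render the bar chart created by rslt.plot.bar(...)
plt.show()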

Output:
        Frequency
Word
female        301
title         289
nudity        259
love          248
school        238
friend        228
police        210
male          205
death         195
sex           192
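
Note that both solutions count individual words, so a multi-word keyword such as black mustache is split into two tokens. To count whole '|'-separated keywords instead, a minimal sketch (assuming the same plot_keywords column and dropping missing values) could look like this:

import pandas as pd

df = pd.read_csv(r'/path/to/imdb-5000-movie-dataset.zip',
                 usecols=['plot_keywords'])

# one keyword per row: split each cell on '|' and expand the lists
keywords = (df.plot_keywords
              .dropna()
              .str.lower()
              .str.split('|')
              .explode())

print(keywords.value_counts().head(10))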

Regarding python-3.x - counting the most popular words in a pandas DataFrame, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40206249/
