
python - Counting sub-word frequencies in a pandas DataFrame

Reposted. Author: 太空宇宙. Updated: 2023-11-03 15:10:34

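I have a pandas.DataFrame with 2 columns: the alcohol type (e.g. VODKA 80 PROOF, CANADIAN WHISKIES, SPICED RUM) and the number of bottles sold. I want to first group these into coarser categories (WHISKIES, VODKA, RUM) and then sum the bottles sold for each category.

My code does not let me isolate labels such as "VODKA"; instead it returns frequencies for full categories such as "VODKA 80 PROOF".

In: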

import nltk
import pandas as pd

top_N = 10  # top 10 most used categories

# FreqDist counts each full 'Category Name' string as one item
word_dist = nltk.FreqDist(df['Category Name'])

print('All frequencies:')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)

df = df.groupby('Category Name')['Bottles Sold'].sum()

Output:

All frequencies:
============================================================
                               Word  Frequency
0                    VODKA 80 PROOF      35373
1                 CANADIAN WHISKIES      27087
2         STRAIGHT BOURBON WHISKIES      15342
3                        SPICED RUM      14631
4                    VODKA FLAVORED      14001
5                           TEQUILA      12109
6                  BLENDED WHISKIES      11547
7                   WHISKEY LIQUEUR      10902
8                    IMPORTED VODKA      10668
9  PUERTO RICO & VIRGIN ISLANDS RUM      10062
============================================================

Any ideas?
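
The behaviour described in the question follows from how the counter is fed: `nltk.FreqDist` is a subclass of `collections.Counter`, so passing it the whole column counts each full string as one item. A minimal sketch, using `Counter` directly and a tiny made-up `sample` Series (an assumption for illustration, not the asker's data), contrasts counting full strings with counting individual words:

```python
from collections import Counter

import pandas as pd

# Hypothetical sample standing in for df['Category Name'].
sample = pd.Series([
    'VODKA 80 PROOF',
    'VODKA 80 PROOF',
    'IMPORTED VODKA',
    'SPICED RUM',
])

# Counting the column as-is: each full string is one item.
whole = Counter(sample)

# Splitting on whitespace first: each word is one item.
words = Counter(word for name in sample for word in name.split())

print(whole.most_common(1))  # most frequent full string
print(words['VODKA'])        # occurrences of the single word VODKA
```

Word-level counts like these only give frequencies; summing bottles per coarse category still needs a category label on each row.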

Best answer

Have you considered adding a category column by matching words? Something like this:

Code:

categories = {'VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR'}
# Label each product with the category word its name contains
df['category'] = df['product'].apply(lambda x:
    [c for c in categories if c in x][0])
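
Taking the first match with `[0]` raises `IndexError` when a product name contains none of the category words. A minimal sketch of a more forgiving variant follows. The helper name `first_category`, the fallback label `'OTHER'`, and the product `'AMERICAN DRY GINS'` are all hypothetical and not part of the original answer; a list replaces the set so that the match order is predictable:

```python
import pandas as pd

# Ordered list (not a set) so the first match is deterministic.
CATEGORIES = ['VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR']


def first_category(name, default='OTHER'):
    """Return the first category word found in name, or default if none match."""
    return next((c for c in CATEGORIES if c in name), default)


df = pd.DataFrame(
    {'product': ['VODKA 80 PROOF', 'SPICED RUM', 'AMERICAN DRY GINS']}
)
df['category'] = df['product'].apply(first_category)
print(df)
```

Note that `c in name` is a plain substring test, so a short category word can also match inside a longer, unrelated word.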

Test code:

import pandas as pd

# Sample rows: product name and number sold (stored as integers)
data = [
    ['VODKA 80 PROOF', 35373],
    ['CANADIAN WHISKIES', 27087],
    ['STRAIGHT BOURBON WHISKIES', 15342],
    ['SPICED RUM', 14631],
    ['VODKA FLAVORED', 14001],
    ['TEQUILA', 12109],
    ['BLENDED WHISKIES', 11547],
    ['WHISKEY LIQUEUR', 10902],
    ['IMPORTED VODKA', 10668],
    ['PUERTO RICO & VIRGIN ISLANDS RUM', 10062],
]
df = pd.DataFrame(data, columns=['product', 'count'])

categories = {'VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR'}
# Label each product with the category word its name contains
df['category'] = df['product'].apply(lambda x:
    [c for c in categories if c in x][0])
print(df)
# Total count per category
print(df.groupby('category')['count'].sum())

Result:

                            product  count  category
0                    VODKA 80 PROOF  35373     VODKA
1                 CANADIAN WHISKIES  27087  WHISKIES
2         STRAIGHT BOURBON WHISKIES  15342  WHISKIES
3                        SPICED RUM  14631       RUM
4                    VODKA FLAVORED  14001     VODKA
5                           TEQUILA  12109   TEQUILA
6                  BLENDED WHISKIES  11547  WHISKIES
7                   WHISKEY LIQUEUR  10902   LIQUEUR
8                    IMPORTED VODKA  10668     VODKA
9  PUERTO RICO & VIRGIN ISLANDS RUM  10062       RUM

category
LIQUEUR     10902
RUM         24693
TEQUILA     12109
VODKA       60042
WHISKIES    53976
Name: count, dtype: int32
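
As an alternative to `apply` with a lambda, the same labelling can be sketched with pandas' vectorized `Series.str.extract` and a regular expression built from the category list. This is one possible variant rather than the answer's method; it uses a subset of the test rows above, and the `pattern` and `totals` names are introduced here for illustration:

```python
import pandas as pd

# Subset of the test data above.
df = pd.DataFrame({
    'product': ['VODKA 80 PROOF', 'CANADIAN WHISKIES',
                'SPICED RUM', 'IMPORTED VODKA'],
    'count': [35373, 27087, 14631, 10668],
})

categories = ['VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR']

# Pattern of the form \b(VODKA|WHISKIES|...)\b; \b restricts matches to whole words.
pattern = r'\b(' + '|'.join(categories) + r')\b'

# expand=False returns a Series; rows with no match become NaN.
df['category'] = df['product'].str.extract(pattern, expand=False)

# groupby drops NaN keys by default, so unmatched rows are left out of the totals.
totals = df.groupby('category')['count'].sum()
print(totals)
```

With this approach, products that match no category are silently excluded from the sums instead of raising an error, which may or may not be what is wanted.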

Regarding python - Counting sub-word frequencies in a pandas DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44232190/
