
python - How to create a bag of words from a pandas DataFrame

Reposted. Author: 太空狗. Updated: 2023-10-29 21:45:14

Here is my DataFrame:

    CATEGORY           BRAND
0     Noodle        Anak Mas
1     Noodle        Anak Mas
2     Noodle         Indomie
3     Noodle         Indomie
4     Noodle         Indomie
23    Noodle         Indomie
24    Noodle  Mi Telor Cap 3
25    Noodle  Mi Telor Cap 3
26    Noodle         Pop Mie
27    Noodle         Pop Mie
...

I have made sure the df columns are of string type. My code is:

df = data[['CATEGORY', 'BRAND']].astype(str)
import collections, re
texts = df
bagsofwords = [collections.Counter(re.findall(r'\w+', txt))
               for txt in texts]
sumbags = sum(bagsofwords, collections.Counter())

When I call

sumbags

the output is

 Counter({'BRAND': 1, 'CATEGORY': 1})

I want sumbags to count all of the data, excluding the headers. To make it clear, something like:

Counter({'Noodle': 10, 'Indomie': 4, 'Anak': 2, ....}) # because it is bag of words

I need the count of every word.

Best answer

IIUC, use one of the following:

Option 1] NumPy flatten + split

In [2535]: collections.Counter([y for x in df.values.flatten() for y in x.split()])
Out[2535]:
Counter({'3': 2,
         'Anak': 2,
         'Cap': 2,
         'Indomie': 4,
         'Mas': 2,
         'Mi': 2,
         'Mie': 2,
         'Noodle': 10,
         'Pop': 2,
         'Telor': 2})

Option 2] Use value_counts()

In [2536]: pd.Series([y for x in df.values.flatten() for y in x.split()]).value_counts()
Out[2536]:
Noodle     10
Indomie     4
Mie         2
Pop         2
Anak        2
Mi          2
Cap         2
Telor       2
Mas         2
3           2
dtype: int64

Option 3] Use stack + value_counts

In [2582]: df.apply(lambda x: x.str.split(expand=True).stack()).stack().value_counts()
Out[2582]:
Noodle     10
Indomie     4
Mie         2
Pop         2
Anak        2
Mi          2
Cap         2
Telor       2
Mas         2
3           2
dtype: int64
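
A possible alternative to the stack-based one-liner above, not part of the original answer: on pandas 0.25 or later, Series.explode can flatten the per-cell word lists. This is only a sketch. The DataFrame is rebuilt from the ten rows visible in the question (the rows elided there are not included), and the name counts is chosen for illustration.

```python
import pandas as pd

# Assumption: only the ten rows shown in the question; elided rows are omitted.
df = pd.DataFrame(
    {
        "CATEGORY": ["Noodle"] * 10,
        "BRAND": [
            "Anak Mas", "Anak Mas",
            "Indomie", "Indomie", "Indomie", "Indomie",
            "Mi Telor Cap 3", "Mi Telor Cap 3",
            "Pop Mie", "Pop Mie",
        ],
    },
    index=[0, 1, 2, 3, 4, 23, 24, 25, 26, 27],
).astype(str)

# stack(): one Series holding every cell of both columns.
# str.split(): each cell becomes a list of whitespace-separated words.
# explode(): one row per word; value_counts(): tally the words.
counts = df.stack().str.split().explode().value_counts()

print(counts)
```

The result is a Series indexed by word, so counts["Noodle"] looks up a single count and counts.to_dict() gives a plain dictionary if one is needed.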

Details

In [2516]: df
Out[2516]:
    CATEGORY           BRAND
0     Noodle        Anak Mas
1     Noodle        Anak Mas
2     Noodle         Indomie
3     Noodle         Indomie
4     Noodle         Indomie
23    Noodle         Indomie
24    Noodle  Mi Telor Cap 3
25    Noodle  Mi Telor Cap 3
26    Noodle         Pop Mie
27    Noodle         Pop Mie
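
As background on why the code in the question counted only 'CATEGORY' and 'BRAND': iterating over a DataFrame yields its column labels, not its cell values, so the regex only ever saw the two header strings. The sketch below keeps the question's Counter and regex approach and changes only what is iterated over. It assumes the DataFrame contains just the ten rows listed above, and column_labels is an illustrative name that does not appear in the original post.

```python
import collections
import re

import pandas as pd

# Assumption: the ten rows shown above stand in for the full data set.
data = pd.DataFrame(
    {
        "CATEGORY": ["Noodle"] * 10,
        "BRAND": [
            "Anak Mas", "Anak Mas",
            "Indomie", "Indomie", "Indomie", "Indomie",
            "Mi Telor Cap 3", "Mi Telor Cap 3",
            "Pop Mie", "Pop Mie",
        ],
    },
    index=[0, 1, 2, 3, 4, 23, 24, 25, 26, 27],
)
df = data[["CATEGORY", "BRAND"]].astype(str)

# Iterating over the DataFrame itself gives the column labels:
# this is what `for txt in texts` looped over in the question.
column_labels = list(df)  # ['CATEGORY', 'BRAND']

# Flatten the cell values into a 1-D array of strings and iterate over that.
texts = df.values.ravel()
bagsofwords = [collections.Counter(re.findall(r"\w+", txt)) for txt in texts]
sumbags = sum(bagsofwords, collections.Counter())

print(column_labels)
print(sumbags)
```

For this data, r'\w+' and str.split() produce the same tokens because the cells contain no punctuation. On text with punctuation the two would differ, so the choice depends on what should count as a word.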

Regarding "python - How to create a bag of words from a pandas DataFrame", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46360435/
