
python - Counting word frequency and the number of distinct ids associated with each word

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:53:54

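Besides counting how often each word occurs in my documents, I also want to count the number of distinct ids associated with that word. This is easier to explain with an example: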

from pandas import DataFrame, Series
from collections import defaultdict

d = {'ID': Series(['a', 'a', 'b', 'c', 'c', 'c']),
     'words': Series(["apple banana apple strawberry banana lemon",
                      "apple", "banana", "banana lemon", "kiwi", "kiwi lemon"])}
df = DataFrame(d)

>>> df
  ID                                       words
0  a  apple banana apple strawberry banana lemon
1  a                                       apple
2  b                                      banana
3  c                                banana lemon
4  c                                        kiwi
5  c                                  kiwi lemon

# count frequency of words using defaultdict
wc = defaultdict(int)
for line in df.words:
    linesplit = line.split()
    for word in linesplit:
        wc[word] += 1
# defaultdict(<type 'int'>, {'kiwi': 2, 'strawberry': 1, 'lemon': 3, 'apple': 3, 'banana': 4})

# turn into a DataFrame
dwc = {"word": Series(list(wc.keys())),
       "count": Series(list(wc.values()))}
dfwc = DataFrame(dwc)
>>> dfwc
   count        word
0      2        kiwi
1      1  strawberry
2      3       lemon
3      3       apple
4      4      banana

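Counting the word frequency is the easy part, as shown above. What I would like is output like the following, which also gives the number of distinct ids associated with each word: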

   count        word  ids
0      2        kiwi    1
1      1  strawberry    1
2      3       lemon    2
3      3       apple    1
4      4      banana    3

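Ideally I would like this to happen in the same pass as the word-frequency count, but I am not sure how to integrate it.

Any pointers would be much appreciated!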

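Best answer

I'm not very experienced with Pandas, but you can do something like this. This method keeps a dictionary whose keys are the words and whose values are the sets of all IDs each word appears under.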

wc = defaultdict(int)
idc = defaultdict(set)

for ID, words in zip(df.ID, df.words):
    lwords = words.split()
    for word in lwords:
        wc[word] += 1
        # You don't really need the if statement (since a set will only hold one
        # of each ID at most) but I feel like it makes things much clearer.
        if ID not in idc[word]:
            idc[word].add(ID)
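As the comment in the loop notes, the membership test is optional because a set ignores duplicate additions. A minimal sketch of the same loop without the `if`, written for Python 3 and using plain lists in place of the DataFrame columns (the names `ids`, `texts`, `row_id` and `text` are illustrative, not from the original answer):

```python
from collections import defaultdict

# Sample data mirroring the ID and words columns of the question's DataFrame.
ids = ["a", "a", "b", "c", "c", "c"]
texts = [
    "apple banana apple strawberry banana lemon",
    "apple",
    "banana",
    "banana lemon",
    "kiwi",
    "kiwi lemon",
]

wc = defaultdict(int)   # word -> total number of occurrences
idc = defaultdict(set)  # word -> set of IDs the word appears under

for row_id, text in zip(ids, texts):
    for word in text.split():
        wc[word] += 1
        # set.add does nothing when the element is already present,
        # so no membership test is needed here.
        idc[word].add(row_id)

print(dict(wc))
print({word: sorted(found) for word, found in idc.items()})
```

With the real data, `zip(df.ID, df.words)` can be passed in place of `zip(ids, texts)`.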

The resulting idc looks like this:

defaultdict(<type 'set'>, {'kiwi': set(['c']), 'strawberry': set(['a']), 'lemon': set(['a', 'c']), 'apple': set(['a']), 'banana': set(['a', 'c', 'b'])})

So you then have to take the length of each set. I used this:

lenidc = dict((key, len(value)) for key, value in idc.items())

After adding lenidc.values() to dwc under an "ids" key and initializing dfwc, I get:

   count  ids        word
0      2    1        kiwi
1      1    1  strawberry
2      3    2       lemon
3      3    1       apple
4      4    3      banana

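The pitfall of this approach is that it uses two separate dictionaries (wc and idc), and there is no guarantee that the keys (words) come out in the same order in both. You therefore want to merge the dictionaries to remove that problem. This is how I did it: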

# Makes it so the values in the wc dict are a tuple in
# (word_count, id_count) form
for key, value in lenidc.items():
    wc[key] = (wc[key], value)

# Now, when you construct dwc, for count and ids you only want to use
# the first and second tuple elements respectively.
dwc = {"word": Series(list(wc.keys())),
       "count": Series([v[0] for v in wc.values()]),
       "ids": Series([v[1] for v in wc.values()])}
dfwc = DataFrame(dwc)
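Since the question asks for both numbers in one pass, the merge step can also be avoided by keeping a single dictionary from the start. The following is only a sketch under that assumption, not the answer's own method; the helper `count_words_and_ids` and the names `ids` and `texts` are hypothetical:

```python
def count_words_and_ids(ids, texts):
    """Return one row per word (word, count, ids), sorted by word."""
    stats = {}  # word -> [occurrence count, set of IDs seen for the word]
    for row_id, text in zip(ids, texts):
        for word in text.split():
            entry = stats.setdefault(word, [0, set()])
            entry[0] += 1         # one more occurrence of this word
            entry[1].add(row_id)  # a set keeps each ID at most once
    # Count and ID set live in the same entry, so the two values for a
    # word can never get out of step with each other.
    return [
        {"word": word, "count": count, "ids": len(found)}
        for word, (count, found) in sorted(stats.items())
    ]


# Sample data mirroring the ID and words columns of the question's DataFrame.
ids = ["a", "a", "b", "c", "c", "c"]
texts = [
    "apple banana apple strawberry banana lemon",
    "apple",
    "banana",
    "banana lemon",
    "kiwi",
    "kiwi lemon",
]

rows = count_words_and_ids(ids, texts)
for row in rows:
    print(row)
```

Passing `df.ID` and `df.words` as the two arguments and then calling `DataFrame(rows)` should give a frame with word, count and ids columns; the rows come out sorted by word rather than in the order shown above.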

Source: this question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/17705938/
