gpt4 book ai didi

python - 从 pandas df 中的列创建一个二元组

转载 作者:太空宇宙 更新时间:2023-11-04 00:44:02 26 4
gpt4 key购买 nike

我在 pandas dataframe 中有这个测试表

   Leaf_category_id  session_id  product_id
0 111 1 987
3 111 4 987
4 111 1 741
1 222 2 654
2 333 3 321

这是我之前问题的延伸,@jazrael 回答了这个问题。 view answer

所以在获得 product_id 列中的值之后(只是一个假设,与我之前问题的输出略有不同,

   |product_id               |
---------------------------
|111,987,741,34,12 |
|987,1232 |
|654,12,324,465,342,324 |
|321,741,987 |
|324,654,862,467,243,754 |
|6453,123,987,741,34,12 |

等等,我想创建一个新列,其中一行中的所有值都应该与下一个和最后一个一起作为一个二元组。在行中与第一个组合,例如:

   |product_id               |Bigram
-------------------------------------------------------------------------
|111,987,741,34,12 |(111,987),**(987,741)**,(741,34),(34,12),(12,111)
|987,1232 |(987,1232),(1232,987)
|654,12,324,465,342,32 |(654,12),(12,324),(324,465),(465,342),(342,32),(32,654)
|321,741,987 |(321,741),**(741,987)**,(987,321)
|324,654,862 |(324,654),(654,862),(862,324)
|123,987,741,34,12 |(123,987),(987,741),(34,12),(12,123)

忽略**(稍后我会告诉你我为什么加星标)

实现二元组的代码是

for i in df.Leaf_category_id.unique(): 
print (df[df.Leaf_category_id == i].groupby('session_id')['product_id'].apply(lambda x: list(zip(x, x[1:]))).reset_index())

从这个 df,我想考虑 bigram 列并再创建一个名为 frequency 的列,它给出了 bigram 出现的频率。

Note* : (987,741) and (741,987) are to be considered as same and one dublicate entry should be removed and thus frequency of (987,741) should be 2. similar is the case with (34,12) it occurs two times, so frequency should be 2

   |Bigram
---------------
|(111,987),
|**(987,741)**
|(741,34)
|(34,12)
|(12,111)
|**(741,987)**
|(987,321)
|(34,12)
|(12,123)

最终结果应该是。

   |Bigram       | frequency |
--------------------------
|(111,987) | 1
|(987,741) | 2
|(741,34) | 1
|(34,12) | 2
|(12,111) | 1
|(987,321) | 1
|(12,123) | 1

我希望在这里找到答案,请帮助我,我已经尽可能详细地阐述了。

最佳答案

试试这段代码

from itertools import combinations
import pandas as pd

df = pd.DataFrame.from_csv("data.csv")
#consecutive
grouped_consecutive_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in zip(x,x[1:])]).reset_index()

df1=pd.DataFrame(grouped_consecutive_product_ids)
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna()
df2.rename(columns = {0:'Bigram'}, inplace = True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_consecutive = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_consecutive["index"]

对于组合(所有可能的二元组)

from itertools import combinations
import pandas as pd

df = pd.DataFrame.from_csv("data.csv")
#combinations
grouped_combination_product_ids = df.groupby(['Leaf_category_id','session_id'])['product_id'].apply(lambda x: [tuple(sorted(pair)) for pair in combinations(x,2)]).reset_index()

df1=pd.DataFrame(grouped_combination_product_ids)
s=df1.product_id.apply(lambda x: pd.Series(x)).unstack()
df2=pd.DataFrame(s.reset_index(level=0,drop=True)).dropna()
df2.rename(columns = {0:'Bigram'}, inplace = True)
df2["freq"] = df2.groupby('Bigram')['Bigram'].transform('count')
bigram_frequency_combinations = df2.drop_duplicates(keep="first").sort_values("Bigram").reset_index()
del bigram_frequency_combinations["index"]

其中 data.csv 包含

Leaf_category_id,session_id,product_id
0,111,1,111
3,111,4,987
4,111,1,741
1,222,2,654
2,333,3,321
5,111,1,87
6,111,1,34
7,111,1,12
8,111,1,987
9,111,4,1232
10,222,2,12
11,222,2,324
12,222,2,465
13,222,2,342
14,222,2,32
15,333,3,321
16,333,3,741
17,333,3,987
18,333,3,324
19,333,3,654
20,333,3,862
21,222,1,123
22,222,1,987
23,222,1,741
24,222,1,34
25,222,1,12

结果 bigram_frequency_consecutive 将是

         Bigram  freq
0 (12, 34) 2
1 (12, 324) 1
2 (12, 654) 1
3 (12, 987) 1
4 (32, 342) 1
5 (34, 87) 1
6 (34, 741) 1
7 (87, 741) 1
8 (111, 741) 1
9 (123, 987) 1
10 (321, 321) 1
11 (321, 741) 1
12 (324, 465) 1
13 (324, 654) 1
14 (324, 987) 1
15 (342, 465) 1
16 (654, 862) 1
17 (741, 987) 2
18 (987, 1232) 1

结果 bigram_frequency_combinations 将是

           Bigram  freq
0 (12, 32) 1
1 (12, 34) 2
2 (12, 87) 1
3 (12, 111) 1
4 (12, 123) 1
5 (12, 324) 1
6 (12, 342) 1
7 (12, 465) 1
8 (12, 654) 1
9 (12, 741) 2
10 (12, 987) 2
11 (32, 324) 1
12 (32, 342) 1
13 (32, 465) 1
14 (32, 654) 1
15 (34, 87) 1
16 (34, 111) 1
17 (34, 123) 1
18 (34, 741) 2
19 (34, 987) 2
20 (87, 111) 1
21 (87, 741) 1
22 (87, 987) 1
23 (111, 741) 1
24 (111, 987) 1
25 (123, 741) 1
26 (123, 987) 1
27 (321, 321) 1
28 (321, 324) 2
29 (321, 654) 2
30 (321, 741) 2
31 (321, 862) 2
32 (321, 987) 2
33 (324, 342) 1
34 (324, 465) 1
35 (324, 654) 2
36 (324, 741) 1
37 (324, 862) 1
38 (324, 987) 1
39 (342, 465) 1
40 (342, 654) 1
41 (465, 654) 1
42 (654, 741) 1
43 (654, 862) 1
44 (654, 987) 1
45 (741, 862) 1
46 (741, 987) 3
47 (862, 987) 1
48 (987, 1232) 1

在上面的例子中,它同时分组

关于python - 从 pandas df 中的列创建一个二元组,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/40594210/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com