
python - Comparing 2 columns group by group in pandas or Python

Reposted  Author: 行者123  Updated: 2023-12-03 07:58:25

I have a dataset and I'm not sure how to check whether its groups have similar values. Here is a sample of my dataset:

type   value
a 1
a 2
a 3
a 4

b 2
b 3
b 4
b 5

c 1
c 3
c 4



d 2
d 3
d 4


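I want to know which types are similar, in the sense that all the values of one type are also present in another type. For example, type d has the values 2, 3, 4 and type a has the values 1, 2, 3, 4, so the two are "similar" (or can be considered the same), and I would like the output to tell me that d is similar to a.

The expected output should look something like this: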


type value similarity
a 1 A is similar to B and D
a 2
a 3
a 4

b 2 b is similar to a and d
b 3
b 4
b 5

c 1 c is similar to a
c 3
c 4



d 2 d is similar to a and b
d 3
d 4


I'm not sure whether this can be done in Python or pandas, but any guidance would be much appreciated, as I'm really lost and don't know where to start.

The output also doesn't have to match the example I just gave; it could simply be another CSV that tells me which types are similar.

Best answer

I would use set operations.
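The snippets in this answer assume the question's data is already loaded in a DataFrame named `df`. As a minimal sketch (the construction of `df` and the variable names `common`, `d_in_a` and `c_in_b` are assumptions made for illustration), the sample data can be rebuilt and the two set operations used in the answer tried out like this:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

# One set of values per type
s = df.groupby("type")["value"].agg(set)

common = s["a"] & s["b"]          # intersection: values shared by a and b
d_in_a = s["d"].issubset(s["a"])  # True when every value of d also occurs in a
c_in_b = s["c"].issubset(s["b"])  # False here: 1 is in c but not in b
```

The intersection supports the "at least N values in common" reading of similarity, while `issubset` supports the "all values of one type are present in another" reading.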

Assuming similarity means having at least N items in common:

import pandas as pd
from itertools import combinations

# define minimum number of common items
N = 3

# aggregate as sets
s = df.groupby('type')['value'].agg(set)

# generate all combinations of sets
# and check if the intersection has at least N items
out = pd.Series(
    [len(a & b) >= N for a, b in combinations(s, 2)],
    index=pd.MultiIndex.from_tuples(combinations(s.index, 2)),
)

# concat and add the reversed combinations (a/b -> b/a)
# we could have used a product in the first part but this
# would have required performing the computations twice
similarity = (
    pd.concat([out, out.swaplevel()])
    .loc[lambda x: x].reset_index(-1)
    .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# update the first row of each group with the string
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)

Output:

   type  value               similarity
0     a      1  a is similar to b, c, d
1     a      2                      NaN
2     a      3                      NaN
3     a      4                      NaN
4     b      2     b is similar to d, a
5     b      3                      NaN
6     b      4                      NaN
7     b      5                      NaN
8     c      1        c is similar to a
9     c      3                      NaN
10    c      4                      NaN
11    d      2     d is similar to a, b
12    d      3                      NaN
13    d      4                      NaN
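To judge which threshold N makes sense, it can help to look at the number of shared values for every pair of types at once. The following is only a sketch under the assumption that the question's sample data is used; the names `indicator` and `common_counts` are chosen here for illustration and do not come from the answer:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

# Indicator matrix: one row per type, one column per distinct value,
# 1 where the type contains the value (clip guards against repeated values)
indicator = pd.crosstab(df["type"], df["value"]).clip(upper=1)

# Entry (i, j) counts the values that types i and j have in common;
# the diagonal holds the number of distinct values of each type
common_counts = indicator.dot(indicator.T)

print(common_counts)
```

Comparing `common_counts` against N reproduces the pairwise test `len(a & b) >= N` without an explicit loop, at the cost of building a dense type-by-value matrix.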

Assuming similarity means that one group is a subset of the other:

import pandas as pd
from itertools import combinations

# aggregate as sets
s = df.groupby('type')['value'].agg(set)

# for each pair, check whether either set is a subset of the other
out = pd.Series(
    [a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
    index=pd.MultiIndex.from_tuples(combinations(s.index, 2)),
)

# add the reversed pairs (a/b -> b/a) and build one string per type
similarity = (
    pd.concat([out, out.swaplevel()])
    .loc[lambda x: x].reset_index(-1)
    .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# update the first row of each group with the string
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)

Output:

   type  value            similarity
0     a      1  a is similar to c, d
1     a      2                   NaN
2     a      3                   NaN
3     a      4                   NaN
4     b      2     b is similar to d
5     b      3                   NaN
6     b      4                   NaN
7     b      5                   NaN
8     c      1     c is similar to a
9     c      3                   NaN
10    c      4                   NaN
11    d      2  d is similar to a, b
12    d      3                   NaN
13    d      4                   NaN
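The question also mentions that a separate CSV listing the similar types would be enough, and it describes similarity in one direction (all values of one type being present in another). A possible sketch under those assumptions follows; the column names `type` and `contained_in` and the in-memory buffer are illustrative choices, not part of the original answer:

```python
import io
from itertools import permutations

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

s = df.groupby("type")["value"].agg(set)

# Ordered pairs (x, y) where every value of x also appears in y
pairs = pd.DataFrame(
    [(x, y) for x, y in permutations(s.index, 2) if s[x] <= s[y]],
    columns=["type", "contained_in"],
)

# Written to an in-memory buffer here; pass a file path to to_csv
# instead to save the table on disk
buffer = io.StringIO()
pairs.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Unlike the symmetric check above, this keeps the direction of the relation, so a row such as `d, a` reads "all values of d are present in a".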

Regarding python - Comparing 2 columns group by group in pandas or Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/75308744/
