
python - Pandas: compare the values of one index with the values of the other indexes to find the percentage match

Reposted. Author: 行者123. Updated: 2023-12-01 00:35:56

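I am trying to compare the values associated with one index against the values associated with the other indexes and work out the percentage match.

I have the following table: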

ColumnA    ColumnB
TestA      A
TestA      B
TestA      C
TestA      D
TestB      D
TestB      E
TestC      C
TestC      B
TestC      E
TestD      A


Index TestA has values A, B, C, D. Compared with index TestB, which has values D, E, only 1 value matches out of a possible 5 (A, B, C, D, E). Hence the match is 20%.

Index TestA has values A, B, C, D. Compared with index TestC, which has values C, B, E, only 2 values match out of a possible 5 (A, B, C, D, E). Hence the match is 40%.

Index TestA has values A, B, C, D. Compared with index TestD, which has value A, only 1 value matches out of a possible 4 (A, B, C, D). Hence the match is 25%.

Index TestB has values D, E. Compared with index TestA, which has values A, B, C, D, only 1 value matches out of a possible 5 (A, B, C, D, E). Hence the match is 20%.

Index TestB has values D, E. Compared with index TestC, which has values C, B, E, only 1 value matches out of a possible 4 (B, C, D, E). Hence the match is 25%.

...and so on...
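The rule described above (shared values divided by all distinct values in either group) can be sketched as a small helper. The function name `percent_match` is made up for illustration and does not appear in the original question; the inputs are the letters from the table.

```python
def percent_match(left, right):
    """Return shared values as a percentage of all distinct values in either group."""
    left, right = set(left), set(right)
    union = left | right
    if not union:
        # Two empty groups have nothing to compare; treating that as 0 is an assumption.
        return 0.0
    return 100 * len(left & right) / len(union)


# The comparisons worked through in the question
testa_vs_testb = percent_match("ABCD", "DE")   # 1 shared out of 5 distinct
testa_vs_testc = percent_match("ABCD", "CBE")  # 2 shared out of 5 distinct
testa_vs_testd = percent_match("ABCD", "A")    # 1 shared out of 4 distinct
testb_vs_testc = percent_match("DE", "CBE")    # 1 shared out of 4 distinct
print(testa_vs_testb, testa_vs_testc, testa_vs_testd, testb_vs_testc)
```

This ratio is commonly known as the Jaccard similarity of the two sets, expressed here as a percentage.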

The idea is to display the data in matrix format:

       TestA  TestB  TestC  TestD
TestA    100     20     40     25
TestB     20    100     25      0
TestC     40     25    100      0
TestD     25      0      0    100

The basic code I wrote iterates over the values.

import pandas as pd
from pyexcelerate import Workbook
import numpy as np
import time

start = time.process_time()
excel_file = 'Test.xlsx'
df = pd.read_excel(excel_file, sheet_name=1, index_col=0)
mylist = sorted(set(df.index))
for i in mylist:
    for j in mylist:
        # .loc[[label]] always returns a DataFrame, even for a single row;
        # get_values() was removed in pandas 1.0, so to_numpy() is used instead
        L1 = df.loc[[i]].to_numpy().ravel()
        L2 = df.loc[[j]].to_numpy().ravel()
        # L3 collects the distinct values found in either group (the union)
        L3 = []
        print(i, j)
        for m in L1:
            if m not in L3:
                L3.append(m)
        for n in L2:
            if n not in L3:
                L3.append(n)
        L3.sort()
        if i == j:
            print(len(L1) / len(L3) * 100)
        else:
            # count the values the two groups have in common
            n = 0
            for k in L1:
                for l in L2:
                    if k == l:
                        n = n + 1
            print(n / len(L3) * 100)
print(time.process_time() - start)

How can I calculate the percentages from here and display the data in the matrix format I want?

EDIT1: Updated the code, as I can now calculate the percentages. I am looking for a way to print this data in matrix format.
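One way to get from the nested loops to a printable matrix is to write each percentage into a square DataFrame instead of printing it. The sketch below rebuilds the sample table inline rather than reading `Test.xlsx`, so the frame construction and the names `groups`, `labels`, and `matrix` are assumptions made for illustration.

```python
import pandas as pd

# Sample data from the question, with ColumnA as the index (as index_col=0 would give)
df = pd.DataFrame(
    {"ColumnB": list("ABCDDECBEA")},
    index=pd.Index(
        ["TestA"] * 4 + ["TestB"] * 2 + ["TestC"] * 3 + ["TestD"], name="ColumnA"
    ),
)

# Collect the distinct ColumnB values of each index label once, up front
groups = {name: set(col) for name, col in df.groupby(level=0)["ColumnB"]}
labels = sorted(groups)

# Fill a square frame cell by cell: shared values over all distinct values
matrix = pd.DataFrame(0.0, index=labels, columns=labels)
for i in labels:
    for j in labels:
        union = groups[i] | groups[j]
        matrix.loc[i, j] = 100 * len(groups[i] & groups[j]) / len(union)

print(matrix)
```

Building the sets once per label avoids re-reading the same rows inside the inner loop, although the double loop still visits every pair of labels.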

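EDIT2: The original dataset has about 10k-odd unique entries in column A and about 15k-odd unique entries in column B. The total number of rows in the sheet is around 40. Not sure whether this matters; I just thought it would give some context.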
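With that many unique entries, a pairwise Python loop may become slow. A possible alternative, which is not part of the original question or answer, is to build a 0/1 membership table with `pd.crosstab` and get all intersection counts from one matrix product. The sketch below uses the small sample table and assumes ColumnA is an ordinary column rather than the index.

```python
import pandas as pd

# Sample data from the question, here with ColumnA as a regular column (an assumption)
df = pd.DataFrame({
    "ColumnA": ["TestA"] * 4 + ["TestB"] * 2 + ["TestC"] * 3 + ["TestD"],
    "ColumnB": list("ABCDDECBEA"),
})

# membership[i, k] is 1 when group i contains value k, else 0
membership = (pd.crosstab(df["ColumnA"], df["ColumnB"]) > 0).astype(int)
m = membership.to_numpy()

# (m @ m.T)[i, j] counts the values shared by groups i and j
intersection = m @ m.T
sizes = m.sum(axis=1)
# |A union B| = |A| + |B| - |A intersect B|, broadcast over every pair of groups
union = sizes[:, None] + sizes[None, :] - intersection

pct = pd.DataFrame(
    100 * intersection / union,
    index=membership.index,
    columns=membership.index,
)
print(pct)
```

For roughly 10k by 15k unique entries the dense membership table would hold about 150 million cells, so memory could become the limiting factor; a sparse representation might be worth considering in that case.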

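Best answer

You can use itertools to take the product of the value lists of all unique ColumnA entries, then calculate the percentage for each pair and build a new df: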

from itertools import product

import numpy as np
import pandas as pd

# for each unique element in ColumnA, build a list of its elements from ColumnB
g = (
    df.groupby('ColumnA').ColumnB
    .apply(lambda x: x.values.tolist())
)

# generate every ordered pair of lists (the product of g with itself)
prod = list(product(g, repeat=2))

data = (
    # for each pair of lists, find the number of common elements,
    # then divide by the size of the union of the 2 lists. This gives you the pct.
    np.array([len(set(e[0]).intersection(e[1])) / len(set(e[0]).union(e[1])) for e in prod])
    .reshape(len(g), -1)
)

pd.DataFrame(data * 100, index=g.index.tolist(), columns=g.index.tolist())

       TestA  TestB  TestC  TestD
TestA  100.0   20.0   40.0   25.0
TestB   20.0  100.0   25.0    0.0
TestC   40.0   25.0  100.0    0.0
TestD   25.0    0.0    0.0  100.0

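Regarding "python - Pandas: compare the values of one index with the values of the other indexes to find the percentage match", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57764494/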
