
python - Pandas matrix calculation up to the diagonal

Reposted. Author: 行者123. Updated: 2023-12-03 13:45:35

I am doing a matrix calculation with pandas in Python.
My raw data is in the form of a list of strings per row (the list is unique for each row).

id     list_of_value
0 ['a','b','c']
1 ['d','b','c']
2 ['a','b','c']
3 ['a','b','c']
I have to score each row against all other rows.
Score calculation algorithm:
Step 1: Take value of id 0: ['a','b','c'],
Step 2: find the intersection between id 0 and id 1 ,
resultant = ['b','c']
Step 3: Score Calculation => resultant.size / id(0).size
Repeat steps 2 and 3 between id 0 and ids 1, 2, 3, and then likewise for every other id.
Create an N * N matrix:
-  0    1    2  3
0 1 0.6 1 1
1 0.6 1 1 1
2 1 1 1 1
3 1 1 1 1
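The steps above can be sketched in plain Python. This is only an illustration, not the asker's code: the helper name `score` is hypothetical, and it assumes that duplicates within a list are ignored and that the divisor is the size of the row being scored (the "id(0).size" of step 3).

```python
def score(base, other):
    """Fraction of the distinct values in `base` that also occur in `other`."""
    base_set = set(base)  # assumption: duplicates inside a list do not count
    return len(base_set & set(other)) / len(base_set)

rows = [['a', 'b', 'c'], ['d', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]

# matrix[i][j] is the score of row i against row j (steps 2 and 3 for every pair)
matrix = [[score(a, b) for b in rows] for a in rows]

for line in matrix:
    print([round(value, 3) for value in line])
```

With these steps, two of the three values of id 0 are shared with id 1, so that cell comes out as 2/3.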
Currently I am using the pandas dummies approach to calculate the score:
s = pd.get_dummies(df.list_of_value.explode()).sum(level=0)
s.dot(s.T).div(s.sum(1))
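Note that the `level` argument of `sum` was removed in pandas 2.0. A possible equivalent for newer versions, sketched here with `groupby(level=0).sum()` on the sample data from the question (the variable names `dummies` and `scores` are only illustrative), might look like this:

```python
import pandas as pd

df = pd.DataFrame({"list_of_value": [['a', 'b', 'c'], ['d', 'b', 'c'],
                                     ['a', 'b', 'c'], ['a', 'b', 'c']]})

# one indicator column per distinct value, one row per exploded list element
dummies = pd.get_dummies(df["list_of_value"].explode(), dtype=int)
# collapse back to one row per original id (stands in for .sum(level=0))
s = dummies.groupby(level=0).sum()
# counts of shared values, divided column-wise by each list's number of values
scores = s.dot(s.T).div(s.sum(axis=1))
print(scores)
```

This still computes every cell of the N * N matrix, including the half that mirrors the other.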
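But the calculation is repeated past the diagonal of the matrix; computing the scores up to the diagonal is sufficient. For example:
The score for ID 0 only needs to be computed up to ID(row, column) (0,0); the scores at ID(row, column) (0,1), (0,2), (0,3) can be copied from ID(row, column) (1,0), (2,0), (3,0).
Calculation in detail:
(image: matrix sample)
I need to calculate only up to the diagonal, i.e. up to the yellow boxes (the diagonal of the matrix). The white values have already been calculated in the green shaded region (shown for reference); I just need to transpose the green shaded region onto the white one.
How can I do this in pandas?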

Best answer

First, here is a profile of your code: first each command separately, then the code as you posted it.

%timeit df.list_of_value.explode()
%timeit pd.get_dummies(s)
%timeit s.sum(level=0)
%timeit s.dot(s.T)
%timeit s.sum(1)
%timeit s2.div(s3)
The profiling above returned the following results:
Explode   : 1000 loops, best of 3: 201 µs per loop
Dummies : 1000 loops, best of 3: 697 µs per loop
Sum : 1000 loops, best of 3: 1.36 ms per loop
Dot : 1000 loops, best of 3: 453 µs per loop
Sum2 : 10000 loops, best of 3: 162 µs per loop
Divide : 100 loops, best of 3: 1.81 ms per loop
Running both lines together results in:
100 loops, best of 3: 5.35 ms per loop
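`%timeit` is an IPython magic. Outside IPython, a comparable "best of 3" measurement can be sketched with the standard `timeit` module. The function `full_scores` below is a hypothetical plain-Python stand-in for the full N * N calculation, not the code that produced the timings above.

```python
import timeit

rows = [['a', 'b', 'c'], ['d', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']]

def full_scores(lists):
    """Score every (row, column) pair, including the redundant half."""
    sets = [set(values) for values in lists]
    return [[len(a & b) / len(a) for b in sets] for a in sets]

runs = 1000
# three repetitions of `runs` calls each; the minimum is the least disturbed one
timings = timeit.repeat(lambda: full_scores(rows), number=runs, repeat=3)
best_per_loop = min(timings) / runs
print(f"{runs} loops, best of 3: {best_per_loop * 1e6:.1f} µs per loop")
```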
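Using a different approach that relies less on pandas functions (which are sometimes expensive), I came up with code that takes only about one third of the time by skipping the calculation of the upper triangular matrix and the diagonal.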
import numpy as np
import pandas as pd

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))
for i in range(len(df)):
    d0 = set(df.iloc[i].list_of_value)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(df)):
        df2[j, i] = len(d0.intersection(df.iloc[j].list_of_value)) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(df))])
Using df as
df = pd.DataFrame(
    [[['a','b','c']],
     [['d','b','c']],
     [['a','b','c']],
     [['a','b','c']]],
    columns = ["list_of_value"])
Profiling this code results in a running time of only 1.68 ms.
1000 loops, best of 3: 1.68 ms per loop
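The mirroring step on its own can be illustrated on a small array. The sketch below uses `np.tril_indices` instead of `np.mask_indices`; that is a substitution made here for readability, not what the answer's code uses, and the sample values are made up.

```python
import numpy as np

# only the diagonal and the lower triangle hold computed values
m = np.array([[1.00, 0.00, 0.00],
              [0.50, 1.00, 0.00],
              [0.25, 0.75, 1.00]])

# indices strictly below the diagonal (k=-1 leaves the diagonal out)
rows, cols = np.tril_indices(len(m), k=-1)
# write each lower-triangle value into its mirrored upper-triangle cell
m[cols, rows] = m[rows, cols]
print(m)
```

Mirroring like this only reproduces the full calculation when the score is symmetric, which holds when all lists have the same number of distinct values, as in the question's data.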
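Update
Instead of operating on the whole DataFrame, selecting only the Series that is needed gives a huge speedup.
Three ways of iterating over the entries of the Series were tested, and all of them perform about equally.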
%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))

# get the Series from the DataFrame
dfl = df.list_of_value

for i, d0 in enumerate(dfl.values):
# for i, d0 in dfl.items(): # (iteritems() before pandas 2.0) in terms of performance about equal to the line above
# for i in range(len(dfl)): # slightly less performant than enumerate(dfl.values)
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl.iloc[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])
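Pandas has many pitfalls. For example, always access the rows of a DataFrame or Series by position via df.iloc[0] rather than df[0]. On a Series with the default integer index both work, but df.iloc[0] is faster. Note that on a DataFrame, df[0] selects the column labeled 0 rather than the first row.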
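Besides speed, positional and label-based access can return different elements once the index is not the default 0..N-1. A small sketch with a made-up Series:

```python
import pandas as pd

# the index labels are deliberately in reverse order
s = pd.Series(["x", "y", "z"], index=[2, 1, 0])

first_by_position = s.iloc[0]  # first element, regardless of its label
by_label = s.loc[0]            # element whose index label is 0

print(first_by_position, by_label)
```

Here `s.iloc[0]` yields "x" while `s.loc[0]` yields "z", so being explicit about which one is meant avoids surprises.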
Timing the first matrix, with 4 elements each of size 3, showed a speedup of more than 3 times.
1000 loops, best of 3: 443 µs per loop
With a larger dataset I got far better results, a speedup of more than 11 times:
# operating on the DataFrame
10 loop, best of 3: 565 ms per loop

# operating on the Series
10 loops, best of 3: 47.7 ms per loop
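Update 2
Not using pandas at all during the calculation gives another significant speedup. To do so, simply convert the column you want to operate on into a list.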
%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# convert the column of the DataFrame to a list
dfl = list(df.list_of_value)

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))

for i, d0 in enumerate(dfl):
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])
On the data provided in the question, the result is only slightly better than with the first update.
1000 loops, best of 3: 363 µs per loop
But with larger data (100 rows, lists of size 15) the advantage becomes obvious:
100 loops, best of 3: 5.26 ms per loop
Here is a comparison of all suggested methods:
+----------+-----------------------------------------+
| | Using the Dataset from the question |
+----------+-----------------------------------------+
| Question | 100 loops, best of 3: 4.63 ms per loop |
+----------+-----------------------------------------+
| Answer | 1000 loops, best of 3: 1.59 ms per loop |
+----------+-----------------------------------------+
| Update 1 | 1000 loops, best of 3: 447 µs per loop |
+----------+-----------------------------------------+
| Update 2 | 1000 loops, best of 3: 362 µs per loop |
+----------+-----------------------------------------+
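All variants above copy the lower triangle to the upper one, which matches the full calculation only when every list has the same number of distinct values. If the lists can differ in length, one possible variation (a sketch, not part of the original answer; the name `pairwise_scores` is hypothetical, and non-empty lists are assumed) still computes each intersection only once but divides by the appropriate length for each of the two cells:

```python
from itertools import combinations

def pairwise_scores(lists):
    """Return an N x N list of scores; cell [j][i] is |i ∩ j| / |i|."""
    sets = [set(values) for values in lists]  # assumes no list is empty
    n = len(sets)
    # the diagonal is always 1.0, so start from a matrix of ones
    scores = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        common = len(sets[i] & sets[j])      # one intersection per pair
        scores[j][i] = common / len(sets[i])  # lower triangle
        scores[i][j] = common / len(sets[j])  # upper triangle, own divisor
    return scores

example = pairwise_scores([['a', 'b', 'c'], ['b', 'c'], ['a', 'b', 'c', 'd']])
for line in example:
    print([round(value, 3) for value in line])
```

For equal-length lists this yields the same matrix as mirroring, at the cost of one extra division per pair.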

Regarding python - Pandas matrix calculation up to the diagonal, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62552992/
