python - Pandas matrix calculation up to the diagonal

Reposted by: 行者123  Updated: 2023-12-03 13:45:35

I am doing a matrix calculation using pandas in Python.
My raw data is in the form of lists of strings (unique for each row).

id     list_of_value
0      ['a','b','c']
1      ['d','b','c']
2      ['a','b','c']
3      ['a','b','c']

I have to calculate a score for one row against all the other rows.
Score calculation algorithm:

Step 1: Take value of id 0: ['a','b','c'],
Step 2: find the intersection between id 0 and id 1 , 
        resultant = ['b','c']
Step 3: Score Calculation => resultant.size / id(0).size
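
The three steps above can be sketched for ids 0 and 1 as follows. This is only an illustration: the function name `score` is an assumption and does not appear in the question, and the lists are assumed to be non-empty and free of duplicate values, so the size of the set equals the size of the list.

```python
def score(reference, other):
    """Share of the reference list's values that also occur in `other`."""
    reference_set = set(reference)                  # Step 1: values of the reference id
    resultant = reference_set.intersection(other)   # Step 2: intersection with the other id
    return len(resultant) / len(reference_set)      # Step 3: resultant.size / reference.size


id0 = ['a', 'b', 'c']
id1 = ['d', 'b', 'c']
print(score(id0, id1))  # 2 shared values ('b', 'c') out of 3
```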

Repeat steps 2 and 3 between id 0 and ids 1, 2, 3, and do the same for every other id.
Create an N x N matrix:

-  0     1     2     3
0  1     0.67  1     1
1  0.67  1     0.67  0.67
2  1     0.67  1     1
3  1     0.67  1     1

Currently I am using the pandas get_dummies approach to calculate the score:

# originally written with .sum(level=0), which was removed in pandas 2.0
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
s.dot(s.T).div(s.sum(1))
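
For readers unfamiliar with this trick, the two lines above can be unpacked step by step. The following is a sketch on the question's data, not the asker's exact code: the intermediate names (`indicators`, `shared`, `sizes`, `scores`) are chosen here for illustration, and the per-id sum is written as `groupby(level=0).sum()`.

```python
import pandas as pd

df = pd.DataFrame({"list_of_value": [['a', 'b', 'c'],
                                     ['d', 'b', 'c'],
                                     ['a', 'b', 'c'],
                                     ['a', 'b', 'c']]})

# explode() yields one row per (id, value) pair; get_dummies() one-hot encodes the values;
# summing per original index collapses this to one indicator row per id
indicators = pd.get_dummies(df["list_of_value"].explode()).groupby(level=0).sum()

# entry (i, j) counts the values shared by id i and id j
shared = indicators.dot(indicators.T)

# number of values per id
sizes = indicators.sum(axis=1)

# div() aligns `sizes` with the column labels, so column j is divided by the size of id j
scores = shared.div(sizes)
print(scores.round(2))
```

Every entry of the N x N result is computed here, including the upper triangle, which is the repetition the question wants to avoid.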

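However, the calculation is repeated beyond the diagonal of the matrix; calculating the scores up to the diagonal is sufficient. For example:
For ID 0, only the entry at (row, column) (0,0) needs to be calculated; the scores at (0,1), (0,2), (0,3) can be copied from (1,0), (2,0), (3,0).
Details of the calculation:

I need to calculate only up to the diagonal, i.e. up to the yellow boxes (the diagonal of the matrix). The white values have already been calculated in the green shaded area (shown for reference); I just have to transpose the green shaded area into the white one.
How can I do this in pandas?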

Best answer

First of all, here is a profiling of your code: first every command separately, then the code as you posted it.

%timeit df.list_of_value.explode()
%timeit pd.get_dummies(s)
%timeit s.sum(level=0)
%timeit s.dot(s.T)
%timeit s.sum(1)
%timeit s2.div(s3)

The profiling above returned the following results:

Explode   : 1000 loops, best of 3: 201 µs per loop
Dummies   : 1000 loops, best of 3: 697 µs per loop
Sum       : 1000 loops, best of 3: 1.36 ms per loop
Dot       : 1000 loops, best of 3: 453 µs per loop
Sum2      : 10000 loops, best of 3: 162 µs per loop
Divide    : 100 loops, best of 3: 1.81 ms per loop

Running your two lines together results in:

100 loops, best of 3: 5.35 ms per loop

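Using a different approach that relies less on the (sometimes expensive) functionality of pandas, I created code that takes only about one third of the time by skipping the calculation for the upper triangular matrix and the diagonal.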

import numpy as np
import pandas as pd

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))
for i in range(len(df)):
    d0 = set(df.iloc[i].list_of_value)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(df)):
        df2[j, i] = len(d0.intersection(df.iloc[j].list_of_value)) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(df))])

With df defined as

df = pd.DataFrame(
    [[['a','b','c']],
     [['d','b','c']],
     [['a','b','c']],
     [['a','b','c']]],
     columns = ["list_of_value"])

Profiling this code results in a running time of only 1.68 ms.

1000 loops, best of 3: 1.68 ms per loop
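
The mirroring step in the code above can be checked in isolation. The sketch below fills the strict lower triangle of a small matrix with made-up placeholder values (not real scores) and copies them across the diagonal. It uses `np.triu_indices`, which returns the same index arrays as `np.mask_indices(n, np.triu)` when `k=0`; here `k=1` is passed to skip the diagonal, which already holds ones.

```python
import numpy as np

n = 4
m = np.ones((n, n))

# placeholder values for the strict lower triangle (row-major order: (1,0), (2,0), (2,1), ...)
m[np.tril_indices(n, k=-1)] = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# indices of the strict upper triangle; m.T[upper] reads the mirrored lower-triangle entries
upper = np.triu_indices(n, k=1)
m[upper] = m.T[upper]

print(m)
print(np.allclose(m, m.T))  # the matrix is now symmetric
```

Note that mirroring makes entry (i, j) equal to entry (j, i), so it reproduces the full calculation only when the score is symmetric, for example when all lists have the same number of values, as in the question's data.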

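Update
Instead of operating on the whole DataFrame, picking just the Series that is needed gives a large speedup.
Three ways of iterating over the entries of the Series were tested, and all of them perform more or less equally.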

%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))

# get the Series from the DataFrame
dfl = df.list_of_value

for i, d0 in enumerate(dfl.values):
# for i, d0 in dfl.items():  # (iteritems() before pandas 2.0) in terms of performance about equal to the line above
# for i in range(len(dfl)): # slightly less performant than enumerate(dfl.values)
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl.iloc[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])

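There are a lot of pitfalls with pandas. For example, always access the rows of a DataFrame or Series via df.iloc[0] instead of df[0]. Both work on a Series with the default integer index, but df.iloc[0] is faster. (On a DataFrame, df[0] selects the column labeled 0 rather than the first row.)
The timings for the first matrix, with 4 elements of size 3 each, show a speedup of about 3 times.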
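
Apart from speed, the two access styles also differ in meaning, which is easy to see once the index is not the default 0..N-1. The following sketch uses a made-up Series (the index labels 10 and 20 are assumptions for illustration): `iloc` is always positional, while plain brackets on an integer index look up a label.

```python
import pandas as pd

# hypothetical Series with a non-default integer index
s = pd.Series([['a', 'b', 'c'], ['d', 'b', 'c']], index=[10, 20])

first_by_position = s.iloc[0]   # first entry, regardless of the index labels
first_by_label = s[10]          # entry whose label is 10

try:
    s[0]                        # there is no label 0 in this index
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position, first_by_label, label_zero_found)
```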

1000 loops, best of 3: 443 µs per loop

With a larger dataset I got far better results, with a speedup of more than 11:

# operating on the DataFrame
10 loops, best of 3: 565 ms per loop

# operating on the Series
10 loops, best of 3: 47.7 ms per loop

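Update 2
When pandas is not used at all during the calculation, you get another significant speedup. To do so, simply convert the column you want to operate on into a list.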

%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])

# convert the column of the DataFrame to a list
dfl = list(df.list_of_value)

# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))

for i, d0 in enumerate(dfl):
    d0 = set(d0)
    d0_len = len(d0)
    # the inner loop starts at i+1 because we don't need to calculate the diagonal
    for j in range(i + 1, len(dfl)):
        df2[j, i] = len(d0.intersection(dfl[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])

On the data provided in the question we only see a slightly better result compared to the first update.

1000 loops, best of 3: 363 µs per loop

But with bigger data (100 rows, lists of size 15) the advantage becomes obvious:

100 loops, best of 3: 5.26 ms per loop
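
For reuse outside a notebook cell, the Update 2 code can be wrapped in a function. The following is a sketch under stated assumptions rather than part of the original answer: the name `score_matrix` is hypothetical, every list is assumed to be non-empty, and each set is built once up front instead of once per outer iteration, which is a small change compared with the code above.

```python
import numpy as np
import pandas as pd


def score_matrix(lists):
    """Compute the scores below the diagonal and mirror them to the upper triangle."""
    sets = [set(values) for values in lists]   # build every set only once
    n = len(sets)
    out = np.ones((n, n))                      # the diagonal is already 1
    for i in range(n):
        size_i = len(sets[i])
        # start at i + 1: the diagonal and the upper triangle are skipped
        for j in range(i + 1, n):
            out[j, i] = len(sets[i] & sets[j]) / size_i
    # copy the strict lower triangle to the strict upper triangle
    upper = np.triu_indices(n, k=1)
    out[upper] = out.T[upper]
    return pd.DataFrame(out, columns=[f"score{i}" for i in range(n)])


df = pd.DataFrame({"list_of_value": [['a', 'b', 'c'],
                                     ['d', 'b', 'c'],
                                     ['a', 'b', 'c'],
                                     ['a', 'b', 'c']]})
scores = score_matrix(df["list_of_value"].tolist())
print(scores.round(2))
```

As with the original code, the mirrored result matches a full pairwise calculation only when the score is symmetric, which holds when all lists contain the same number of distinct values.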

Here is a comparison of all the suggested methods:

+----------+-----------------------------------------+
|          | Using the Dataset from the question     |
+----------+-----------------------------------------+
| Question | 100 loops, best of 3: 4.63 ms per loop  |
+----------+-----------------------------------------+
| Answer   | 1000 loops, best of 3: 1.59 ms per loop |
+----------+-----------------------------------------+
| Update 1 | 1000 loops, best of 3: 447 µs per loop  |
+----------+-----------------------------------------+
| Update 2 | 1000 loops, best of 3: 362 µs per loop  |
+----------+-----------------------------------------+

Regarding python - Pandas matrix calculation up to the diagonal, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62552992/
