
python - Efficient way to add values to Pandas slices

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:55:00

I want to add values to slices of a pandas DataFrame efficiently, because this function is called very often. The structure looks like this:

import pandas as pd
import numpy as np

names = ["a", "b", "c", "d", "e", "f"]

mat = pd.DataFrame(0.0, index=names, columns=names)

# now comes the `tricky' part
positive_instances = ["a", "e", "c"]
negative_instances = ["d", "b", "f"]

p_mat = np.array([[1.,2.],[3.,4.]])

mat.loc[positive_instances, positive_instances] += p_mat[0,0]
mat.loc[positive_instances, negative_instances] += p_mat[0,1]
mat.loc[negative_instances, positive_instances] += p_mat[1,0]
mat.loc[negative_instances, negative_instances] += p_mat[1,1]

The desired new matrix mat looks like this:

mat = 
a b c d e f
a 1 2 1 2 1 2
b 3 4 3 4 3 4
c 1 2 1 2 1 2
d 3 4 3 4 3 4
e 1 2 1 2 1 2
f 3 4 3 4 3 4

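Note: this construct is embedded in a for loop, and there are several different sets of positive and negative instances. Additional information about the data structure:

  • positive_instances and negative_instances are always disjoint and need not have the same length
  • The union of positive_instances and negative_instances is always names
  • positive_instances always corresponds to index 0 of p_mat and negative_instances always corresponds to index 1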
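
The first two properties above can be written down as a small check. This is only a sketch: the helper name check_partition is hypothetical and does not appear in the original question.

```python
def check_partition(names, positive_instances, negative_instances):
    """Return True if the two lists are disjoint and together cover names.

    Hypothetical helper, not part of the original post.
    """
    pos, neg = set(positive_instances), set(negative_instances)
    disjoint = pos.isdisjoint(neg)
    covers = (pos | neg) == set(names)
    return disjoint and covers


names = ["a", "b", "c", "d", "e", "f"]
ok = check_partition(names, ["a", "e", "c"], ["d", "b", "f"])
bad = check_partition(names, ["a", "e", "c"], ["d", "b"])  # "f" is missing
```

Running such a check once outside the hot loop would be enough if the inputs are trusted afterwards.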

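I suspect there is a more efficient way to achieve this. Any help would be appreciated.

Edit: corrected the variable names in the code and added the desired output.

Edit2: added information about the nature of positive_instances and negative_instances.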

Best answer

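We can use NumPy here to assign the values into an array efficiently, using its broadcasted indexing through np.ix_, which emulates the behavior of .loc[row, col] in pandas. Once the assignments are done, we create the output DataFrame from the array.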
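
To illustrate the point about np.ix_, here is a minimal sketch on a 4x4 toy array (the labels and values are made up for illustration). It compares block assignment through np.ix_ with .loc on an equivalent DataFrame, and with plain paired fancy indexing.

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c", "d"]
arr = np.zeros((4, 4))
df = pd.DataFrame(0.0, index=labels, columns=labels)

rows = np.array([0, 2])   # positions of "a" and "c"
cols = np.array([1, 3])   # positions of "b" and "d"

# np.ix_ builds an open mesh, so the assignment covers the full
# 2x2 block of every (row, col) combination, like .loc[rows, cols].
arr[np.ix_(rows, cols)] = 5.0
df.loc[["a", "c"], ["b", "d"]] = 5.0

# Plain paired fancy indexing only touches cells (0, 1) and (2, 3).
paired = np.zeros((4, 4))
paired[rows, cols] = 5.0
```

Here arr and df hold the same values, while paired has only two cells set.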

So the implementation would look like this -

sidx = np.argsort(names)
p_idx = sidx[np.searchsorted(names, positive_instances, sorter= sidx)]
n_idx = sidx[np.searchsorted(names, negative_instances, sorter= sidx)]

n = len(names)
arr = np.zeros((n,n),dtype=p_mat.dtype)
arr[np.ix_(p_idx, p_idx)] = +p_mat[0,0]
arr[np.ix_(p_idx, n_idx)] = +p_mat[0,1]
arr[np.ix_(n_idx, p_idx)] = +p_mat[1,0]
arr[np.ix_(n_idx, n_idx)] = +p_mat[1,1]

df = pd.DataFrame(arr, index=names, columns=names)
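
The argsort/searchsorted pair in the code above converts labels into integer positions. The following sketch walks through that step on an unsorted names list. The pd.Index.get_indexer call is an alternative added here for comparison and is not part of the original answer.

```python
import numpy as np
import pandas as pd

names = ["a", "f", "d", "b", "c", "e"]   # deliberately unsorted
positive_instances = ["a", "e", "c"]

# sidx lists the positions of names in sorted order.
sidx = np.argsort(names)
# searchsorted finds each label in the sorted view; indexing sidx
# with the result maps it back to a position in the original list.
p_idx = sidx[np.searchsorted(names, positive_instances, sorter=sidx)]

# Alternative lookup through a pandas Index (not from the original answer).
p_idx_alt = pd.Index(names).get_indexer(positive_instances)

# Both give the positions of "a", "e", "c" in names.
recovered = [names[i] for i in p_idx]
```

Note that searchsorted assumes every label is present in names; for a missing label it silently returns an insertion point, whereas get_indexer returns -1.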

Runtime test -

Approaches:

def func0(p_mat, names, positive_instances, negative_instances):
    # Original approach: label-based updates on a DataFrame with .loc
    mat = pd.DataFrame(0.0, index=names, columns=names)

    mat.loc[positive_instances, positive_instances] += p_mat[0,0]
    mat.loc[positive_instances, negative_instances] += p_mat[0,1]
    mat.loc[negative_instances, positive_instances] += p_mat[1,0]
    mat.loc[negative_instances, negative_instances] += p_mat[1,1]
    return mat

def func1(p_mat, names, positive_instances, negative_instances):
    # Proposed approach: translate labels to positions, fill a NumPy array
    sidx = np.argsort(names)
    p_idx = sidx[np.searchsorted(names, positive_instances, sorter= sidx)]
    n_idx = sidx[np.searchsorted(names, negative_instances, sorter= sidx)]

    n = len(names)
    arr = np.zeros((n,n),dtype=p_mat.dtype)
    arr[np.ix_(p_idx, p_idx)] = +p_mat[0,0]
    arr[np.ix_(p_idx, n_idx)] = +p_mat[0,1]
    arr[np.ix_(n_idx, p_idx)] = +p_mat[1,0]
    arr[np.ix_(n_idx, n_idx)] = +p_mat[1,1]

    df = pd.DataFrame(arr, index=names, columns=names)
    return df
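
Before comparing timings, it can be worth confirming that the two approaches return the same values. The sketch below repeats both functions so that it runs on its own, and uses the same unsorted names list as the timing session. The comparison itself is an addition and not part of the original answer.

```python
import numpy as np
import pandas as pd


def func0(p_mat, names, positive_instances, negative_instances):
    # Label-based updates with .loc
    mat = pd.DataFrame(0.0, index=names, columns=names)
    mat.loc[positive_instances, positive_instances] += p_mat[0, 0]
    mat.loc[positive_instances, negative_instances] += p_mat[0, 1]
    mat.loc[negative_instances, positive_instances] += p_mat[1, 0]
    mat.loc[negative_instances, negative_instances] += p_mat[1, 1]
    return mat


def func1(p_mat, names, positive_instances, negative_instances):
    # Position-based assignment on a NumPy array
    sidx = np.argsort(names)
    p_idx = sidx[np.searchsorted(names, positive_instances, sorter=sidx)]
    n_idx = sidx[np.searchsorted(names, negative_instances, sorter=sidx)]
    n = len(names)
    arr = np.zeros((n, n), dtype=p_mat.dtype)
    arr[np.ix_(p_idx, p_idx)] = p_mat[0, 0]
    arr[np.ix_(p_idx, n_idx)] = p_mat[0, 1]
    arr[np.ix_(n_idx, p_idx)] = p_mat[1, 0]
    arr[np.ix_(n_idx, n_idx)] = p_mat[1, 1]
    return pd.DataFrame(arr, index=names, columns=names)


names = ["a", "f", "d", "b", "c", "e"]
positive_instances = ["a", "e", "c"]
negative_instances = ["d", "b", "f"]
p_mat = np.array([[1., 2.], [3., 4.]])

out0 = func0(p_mat, names, positive_instances, negative_instances)
out1 = func1(p_mat, names, positive_instances, negative_instances)

# Compare the underlying values cell by cell.
same = np.array_equal(out0.to_numpy(), out1.to_numpy())
```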

Timings -

In [109]: names = ["a", "f", "d","b", "c",  "e"]
...:
...: # now comes the `tricky' part
...: positive_instances = ["a", "e", "c"]
...: negative_instances = ["d", "b", "f"]
...:
...: p_mat = np.array([[1.,2.],[3.,4.]])
...:

In [110]: %timeit func0(p_mat, names, positive_instances, negative_instances)
100 loops, best of 3: 4.87 ms per loop

In [111]: %timeit func1(p_mat, names, positive_instances, negative_instances)
10000 loops, best of 3: 189 µs per loop

In [112]: 4870.0/189
Out[112]: 25.767195767195766

That is a 25x+ speedup!
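
Because the question states that the two lists are disjoint and together cover names, every name belongs to exactly one group. Under that assumption, a further variation is to build a 0/1 group vector and index p_mat with it directly. This is a sketch that is not part of the original answer and was not timed here, so no speed claim is made for it.

```python
import numpy as np
import pandas as pd

names = ["a", "b", "c", "d", "e", "f"]
positive_instances = ["a", "e", "c"]
p_mat = np.array([[1., 2.], [3., 4.]])

# group[i] is 0 if names[i] is positive and 1 otherwise, which means
# negative under the assumption that the two lists partition names.
group = (~np.isin(names, positive_instances)).astype(int)

# Broadcasting the row groups against the column groups picks
# p_mat[group[i], group[j]] for every cell (i, j).
arr = p_mat[group[:, None], group]
df = pd.DataFrame(arr, index=names, columns=names)
```

For the names order used in the question, this reproduces the desired matrix with alternating rows 1 2 1 2 1 2 and 3 4 3 4 3 4.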

Regarding python - Efficient way to add values to Pandas slices, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42672856/
