python - Pandas: get the index of the highest dot product

Reposted by: 太空宇宙 · Updated: 2023-11-03 15:00:31

I have a DataFrame like this:

import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12]})
    a   b   c
0   1   5   9
1   2   6   10
2   3   7   11
3   4   8   12

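I want to add another column to this DataFrame that stores, for each row, the index of the other row that gives the highest score when the dot product of the two rows is taken.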

For example, for the first row, we compute the dot product against every other row:

df1.drop(0).dot(df1.loc[0]).idxmax()
output: 3

So I can write a function:

def get_highest(dataframe):
    lis = []
    for row in dataframe.index:
        temp = dataframe.drop(row).dot(dataframe.loc[row])
        lis.append(temp.idxmax())
    return lis

And then I get what I want:

df1['highest'] = get_highest(df1)
output: 
    a   b   c   highest
0   1   5   9   3
1   2   6   10  3
2   3   7   11  3
3   4   8   12  2

OK, this works, but the problem is that it does not scale at all. Here are the timeit results for different numbers of rows:

4 rows: 2.87 ms
40 rows: 77.1 ms
400 rows: 700 ms
4000 rows: 10.4 s

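I need to do this on a DataFrame with about 240k rows and 3.3k columns. So my question is: is there a way to optimize this computation? (Possibly by solving it a different way.)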

Thanks in advance.

Best answer

Do a matrix multiplication with the transpose:

import numpy as np
mat_mul = np.dot(df.values, df.values.T)

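Fill the diagonal with a small number so that it can never be the maximum (I am assuming all values are positive, so I fill with -1, but you can change that):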

np.fill_diagonal(mat_mul, -1)

Now take the argmax of the array:

df['highest'] = mat_mul.argmax(axis=1)
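
The three steps above can be combined into one function. The following is a minimal sketch, not the answerer's exact code: the name `highest_dot_partner` is hypothetical, the diagonal is filled with `-np.inf` instead of -1 so that negative dot products are also handled, and the positions returned by `argmax` are mapped back to index labels, because positions and labels coincide only for a default integer index.

```python
import numpy as np
import pandas as pd


def highest_dot_partner(frame: pd.DataFrame) -> pd.Series:
    """For each row, return the index label of the other row with the largest dot product.

    Hypothetical helper; assumes all columns are numeric.
    """
    values = frame.to_numpy(dtype=float)
    # Entry (i, j) is the dot product of row i with row j.
    gram = values @ values.T
    # Exclude each row's dot product with itself.
    np.fill_diagonal(gram, -np.inf)
    # argmax returns positions, so translate them back to index labels.
    positions = gram.argmax(axis=1)
    return pd.Series(frame.index[positions].to_numpy(), index=frame.index)


df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]})
result = highest_dot_partner(df1)
print(result.tolist())  # [3, 3, 3, 2]
```

On the question's `df1` this gives the same `highest` column as `get_highest`.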

Timings for a 10k x 4 df:

%%timeit
mat_mul = np.dot(df.values, df.values.T)
np.fill_diagonal(mat_mul, -1)
df['highest'] = mat_mul.argmax(axis=1)

1 loop, best of 3: 782 ms per loop

%timeit df['highest'] = get_highest(df)
1 loop, best of 3: 9.8 s per loop
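
One caveat for the size mentioned in the question: with about 240k rows, the full row-by-row product matrix would have roughly 240,000² entries, which is on the order of 460 GB as float64, so it probably cannot be built in one piece. A possible workaround, sketched here under that assumption, is to process the rows in blocks and keep only the argmax of each block. The function name `highest_dot_partner_chunked` and the `chunk_size` parameter are hypothetical, and the block size would need tuning to the available memory.

```python
import numpy as np
import pandas as pd


def highest_dot_partner_chunked(frame: pd.DataFrame, chunk_size: int = 1000) -> pd.Series:
    """Blockwise variant: never holds more than chunk_size x n_rows dot products at once.

    Hypothetical helper; assumes all columns are numeric.
    """
    values = frame.to_numpy(dtype=float)
    n_rows = values.shape[0]
    positions = np.empty(n_rows, dtype=np.intp)
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Dot products of this block of rows against every row: shape (stop - start, n_rows).
        block = values[start:stop] @ values.T
        # Row i of the block is global row start + i, so its self-product sits in column start + i.
        local = np.arange(stop - start)
        block[local, start + local] = -np.inf
        positions[start:stop] = block.argmax(axis=1)
    # Map positions back to index labels.
    return pd.Series(frame.index[positions].to_numpy(), index=frame.index)


df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8], 'c': [9, 10, 11, 12]})
chunked = highest_dot_partner_chunked(df1, chunk_size=3)
print(chunked.tolist())  # [3, 3, 3, 2]
```

Each block still costs `chunk_size` × n_rows floats, so a smaller `chunk_size` trades speed for memory.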

Regarding python - Pandas: get the index of the highest dot product, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38354213/