Python: generating a rolling window to compute correlation

Reposted. Author: 行者123. Updated: 2023-11-28 22:23:55

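I have a pandas DataFrame (97165 rows and 2 columns), and I want to compute and save the correlation between the two columns over every window of 100 rows. I want something like this:

First correlation --> rows 0 to 100 --> corr = 0.265

Second correlation --> rows 1 to 101 --> corr = 0.279

Third correlation --> rows 2 to 102 --> corr = 0.287

Every value has to be stored and shown in a plot, so I need to keep all of them in a list or something similar.

I have been reading the pandas documentation on rolling windows (pandas rolling window), but I could not get anything to work. I tried writing a simple loop to get some results, but I ran into memory problems. The code I tried is: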

    lcl = 100
    a = []
    for i in range(len(tabla)):
        x = tabla.iloc[i:lcl, [0]]
        y = tabla.iloc[i:lcl, [1]]
        z = x['2015_Avion'].corr(y['2015_Hotel'])
        a.append(z)
        lcl += 1

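Any suggestions?

Best answer

We can optimize both memory use and performance by working on the underlying array data.

Approach #1

First, let's write an array-based function that returns the correlation coefficient between the corresponding elements of two 1D arrays. It is essentially inspired by this post and looks like this: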

    import numpy as np

    def corrcoeff_1d(A, B):
        # Subtract each input array's mean from the array itself
        A_mA = A - A.mean(-1, keepdims=True)
        B_mB = B - B.mean(-1, keepdims=True)

        # Sum of squares
        ssA = np.einsum('i,i->', A_mA, A_mA)
        ssB = np.einsum('i,i->', B_mB, B_mB)

        # Finally get corr coeff
        return np.einsum('i,i->', A_mA, B_mB) / np.sqrt(ssA * ssB)

Now, to use it, run the same loop on the array data:

    lcl = 100
    ar = tabla.values
    N = len(ar)
    out = np.zeros(N)
    for i in range(N):
        # Windows that start within the last lcl-1 rows are shorter than lcl
        out[i] = corrcoeff_1d(ar[i:i+lcl, 0], ar[i:i+lcl, 1])
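
As a sanity check on the loop above, its full-length windows can be compared with pandas' own rolling correlation. The sketch below is only an illustration: the random data and the shortened frame of 500 rows are assumptions, while the column names are taken from the question. Note that pandas labels each window by its last row, whereas the loop labels it by its first row, so the pandas result has to be shifted by lcl-1.

```python
import numpy as np
import pandas as pd

def corrcoeff_1d(A, B):
    # Same computation as in the answer, written with dot products
    A_mA = A - A.mean()
    B_mB = B - B.mean()
    return A_mA.dot(B_mB) / np.sqrt(A_mA.dot(A_mA) * B_mB.dot(B_mB))

# Hypothetical stand-in for the asker's DataFrame (random values)
rng = np.random.default_rng(0)
tabla = pd.DataFrame(rng.random((500, 2)), columns=["2015_Avion", "2015_Hotel"])

lcl = 100
ar = tabla.to_numpy()
n_full = len(ar) - lcl + 1  # number of windows with exactly lcl rows

# Loop version, restricted to full-length windows
loop_out = np.array([corrcoeff_1d(ar[i:i + lcl, 0], ar[i:i + lcl, 1])
                     for i in range(n_full)])

# pandas version: the value at row k covers rows k-lcl+1 .. k,
# so the first lcl-1 entries are NaN and are dropped here
rolled = tabla["2015_Avion"].rolling(lcl).corr(tabla["2015_Hotel"])
pandas_out = rolled.to_numpy()[lcl - 1:]

print(np.allclose(loop_out, pandas_out, atol=1e-6))
```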

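We could optimize performance further by using convolution to precompute the rolling means that corrcoeff_1d needs for A_mA, but let's get rid of the memory error first.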
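
The convolution idea mentioned above can be sketched as follows. This is a minimal illustration, not part of the original answer: rolling_mean is a hypothetical helper name, and it only shows how np.convolve yields the per-window means that would otherwise be recomputed inside every call.

```python
import numpy as np

def rolling_mean(a, window):
    # Averaging kernel; mode="valid" keeps only the positions where the
    # kernel fully overlaps the data, i.e. one value per full window
    kernel = np.ones(window) / window
    return np.convolve(a, kernel, mode="valid")

a = np.arange(10, dtype=float)
means = rolling_mean(a, 4)

# Reference: mean of each window computed directly
direct = np.array([a[i:i + 4].mean() for i in range(len(a) - 4 + 1)])
print(np.allclose(means, direct))
```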

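Approach #2

This is an almost fully vectorized approach: we vectorize most of the iterations, leaving only the remaining slices at the end that do not have a full window length. The loop count drops from 97165 to lcl-1, i.e. just 99.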

    lcl = 100
    ar = tabla.values
    N = len(ar)
    out = np.zeros(N)

    # One row per full-length window, built as strided views of each column
    col0_win = strided_app(ar[:, 0], lcl, S=1)
    col1_win = strided_app(ar[:, 1], lcl, S=1)
    vectorized_out = corr2_coeff_rowwise(col0_win, col1_win)
    M = len(vectorized_out)
    out[:M] = vectorized_out

    # Remaining windows at the end, each shorter than lcl
    for i in range(M, N):
        out[i] = corrcoeff_1d(ar[i:i+lcl, 0], ar[i:i+lcl, 1])

Helper functions:

    # https://stackoverflow.com/a/40085052/ @ Divakar
    def strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
        nrows = ((a.size - L) // S) + 1
        n = a.strides[0]
        return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))

    # https://stackoverflow.com/a/41703623/ @Divakar
    def corr2_coeff_rowwise(A, B):
        # Rowwise mean of input arrays & subtract from input arrays themselves
        A_mA = A - A.mean(-1, keepdims=True)
        B_mB = B - B.mean(-1, keepdims=True)

        # Sum of squares across rows
        ssA = np.einsum('ij,ij->i', A_mA, A_mA)
        ssB = np.einsum('ij,ij->i', B_mB, B_mB)

        # Finally get corr coeff
        return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)
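
To make the windowing concrete, here is a small sketch of what strided_app returns on a six-element array. The comparison with sliding_window_view is an addition for illustration and assumes NumPy 1.20 or newer, where that function is available; it is not part of the original answer.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

def strided_app(a, L, S):  # Window len = L, stride/stepsize = S
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return as_strided(a, shape=(nrows, L), strides=(S * n, n))

a = np.arange(6)
wins = strided_app(a, 3, S=1)
print(wins)  # each row is one length-3 window, shifted by one element

# For S=1 the same windows are available from sliding_window_view,
# which returns a read-only view (assumes NumPy >= 1.20)
same = np.array_equal(wins, sliding_window_view(a, 3))
print(same)
```

Both calls return views on the original data rather than copies, which is why this approach avoids allocating one array per window.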

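Correlation for NaN-filled data

Listed next are NumPy-based counterparts to the pandas correlation computation: a scalar correlation between two 1D arrays, and row-wise correlation values between two 2D arrays.

1) Scalar correlation value between two 1D arrays: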

    def nancorrcoeff_1d(A, B):
        # Combined mask: True where both A and B are non-NaN
        comb_mask = ~(np.isnan(A) | np.isnan(B))
        count = comb_mask.sum()

        # Mean over the valid positions, subtracted from the input arrays
        A_mA = A - np.nansum(A * comb_mask, -1, keepdims=True) / count
        B_mB = B - np.nansum(B * comb_mask, -1, keepdims=True) / count

        # Replace invalid positions with zeros, so that later summations could be computed
        A_mA[~comb_mask] = 0
        B_mB[~comb_mask] = 0

        ssA = np.inner(A_mA, A_mA)
        ssB = np.inner(B_mB, B_mB)

        # Finally get corr coeff
        return np.inner(A_mA, B_mB) / np.sqrt(ssA * ssB)

2) Row-wise correlation between two 2D arrays of shape (m,n), giving a 1D array of shape (m,):

    def nancorrcoeff_rowwise(A, B):
        # Input : Two 2D arrays of same shapes (mxn). Output : One 1D array (m,)
        # Combined mask: True where both A and B are non-NaN
        comb_mask = ~(np.isnan(A) | np.isnan(B))
        count = comb_mask.sum(axis=-1, keepdims=True)

        # Rowwise mean over the valid positions, subtracted from the input arrays
        A_mA = A - np.nansum(A * comb_mask, -1, keepdims=True) / count
        B_mB = B - np.nansum(B * comb_mask, -1, keepdims=True) / count

        # Replace invalid positions with zeros, so that later summations could be computed
        A_mA[~comb_mask] = 0
        B_mB[~comb_mask] = 0

        # Sum of squares across rows
        ssA = np.einsum('ij,ij->i', A_mA, A_mA)
        ssB = np.einsum('ij,ij->i', B_mB, B_mB)

        # Finally get corr coeff
        return np.einsum('ij,ij->i', A_mA, B_mB) / np.sqrt(ssA * ssB)
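
A quick way to check the NaN-aware 1D version is to compare it with pandas' Series.corr, which drops pairs where either value is missing. The sketch below assumes the combined mask is meant to keep only the positions where both inputs are non-NaN; the random data and the NaN positions are made up for illustration.

```python
import numpy as np
import pandas as pd

def nancorrcoeff_1d(A, B):
    # Assumption: valid positions are those where both A and B are non-NaN
    comb_mask = ~(np.isnan(A) | np.isnan(B))
    count = comb_mask.sum()

    # Mean over valid positions only (NaN * 0 stays NaN and is skipped by nansum)
    A_mA = A - np.nansum(A * comb_mask) / count
    B_mB = B - np.nansum(B * comb_mask) / count

    # Zero out invalid positions so they do not contribute to the sums
    A_mA[~comb_mask] = 0
    B_mB[~comb_mask] = 0

    ssA = np.inner(A_mA, A_mA)
    ssB = np.inner(B_mB, B_mB)
    return np.inner(A_mA, B_mB) / np.sqrt(ssA * ssB)

# Hypothetical test data with NaNs in different places in each array
rng = np.random.default_rng(1)
A = rng.random(50)
B = rng.random(50)
A[[3, 10]] = np.nan
B[[10, 20, 40]] = np.nan

ours = nancorrcoeff_1d(A, B)
ref = pd.Series(A).corr(pd.Series(B))  # Pearson, pairwise-complete
print(np.isclose(ours, ref))
```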

For more on generating a rolling window to compute correlation in Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/46757318/
