gpt4 book ai didi

python - 计算 xarray 中缺失数据的相关性

转载 作者:行者123 更新时间:2023-12-04 13:17:42 26 4
gpt4 key购买 nike

我正在尝试沿时间维度计算 xarray 中两个数据集之间的相关性。我的数据集都是经纬度 x 时间。我的一个数据集缺失了足够多的数据,这对于插值和消除间隙是不合理的,相反我只想忽略缺失值。我有一些简单的代码可以正常工作,但没有一个适合我的确切用例。例如:

def covariance(x,y,dims=None):
return xr.dot(x-x.mean(dims), y-y.mean(dims), dims=dims) / x.count(dims)

def correlation(x,y,dims=None):
return covariance(x,y,dims) / (x.std(dims) * y.std(dims))

如果没有数据丢失,效果很好,但当然不能与 nans 一起使用。虽然有一个很好的例子written for xarray here ,即使使用这段代码,我也在努力计算 PIL 逊相关性而不是斯 PIL 曼相关性。

import numpy as np
import xarray as xr
import bottleneck

def covariance_gufunc(x, y):
return ((x - x.mean(axis=-1, keepdims=True))
* (y - y.mean(axis=-1, keepdims=True))).mean(axis=-1)

def pearson_correlation_gufunc(x, y):
return covariance_gufunc(x, y) / (x.std(axis=-1) * y.std(axis=-1))

def spearman_correlation_gufunc(x, y):
x_ranks = bottleneck.rankdata(x, axis=-1)
y_ranks = bottleneck.rankdata(y, axis=-1)
return pearson_correlation_gufunc(x_ranks, y_ranks)

def spearman_correlation(x, y, dim):
return xr.apply_ufunc(
spearman_correlation_gufunc, x, y,
input_core_dims=[[dim], [dim]],
dask='parallelized',
output_dtypes=[float])

终于有一个useful discussion on github将此作为功能添加到 xarray,但尚未实现。有没有一种有效的方法可以在存在数据缺口的数据集上执行此操作?

最佳答案

我一直在关注这个 Github 讨论以及随后实现 .corr() 方法的尝试,看起来我们已经很接近了,但还没有实现。

与此同时,大多数人试图合并的基本代码在另一个答案 (How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?) 中得到了很好的概述。这是一个很好的解决方案,它利用 NumPy 中的矢量化操作,并且可以通过一些小的调整(请参阅链接中接受的答案)来解释时间轴上的 NaN。

def lag_linregress_3D(x, y, lagx=0, lagy=0):
"""
Input: Two xr.Datarrays of any dimensions with the first dim being time.
Thus the input data could be a 1D time series, or for example, have three
dimensions (time,lat,lon).
Datasets can be provided in any order, but note that the regression slope
and intercept will be calculated for y with respect to x.
Output: Covariance, correlation, regression slope and intercept, p-value,
and standard error on regression between the two datasets along their
aligned time dimension.
Lag values can be assigned to either of the data, with lagx shifting x, and
lagy shifting y, with the specified lag amount.
"""
#1. Ensure that the data are properly alinged to each other.
x,y = xr.align(x,y)

#2. Add lag information if any, and shift the data accordingly
if lagx!=0:

# If x lags y by 1, x must be shifted 1 step backwards.
# But as the 'zero-th' value is nonexistant, xr assigns it as invalid
# (nan). Hence it needs to be dropped
x = x.shift(time = -lagx).dropna(dim='time')

# Next important step is to re-align the two datasets so that y adjusts
# to the changed coordinates of x
x,y = xr.align(x,y)

if lagy!=0:
y = y.shift(time = -lagy).dropna(dim='time')
x,y = xr.align(x,y)

#3. Compute data length, mean and standard deviation along time axis:
n = y.notnull().sum(dim='time')
xmean = x.mean(axis=0)
ymean = y.mean(axis=0)
xstd = x.std(axis=0)
ystd = y.std(axis=0)

#4. Compute covariance along time axis
cov = np.sum((x - xmean)*(y - ymean), axis=0)/(n)

#5. Compute correlation along time axis
cor = cov/(xstd*ystd)

#6. Compute regression slope and intercept:
slope = cov/(xstd**2)
intercept = ymean - xmean*slope

#7. Compute P-value and standard error
#Compute t-statistics
tstats = cor*np.sqrt(n-2)/np.sqrt(1-cor**2)
stderr = slope/tstats

from scipy.stats import t
pval = t.sf(tstats, n-2)*2
pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

return cov,cor,slope,intercept,pval,stderr

希望对您有所帮助!祈祷合并很快就会到来。

关于python - 计算 xarray 中缺失数据的相关性,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58546999/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com