python - Creating a DataArray from a dictionary of 2D DataFrames/Arrays

Reposted. Author: 行者123. Updated: 2023-11-28 21:47:09

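I'm trying to move from pandas to xarray for N-dimensional DataArrays, to expand my repertoire.

Realistically, I'm going to have a bunch of different pd.DataFrames (in this case row=month, col=attribute) along a particular axis (patients in the mock example below) that I would like to merge (without using Panels or a MultiIndex, thank you). I would like to convert them to xr.DataArrays so I can build dimensions on top of them. I made a mock dataset to illustrate what I mean.

For this dataset, imagine 100 patients, 12 months, 10000 attributes, and 3 replicates per attribute, which would be a typical 4D dataset. I collapse the 3 replicates of each attribute by taking their mean, so I end up with a 2D pd.DataFrame (row=months, col=attributes). That DataFrame is the value in my dictionary, and the patient it comes from is the key, i.e. (patient_x : DataFrame_X).

I'm also going to include a roundabout way I've done it with np.ndarray placeholders, but it would be really convenient if I could generate an N-dimensional DataArray from a dictionary whose keys are patient_x and whose values are DataFrame_X.

How can I create an N-dimensional DataArray with xarray from a dictionary of pandas DataFrames?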

import xarray as xr
import numpy as np
import pandas as pd

np.random.seed(1618033)

#Set dimensions
a,b,c,d = 100,12,10000,3 #100 patients, 12 months, 10000 attributes, 3 replicates

#Create labels
patients = ["patient_%d" % i for i in range(a)]
months = [j for j in range(b)]
attributes = ["attr_%d" % k for k in range(c)]
replicates = [l for l in range(d)]

coords = [patients,months,attributes]
dims = ["Patients","Months","Attributes"]

#Dict of DataFrames
D_patient_DF = dict()

for i, patient in enumerate(patients):
    A_placeholder = np.zeros((b,c))
    for j, month in enumerate(months):
        #Attribute x Replicates
        A_attrReplicates = np.random.random((c,d))
        #Collapse into 1D Vector
        V_attrExp = A_attrReplicates.mean(axis=1)
        #Fill array with row
        A_placeholder[j,:] = V_attrExp
    #Assign dataframe for every patient
    DF_data = pd.DataFrame(A_placeholder, index = months, columns = attributes)
    D_patient_DF[patient] = DF_data

xr.DataArray(D_patient_DF).dims
#() it's empty

D_patient_DF
#{'patient_0': attr_0 attr_1 attr_2 attr_3 attr_4 attr_5 attr_6 \
# 0 0.445446 0.422018 0.343454 0.140700 0.567435 0.362194 0.563799
# 1 0.440010 0.548535 0.810903 0.482867 0.469542 0.591939 0.579344
# 2 0.645719 0.450773 0.386939 0.418496 0.508290 0.431033 0.622270
# 3 0.555855 0.633393 0.555197 0.556342 0.489865 0.204200 0.823043
# 4 0.916768 0.590534 0.597989 0.592359 0.484624 0.478347 0.507789
# 5 0.847069 0.634923 0.591008 0.249107 0.655182 0.394640 0.579700
# 6 0.700385 0.505331 0.377745 0.651936 0.334216 0.489728 0.282544
# 7 0.777810 0.423889 0.414316 0.389318 0.565144 0.394320 0.511034
# 8 0.440633 0.069643 0.675037 0.365963 0.647660 0.520047 0.539253
# 9 0.333213 0.328315 0.662203 0.594030 0.790758 0.754032 0.602375
# 10 0.470330 0.419496 0.171292 0.677439 0.683759 0.646363 0.465788
# 11 0.758556 0.674664 0.801860 0.612087 0.567770 0.801514 0.179939
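
For reference, here is a minimal sketch of the np.ndarray placeholder route mentioned earlier. It uses much smaller sizes than the mock dataset so it runs quickly. The sizes, the random seed, and the name DA_data are assumptions made for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small stand-in sizes (assumption): 4 patients, 3 months, 5 attributes
patients = ["patient_%d" % i for i in range(4)]
months = list(range(3))
attributes = ["attr_%d" % k for k in range(5)]

# Build a dict of 2D DataFrames shaped like the one in the question
rng = np.random.default_rng(0)
D_patient_DF = {
    p: pd.DataFrame(rng.random((len(months), len(attributes))),
                    index=months, columns=attributes)
    for p in patients
}

# Pre-allocate a 3D placeholder and copy each patient's 2D table into it
A_placeholder = np.zeros((len(patients), len(months), len(attributes)))
for i, patient in enumerate(patients):
    A_placeholder[i, :, :] = D_patient_DF[patient].values

# Wrap the filled array, attaching labels for every axis
DA_data = xr.DataArray(
    A_placeholder,
    coords=[patients, months, attributes],
    dims=["Patients", "Months", "Attributes"],
)
print(DA_data.dims)   # ('Patients', 'Months', 'Attributes')
print(DA_data.shape)  # (4, 3, 5)
```

This works, but the labels have to be passed by hand and kept in step with the order in which the placeholder was filled, which is the bookkeeping a dictionary-based constructor would avoid.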

Best answer

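From a dictionary of DataFrames, you can convert each value into a DataArray (adding dimension labels), load the results into a Dataset, and then convert that to a DataArray: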

variables = {k: xr.DataArray(v, dims=['month', 'attribute'])
             for k, v in D_patient_DF.items()}
combined = xr.Dataset(variables).to_array(dim='patient')
print(combined)

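Note, however, that the result will not necessarily be in sorted order; it follows the arbitrary order of dictionary iteration. If you want sorted order, use an OrderedDict instead (insert this after setting variables above):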

import collections

variables = collections.OrderedDict((k, variables[k]) for k in patients)

This outputs:

<xarray.DataArray (patient: 100, month: 12, attribute: 10000)>
array([[[ 0.61176399, 0.26172557, 0.74657302, ..., 0.43742111,
0.47503291, 0.37263983],
[ 0.34970732, 0.81527751, 0.53612895, ..., 0.68971198,
0.68962168, 0.75103198],
[ 0.71282751, 0.23143891, 0.28481889, ..., 0.52612376,
0.56992843, 0.3483683 ],
...,
[ 0.84627257, 0.5033482 , 0.44116194, ..., 0.55020168,
0.48151353, 0.36374339],
[ 0.53336826, 0.59566147, 0.45269417, ..., 0.41951078,
0.46815364, 0.44630235],
[ 0.25720899, 0.18738289, 0.66639783, ..., 0.36149276,
0.58865823, 0.33918553]],

...,

[[ 0.42933273, 0.58642504, 0.38716496, ..., 0.45667285,
0.72684589, 0.52335464],
[ 0.34946576, 0.35821339, 0.33097093, ..., 0.59037927,
0.30233665, 0.6515749 ],
[ 0.63673498, 0.31022272, 0.65788374, ..., 0.47881873,
0.67825066, 0.58704331],
...,
[ 0.44822441, 0.502429 , 0.50677081, ..., 0.4843405 ,
0.84396521, 0.45460029],
[ 0.61336348, 0.46338301, 0.60715273, ..., 0.48322379,
0.66530209, 0.52204897],
[ 0.47520639, 0.43490559, 0.27309414, ..., 0.35280585,
0.30280485, 0.77537204]]])
Coordinates:
* month (month) int64 0 1 2 3 4 5 6 7 8 9 10 11
* patient (patient) <U10 'patient_80' 'patient_73' 'patient_79' ...
* attribute (attribute) object 'attr_0' 'attr_1' 'attr_2' 'attr_3' ...
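
To try the Dataset route end to end, here is a small self-contained sketch under stated assumptions: the sizes and random seed are made up, and the final `.sel` call is one possible way to pin the patient order, not something the answer prescribes. On Python 3.7+ plain dicts keep insertion order, so the arbitrary-order caveat mainly concerns older interpreters; selecting with the label list fixes the order either way.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small stand-in sizes (assumption): 4 patients, 3 months, 5 attributes
patients = ["patient_%d" % i for i in range(4)]
months = list(range(3))
attributes = ["attr_%d" % k for k in range(5)]

rng = np.random.default_rng(0)
D_patient_DF = {
    p: pd.DataFrame(rng.random((len(months), len(attributes))),
                    index=months, columns=attributes)
    for p in patients
}

# One 2D DataArray per patient; index and columns become coordinates
variables = {k: xr.DataArray(v, dims=["month", "attribute"])
             for k, v in D_patient_DF.items()}

# Stack the Dataset's variables along a new leading "patient" dimension
combined = xr.Dataset(variables).to_array(dim="patient")

# Reorder along "patient" to match the label list explicitly
combined = combined.sel(patient=patients)

print(combined.dims)  # ('patient', 'month', 'attribute')
print(combined.patient.values)
```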

Alternatively, you can create a list of 2D DataArrays and then use concat:

patient_list = []
for i, patient in enumerate(patients):
    df = ...
    # Each DataFrame is month x attribute
    array = xr.DataArray(df, dims=['month', 'attribute'])
    # Append the DataArray, not the raw DataFrame
    patient_list.append(array)
combined = xr.concat(patient_list, dim=pd.Index(patients, name='patient'))

This gives the same result, and is probably the cleanest code.
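
A runnable version of the concat route might look like the sketch below. The dictionary lookup that stands in for the elided `df = ...` step, along with the sizes and seed, is an assumption for illustration; in practice each DataFrame could come from anywhere.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small stand-in sizes (assumption): 4 patients, 3 months, 5 attributes
patients = ["patient_%d" % i for i in range(4)]
months = list(range(3))
attributes = ["attr_%d" % k for k in range(5)]

rng = np.random.default_rng(0)
D_patient_DF = {
    p: pd.DataFrame(rng.random((len(months), len(attributes))),
                    index=months, columns=attributes)
    for p in patients
}

# Build the list in the order of `patients`, so the order is explicit
patient_list = [
    xr.DataArray(D_patient_DF[p], dims=["month", "attribute"])
    for p in patients
]

# The pd.Index supplies both the new dimension's name and its labels
combined = xr.concat(patient_list, dim=pd.Index(patients, name="patient"))

# Compare against the Dataset route on the same data
via_dataset = xr.Dataset(dict(zip(patients, patient_list))).to_array(dim="patient")
same = np.array_equal(combined.values, via_dataset.values)

print(combined.dims)  # ('patient', 'month', 'attribute')
print(same)
```

Because the list is built by iterating over `patients`, the order along the new dimension does not depend on dictionary iteration at all.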

Regarding python - Creating a DataArray from a dictionary of 2D DataFrames/Arrays, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36948476/
