
python - When to use a MultiIndex vs. xarray in pandas

Reposted. Author: 太空狗. Updated: 2023-10-29 20:15:16

The pandas pivot tables documentation seems to recommend using a MultiIndex to handle data with more than two dimensions:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import pandas.util.testing as tm; tm.N = 3

In [4]: def unpivot(frame):
   ...:     N, K = frame.shape
   ...:     data = {'value' : frame.values.ravel('F'),
   ...:             'variable' : np.asarray(frame.columns).repeat(N),
   ...:             'date' : np.tile(np.asarray(frame.index), K)}
   ...:     return pd.DataFrame(data, columns=['date', 'variable', 'value'])
   ...:

In [5]: df = unpivot(tm.makeTimeDataFrame())

In [6]: df['value2'] = df['value'] * 2

In [7]: df
Out[7]:
          date variable     value    value2
0   2000-01-03        A  0.462461  0.924921
1   2000-01-04        A -0.517911 -1.035823
2   2000-01-05        A  0.831014  1.662027
3   2000-01-03        B -0.492679 -0.985358
4   2000-01-04        B -1.234068 -2.468135
5   2000-01-05        B  1.725218  3.450437
6   2000-01-03        C  0.453859  0.907718
7   2000-01-04        C -0.763706 -1.527412
8   2000-01-05        C  0.839706  1.679413
9   2000-01-03        D -0.048108 -0.096216
10  2000-01-04        D  0.184461  0.368922
11  2000-01-05        D -0.349496 -0.698993

In [8]: df.pivot(index='date', columns='variable')
Out[8]:
               value                                  value2            \
variable           A         B         C         D         A         B
date
2000-01-03 -1.558856 -1.144732 -0.234630 -1.252482 -3.117712 -2.289463
2000-01-04 -1.351152 -0.173595  0.470253 -1.181006 -2.702304 -0.347191
2000-01-05  0.151067 -0.402517 -2.625085  1.275430  0.302135 -0.805035

variable           C         D
date
2000-01-03 -0.469259 -2.504964
2000-01-04  0.940506 -2.362012
2000-01-05 -5.250171  2.550861
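
For comparison, here is a minimal sketch of the same MultiIndex approach that does not rely on pandas.util.testing, which is not available in recent pandas releases. The random frame `wide` and the names `long`, `indexed`, `only_a`, `cell` and `pivoted` are assumptions made for illustration and are not part of the original session.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tm.makeTimeDataFrame(): 3 business days x 4 columns.
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-03", periods=3, freq="B")
wide = pd.DataFrame(rng.standard_normal((3, 4)), index=dates, columns=list("ABCD"))

# Long format with the same columns that unpivot() produces.
long = (
    wide.rename_axis("date")
    .melt(var_name="variable", value_name="value", ignore_index=False)
    .reset_index()
)
long["value2"] = long["value"] * 2

# A MultiIndex on the rows keeps both dimensions in one 2-D table.
indexed = long.set_index(["date", "variable"]).sort_index()

# Select every row for one label of a level, or a single cell.
only_a = indexed.xs("A", level="variable")
cell = indexed.loc[(pd.Timestamp("2000-01-04"), "B"), "value2"]

# unstack() moves a row level into the columns, like the pivot call above.
pivoted = indexed.unstack("variable")
```

Each extra dimension becomes one more index level here, so selections go through level names (`xs`, tuple keys in `loc`) rather than through separate array axes.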

I thought xarray was designed for handling multidimensional datasets like this:

In [9]: import xarray as xr

In [10]: xr.DataArray(dict([(var, df[df.variable==var].drop(columns='variable')) for var in np.unique(df.variable)]))
Out[10]:
<xarray.DataArray ()>
array({'A': date value value2
0 2000-01-03 0.462461 0.924921
1 2000-01-04 -0.517911 -1.035823
2 2000-01-05 0.831014 1.662027, 'C': date value value2
6 2000-01-03 0.453859 0.907718
7 2000-01-04 -0.763706 -1.527412
8 2000-01-05 0.839706 1.679413, 'B': date value value2
3 2000-01-03 -0.492679 -0.985358
4 2000-01-04 -1.234068 -2.468135
5 2000-01-05 1.725218 3.450437, 'D': date value value2
9 2000-01-03 -0.048108 -0.096216
10 2000-01-04 0.184461 0.368922
11 2000-01-05 -0.349496 -0.698993}, dtype=object)

Is one of these approaches better than the other? Why hasn't xarray completely replaced multi-indexing?

Best answer

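There does seem to be a transition toward xarray for working with multidimensional arrays. Pandas is deprecating support for the 3D Panel data structure, and the documentation even suggests using xarray for working with multidimensional arrays: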

'Oftentimes, one can simply use a MultiIndex DataFrame for easily working with higher dimensional data.

In addition, the xarray package was built from the ground up, specifically in order to support the multi-dimensional analysis that is one of Panel's main use cases. Here is a link to the xarray panel-transition documentation.'

The xarray documentation states its aims and goals:

xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data...

...Our target audience is anyone who needs N-dimensional labelled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF
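
To make "N-dimensional labelled arrays" concrete, here is a small sketch of a 3-D `DataArray`. The `site` dimension, the coordinate values and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 3-D data: 3 dates x 4 variables x 2 sites.
dates = pd.date_range("2000-01-03", periods=3, freq="B")
values = np.arange(24, dtype=float).reshape(3, 4, 2)

arr = xr.DataArray(
    values,
    dims=("date", "variable", "site"),
    coords={"date": dates, "variable": list("ABCD"), "site": ["x", "y"]},
    name="value",
)

# Select by label and reduce over a named dimension; no axis numbers needed.
b_at_x = arr.sel(variable="B", site="x")  # 1-D, indexed by date
mean_over_time = arr.mean(dim="date")     # 2-D, variable x site
```

Every dimension has a name and its own coordinate labels, so adding a fourth or fifth dimension does not change how selections and reductions are written.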

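Compared with using plain numpy, the main advantage of xarray is that it uses labels the same way pandas does, but across multiple dimensions. If you are working with 3-dimensional data using a MultiIndex, xarray could be interchangeable. As the number of dimensions in your dataset grows, xarray becomes much more manageable. I cannot comment on how each performs in terms of efficiency or speed.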
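
As a rough illustration of that interchangeability, the sketch below converts a MultiIndex DataFrame to an xarray `Dataset` with `DataFrame.to_xarray()` and back with `Dataset.to_dataframe()`. The data values and the names `long`, `indexed`, `ds`, `b_series` and `round_trip` are made up for the example; only the column layout follows the question's `df`.

```python
import numpy as np
import pandas as pd
import xarray as xr  # to_xarray() requires xarray to be installed

# Hypothetical long-format data shaped like the question's df.
dates = pd.date_range("2000-01-03", periods=3, freq="B")
long = pd.DataFrame({
    "date": np.tile(dates, 4),
    "variable": np.repeat(list("ABCD"), 3),
    "value": np.arange(12, dtype=float),
})
long["value2"] = long["value"] * 2

# MultiIndex levels become xarray dimensions; columns become data variables.
indexed = long.set_index(["date", "variable"])
ds = indexed.to_xarray()

# Label-based selection along one dimension.
b_series = ds["value"].sel(variable="B")

# Converting back yields a DataFrame with a (date, variable) MultiIndex.
round_trip = ds.to_dataframe()
```

This is also a more conventional route into xarray than passing a dict of DataFrames to `xr.DataArray`, which wraps the dict as a single object rather than building labelled dimensions.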

Regarding python - When to use a MultiIndex vs. xarray in pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42876278/
