python - Pandas : Merge hierarchical data

Reposted by: 太空宇宙 · Last updated: 2023-11-03 11:29:31

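I am looking for a way to merge data with a complex hierarchy into a pandas DataFrame. The hierarchy arises from different interdependencies within the data. For example, there are parameters that define how the data was produced, then there are time-dependent observables, spatially dependent observables, and observables that depend on both time and space.

To be more explicit: suppose I have the following data.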

import numpy as np
import pandas as pd

# Parameters
t_max = 2
t_step = 15
sites = 4

# Purely time-dependent
t = np.linspace(0, t_max, t_step)
f_t = t**2 - t

# Purely site-dependent
position = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])  # (x, y)
site_weight = np.arange(sites)

# Time-, and site-dependent.
occupation = np.arange(t_step*sites).reshape((t_step, sites))

# Time-, and site-, site-dependent
correlation = np.arange(t_step*sites*sites).reshape((t_step, sites, sites))

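(In the end I will of course have many such data sets, one for each set of parameters.)

Now I would like to store all of this information in a pandas DataFrame. I imagine the end result looking something like this: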

| ----- parameters ----- | -------------------------------- observables --------------------------------- |
|                        |                                        | ---------- time-dependent ----------- |
|                        | ----------- site-dependent --- )       ( ------------------------ |            |
|                        |                                | - site2-dependent - |                         |
| sites | t_max | t_step | site | r_x | r_y | site weight | site2 | correlation | occupation | f_t | time |

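I assume that partially overlapping hierarchies may not be possible. It is fine if they are only implicit, for example if all site-dependent data can be obtained by indexing the DataFrame in a particular way.

Also, if you think there is a better way to arrange this data in pandas, feel free to tell me.

Question

How can I construct a DataFrame that contains all of the above data and somehow reflects the mutual dependencies (e.g. f_t depends on time but not on site)? All of this should be done in a way that is general enough that it is easy to add or remove observables, possibly with new interdependencies (for example, quantities that depend on a second time axis, such as time-time correlations).

What I have so far

In the following I show how far I have come. However, I do not think this is the ideal way to achieve the goal above, in particular because it lacks generality when it comes to adding or removing observables.

Indices

Given the data above, I first defined all the (multi-)indices I need.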

ind_time = pd.Index(t, name='time')
ind_site = pd.Index(np.arange(sites), name='site')
ind_site_site = pd.MultiIndex.from_product([ind_site, ind_site], names=['site', 'site2'])
ind_time_site = pd.MultiIndex.from_product([ind_time, ind_site], names=['time', 'site'])
ind_time_site_site = pd.MultiIndex.from_product([ind_time, ind_site, ind_site], names=['time', 'site', 'site2'])

Individual DataFrames

Next, I created a data frame for each individual chunk of data.

df_parms = pd.DataFrame({'t_max': t_max, 't_step': t_step, 'sites': sites}, index=[0])
df_time = pd.DataFrame({'f_t': f_t}, index=ind_time)
df_position = pd.DataFrame(position, columns=['r_x', 'r_y'], index=ind_site)
df_weight = pd.DataFrame(site_weight, columns=['site weight'], index=ind_site)
df_occupation = pd.DataFrame(occupation.flatten(), index=ind_time_site, columns=['occupation'])
df_correlation = pd.DataFrame(correlation.flatten(), index=ind_time_site_site, columns=['correlation'])
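
The flattened arrays above only line up with their MultiIndex because `flatten()` walks the array in row-major order and `MultiIndex.from_product` varies its last level fastest. The following minimal sketch checks this on a smaller array; the sizes and values are illustrative assumptions, not the data from the question.

```python
import numpy as np
import pandas as pd

# Small stand-in data: 3 time points, 2 sites (illustrative values only)
t = np.array([0.0, 1.0, 2.0])
sites = 2
occupation = np.arange(len(t) * sites).reshape((len(t), sites))

# The last level ('site') varies fastest, matching row-major flatten()
index = pd.MultiIndex.from_product(
    [t, np.arange(sites)], names=['time', 'site']
)
df = pd.DataFrame({'occupation': occupation.flatten()}, index=index)

# Look up a single element by its (time, site) label
value = df.loc[(1.0, 1), 'occupation']

# Unstacking the 'site' level recovers the original 2-D layout
round_trip = df['occupation'].unstack('site').to_numpy()
```

If the array were stored in a different axis order (for example site first, then time), the index levels passed to `from_product` would have to be reordered accordingly.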

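The index=[0] in df_parms seems to be necessary, because otherwise pandas complains that only scalar values were passed. In practice I would probably replace it with a timestamp of when this particular simulation was run. That would at least convey some useful information.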
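
As a small illustration of the point about scalar values, the sketch below shows the error pandas raises for an all-scalar dict and two ways to obtain a one-row frame. The variable names are chosen for this example only.

```python
import pandas as pd

params = {'t_max': 2, 't_step': 15, 'sites': 4}

# With only scalars and no index, pandas cannot infer the number of rows
error_message = None
try:
    pd.DataFrame(params)
except ValueError as exc:
    error_message = str(exc)

# Option 1: pass an explicit one-element index
df_a = pd.DataFrame(params, index=[0])

# Option 2: wrap the dict in a list, so it is treated as one record
df_b = pd.DataFrame([params])
```

Both options give a single row with one column per parameter; the first makes it easy to swap the `0` for a more meaningful label such as a timestamp.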

Merging the observables

With the data frames available, I merge all observables into one big DataFrame.

df_all_but_parms = pd.merge(
  pd.merge(
    pd.merge(
      df_time.reset_index(),
      df_occupation.reset_index(),
      how='outer'
    ),
    df_correlation.reset_index(),
    how='outer'
  ),
  pd.merge(
    df_position.reset_index(),
    df_weight.reset_index(),
    how='outer'
  ),
  how='outer'
)

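This is the part I currently like least. The merge function only works on pairs of data frames and requires them to have at least one common column. So I have to be careful about the order in which I join the data frames, and if I wanted to add an orthogonal observable I could not merge it with the other data, because they would not share a common column. Is there a function that achieves the same result with a single call on a list of data frames? I tried concat, but it does not merge the common columns, so I end up with lots of duplicated time and site columns.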
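
One possible way to merge a whole list of frames in one call, including frames that share no column, is sketched below. The helper `merge_all` is hypothetical (it is not a pandas function), the example data is made up, and `how='cross'` assumes pandas 1.2 or newer.

```python
from functools import reduce

import numpy as np
import pandas as pd


def merge_all(frames):
    """Merge a list of flat DataFrames pairwise (hypothetical helper).

    Frames that share columns are outer-merged on those columns; frames
    without any shared column are combined as a Cartesian product.
    """
    def merge_pair(left, right):
        shared = left.columns.intersection(right.columns)
        if len(shared) == 0:
            # No common column: pair every left row with every right row
            return pd.merge(left, right, how='cross')
        return pd.merge(left, right, how='outer', on=list(shared))

    return reduce(merge_pair, frames)


# Illustrative data: a time-only, a site-only and a time/site observable
time = pd.DataFrame({'time': [0.0, 1.0], 'f_t': [0.0, 0.5]})
site = pd.DataFrame({'site': [0, 1, 2], 'weight': [10, 20, 30]})
occupation = pd.DataFrame({
    'time': np.repeat([0.0, 1.0], 3),
    'site': np.tile([0, 1, 2], 2),
    'occupation': np.arange(6),
})

merged = merge_all([time, site, occupation])
```

With this fallback the order of the list no longer decides whether the merge succeeds, although it can still affect the column order of the result.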

Merging all the data

Finally, I merge the data with the parameters.

pd.concat([df_parms, df_all_but_parms], axis=1, keys=['parameters', 'observables'])

So far, the end result looks like this:

         parameters                 observables                                                                       
              sites  t_max  t_step         time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight
    0             4      2      15     0.000000  0.000000     0           0      0            0    0    0            0
    1           NaN    NaN     NaN     0.000000  0.000000     0           0      1            1    0    0            0
    2           NaN    NaN     NaN     0.000000  0.000000     0           0      2            2    0    0            0
    3           NaN    NaN     NaN     0.000000  0.000000     0           0      3            3    0    0            0
    4           NaN    NaN     NaN     0.142857 -0.122449     0           4      0           16    0    0            0
    ..          ...    ...     ...          ...       ...   ...         ...    ...          ...  ...  ...          ...
    235         NaN    NaN     NaN     1.857143  1.591837     3          55      3          223    1    1            3
    236         NaN    NaN     NaN     2.000000  2.000000     3          59      0          236    1    1            3
    237         NaN    NaN     NaN     2.000000  2.000000     3          59      1          237    1    1            3
    238         NaN    NaN     NaN     2.000000  2.000000     3          59      2          238    1    1            3
    239         NaN    NaN     NaN     2.000000  2.000000     3          59      3          239    1    1            3

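As you can see, this does not work very well, because only the first row is actually assigned the parameters. All other rows just contain NaN in place of the parameters. But since these are the parameters for all of the data, they should be present in every other row of this data frame as well.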
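
The NaN values appear because concat aligns on the row index and the parameter frame has only row 0. A minimal sketch of one workaround, assuming the two-level column layout should be kept, is to repeat the single parameter row once per observable row before concatenating. The small observable frame here is a made-up stand-in.

```python
import numpy as np
import pandas as pd

params = pd.DataFrame({'t_max': 2, 't_step': 15, 'sites': 4}, index=[0])

# Stand-in for the merged observables (illustrative values only)
t = np.array([0.0, 1.0, 2.0])
obs = pd.DataFrame({'time': t, 'f_t': t**2 - t})

# Repeat the one parameter row so that it matches the observables row by row
params_repeated = (
    params.loc[params.index.repeat(len(obs))].reset_index(drop=True)
)

combined = pd.concat(
    [params_repeated, obs], axis=1, keys=['parameters', 'observables']
)
```

This keeps the `parameters` / `observables` grouping in the column MultiIndex while filling every row.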

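A minor side question: how smart will pandas be if I store the data frame above in HDF5? Will I end up with a lot of duplicated data, or will it avoid storing duplicates?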

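Update

Thanks to Jeff's answer, I was able to push all the data into one data frame with a generic merge. The basic idea is that all my observables already have a few columns in common, namely the parameters.

First, I add the parameters to the data frames of all observables.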

all_observables = [df_time, df_position, df_weight, df_occupation, df_correlation]
# Build a list rather than a lazy map object, so it can be iterated twice in Python 3
flat = [df.reset_index() for df in all_observables]
for df in flat:
    for c in df_parms:
        df[c] = df_parms.loc[0, c]

Then I can merge them all together with a reduce.

from functools import reduce  # reduce is no longer a builtin in Python 3

df_all = reduce(lambda a, b: pd.merge(a, b, how='outer'), flat)

The result has the desired form:

         time       f_t  sites  t_max  t_step  site  r_x  r_y  site weight  occupation  site2  correlation
0    0.000000  0.000000      4      2      15     0    0    0            0           0      0            0
1    0.000000  0.000000      4      2      15     0    0    0            0           0      1            1
2    0.000000  0.000000      4      2      15     0    0    0            0           0      2            2
3    0.000000  0.000000      4      2      15     0    0    0            0           0      3            3
4    0.142857 -0.122449      4      2      15     0    0    0            0           4      0           16
5    0.142857 -0.122449      4      2      15     0    0    0            0           4      1           17
6    0.142857 -0.122449      4      2      15     0    0    0            0           4      2           18
..        ...       ...    ...    ...     ...   ...  ...  ...          ...         ...    ...          ...
233  1.857143  1.591837      4      2      15     3    1    1            3          55      1          221
234  1.857143  1.591837      4      2      15     3    1    1            3          55      2          222
235  1.857143  1.591837      4      2      15     3    1    1            3          55      3          223
236  2.000000  2.000000      4      2      15     3    1    1            3          59      0          236
237  2.000000  2.000000      4      2      15     3    1    1            3          59      1          237
238  2.000000  2.000000      4      2      15     3    1    1            3          59      2          238
239  2.000000  2.000000      4      2      15     3    1    1            3          59      3          239

By re-indexing the data, the hierarchy becomes more apparent:

df_all.set_index(['t_max', 't_step', 'sites', 'time', 'site', 'site2'], inplace=True)

which results in

                                             f_t  r_x  r_y  site weight  occupation  correlation
t_max t_step sites time     site site2                                                          
2     15     4     0.000000 0    0      0.000000    0    0            0           0            0
                                 1      0.000000    0    0            0           0            1
                                 2      0.000000    0    0            0           0            2
                                 3      0.000000    0    0            0           0            3
                   0.142857 0    0     -0.122449    0    0            0           4           16
                                 1     -0.122449    0    0            0           4           17
                                 2     -0.122449    0    0            0           4           18
...                                          ...  ...  ...          ...         ...          ...
                   1.857143 3    1      1.591837    1    1            3          55          221
                                 2      1.591837    1    1            3          55          222
                                 3      1.591837    1    1            3          55          223
                   2.000000 3    0      2.000000    1    1            3          59          236
                                 1      2.000000    1    1            3          59          237
                                 2      2.000000    1    1            3          59          238
                                 3      2.000000    1    1            3          59          239
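
With the index set up this way, the implicit hierarchy can be used to pull out the parts that depend on only one axis. The sketch below uses a small made-up frame with the same index level names; it is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# Small stand-in frame indexed by (time, site, site2); values are illustrative
index = pd.MultiIndex.from_product(
    [[0.0, 1.0], [0, 1], [0, 1]], names=['time', 'site', 'site2']
)
df = pd.DataFrame({
    'f_t': np.repeat([0.0, 0.5], 4),                      # depends on time only
    'site weight': np.tile(np.repeat([10, 20], 2), 2),    # depends on site only
    'correlation': np.arange(8),                          # depends on all three
}, index=index)

# Purely time-dependent observable: one value per time step
f_t = df.groupby(level='time')['f_t'].first()

# Purely site-dependent observable: one value per site
weight = df.groupby(level='site')['site weight'].first()

# Everything at a single time step, indexed by (site, site2)
at_t1 = df.xs(1.0, level='time')
```

Taking `first()` per group relies on the value being constant within each group, which holds here by construction because the observable was broadcast during the merge.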

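Best answer

I think you should do something like this, using df_parms as your index. That way you can easily concatenate more frames with different parameters.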

In [67]: pd.set_option('display.max_rows', 10)

In [68]: dfx = df_all_but_parms.copy()

You need to assign the columns to the frame (you could also construct a MultiIndex directly, but this starts from your data).

In [69]: for c in df_parms.columns:
             dfx[c] = df_parms.loc[0,c]

In [70]: dfx
Out[70]: 
         time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight  sites  t_max  t_step
0    0.000000  0.000000     0           0      0            0    0    0            0      4      2      15
1    0.000000  0.000000     0           0      1            1    0    0            0      4      2      15
2    0.000000  0.000000     0           0      2            2    0    0            0      4      2      15
3    0.000000  0.000000     0           0      3            3    0    0            0      4      2      15
4    0.142857 -0.122449     0           4      0           16    0    0            0      4      2      15
..        ...       ...   ...         ...    ...          ...  ...  ...          ...    ...    ...     ...
235  1.857143  1.591837     3          55      3          223    1    1            3      4      2      15
236  2.000000  2.000000     3          59      0          236    1    1            3      4      2      15
237  2.000000  2.000000     3          59      1          237    1    1            3      4      2      15
238  2.000000  2.000000     3          59      2          238    1    1            3      4      2      15
239  2.000000  2.000000     3          59      3          239    1    1            3      4      2      15

[240 rows x 12 columns]

Set the index (this returns a new object)

In [71]: dfx.set_index(['sites','t_max','t_step'])
Out[71]: 
                        time       f_t  site  occupation  site2  correlation  r_x  r_y  site weight
sites t_max t_step                                                                                 
4     2     15      0.000000  0.000000     0           0      0            0    0    0            0
            15      0.000000  0.000000     0           0      1            1    0    0            0
            15      0.000000  0.000000     0           0      2            2    0    0            0
            15      0.000000  0.000000     0           0      3            3    0    0            0
            15      0.142857 -0.122449     0           4      0           16    0    0            0
...                      ...       ...   ...         ...    ...          ...  ...  ...          ...
            15      1.857143  1.591837     3          55      3          223    1    1            3
            15      2.000000  2.000000     3          59      0          236    1    1            3
            15      2.000000  2.000000     3          59      1          237    1    1            3
            15      2.000000  2.000000     3          59      2          238    1    1            3
            15      2.000000  2.000000     3          59      3          239    1    1            3

[240 rows x 9 columns]
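
To illustrate why having the parameters in the index helps, here is a minimal sketch that stacks two runs with different parameters using pd.concat and then selects one of them again. The function `run` is a hypothetical stand-in for a simulation and only produces the time-dependent observable.

```python
import numpy as np
import pandas as pd


def run(t_max, t_step):
    """Hypothetical stand-in for one simulation run with given parameters."""
    t = np.linspace(0, t_max, t_step)
    df = pd.DataFrame({'time': t, 'f_t': t**2 - t})
    # Broadcast the parameters to every row, then move them into the index
    df['t_max'] = t_max
    df['t_step'] = t_step
    return df.set_index(['t_max', 't_step', 'time'])


# Frames from different parameter sets stack without any key collisions
runs = pd.concat([run(2, 3), run(4, 5)])

# Select a single run again by its parameter values
one_run = runs.xs((4, 5), level=['t_max', 't_step'])
```

Because each run carries its own parameter values in the index, adding another parameter set is just one more element in the list passed to `pd.concat`.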

This article about python - Pandas : Merge hierarchical data is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/24754496/
