python - Joining a hierarchy of relational tables in a Pandas DataFrame

Reposted by: 太空宇宙 · Updated: 2023-11-03 11:32:41

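I am wondering whether there is a way to get something like hierarchical indexing, but within the data of a pandas table. I want to combine several DataFrames into one, where some of them contain multiple entries for a single ID in another.

As always, it is easiest to just show the structure. Here is a simplified DataFrame 1: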

>>> df1
   id             txt
0   0      first sent
1   1     another one
2   2     I think you
3   3  will like this
4   4       will work

DataFrame 2, meanwhile, may hold several attributes for each entry of DataFrame 1 (matched on the id column):

>>> df2
   attr  id
0  chem   0
1   dis   0
2  chem   1
3  chem   1
4  chem   2
5   dis   2
6   dis   3
7   dis   3
8   dis   4
9  chem   4

So I tried this:

import pandas as pd

# 'ids' avoids shadowing the built-in id(); there must be one text per id
ids = range(0, 5)
texts = ['first sent', 'another one', 'I think you', 'will like this', 'will work']
df = pd.DataFrame({'id': ids, 'txt': texts})
df2 = pd.DataFrame({'attr': ['chem', 'dis', 'chem', 'chem', 'chem', 'dis', 'dis', 'dis', 'dis', 'chem'], 'id': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]})

The merge simply gives:

>>> df.merge(df2, on='id')
   id             txt  attr
0   0      first sent  chem
1   0      first sent   dis
2   1     another one  chem
3   1     another one  chem
4   2     I think you  chem
5   2     I think you   dis
6   3  will like this   dis
7   3  will like this   dis
8   4       will work   dis
9   4       will work  chem

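As you can see, the 'txt' column is now duplicated. In this case that is, IMO, unnecessary, and it could cause serious memory problems if df2 holds many attributes per id. The duplicated text could (in this case) be thousands of times larger than the data needed to represent the same information as two separate DataFrames.

I considered making the 'txt' column a level of a hierarchical index (although I am sure this is entirely the wrong design), but even then the duplication remains.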
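To put a rough number on this concern, here is a minimal sketch that compares the memory pandas reports for the merged frame against the two separate frames. The sizes (`n_ids`, `attrs_per_id`, 1000-character texts) are made-up values chosen only to make the effect visible, not figures from the question. Note also that with the classic object dtype the merged column may point at shared string objects, so `memory_usage(deep=True)` is best read as an upper bound there.

```python
import pandas as pd

# Hypothetical sizes, chosen only for illustration.
n_ids, attrs_per_id = 5, 200
long_texts = [f"text {i} " + "x" * 1000 for i in range(n_ids)]

df = pd.DataFrame({"id": range(n_ids), "txt": long_texts})
df2 = pd.DataFrame({
    # each id repeated attrs_per_id times
    "id": [i for i in range(n_ids) for _ in range(attrs_per_id)],
    "attr": ["chem", "dis"] * (n_ids * attrs_per_id // 2),
})

merged = df.merge(df2, on="id")

# deep=True asks pandas to include the string payloads, not just pointers
separate_bytes = df.memory_usage(deep=True).sum() + df2.memory_usage(deep=True).sum()
merged_bytes = merged.memory_usage(deep=True).sum()

print(f"separate frames: {separate_bytes:,} bytes")
print(f"merged frame:    {merged_bytes:,} bytes")
```

The gap grows with both the text length and the number of attributes per id, which is the situation described above.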

>>> df.merge(df2, on='id').set_index(['id', 'txt'])
                   attr
id txt                 
0  first sent      chem
   first sent       dis
1  another one     chem
   another one     chem
2  I think you     chem
   I think you      dis
3  will like this   dis
   will like this   dis
4  will work        dis
   will work       chem
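A side note on the output above: the repetition is partly a display matter. As far as I understand it, a MultiIndex keeps the distinct values of each level once (in `levels`) and stores one integer per row (in `codes`). The sketch below inspects both, using the same toy data; the intermediate merge still produces the repeated 'txt' column before `set_index` is applied, so this does not by itself avoid the memory spike.

```python
import pandas as pd

texts = ["first sent", "another one", "I think you", "will like this", "will work"]
df = pd.DataFrame({"id": range(5), "txt": texts})
df2 = pd.DataFrame({
    "attr": ["chem", "dis", "chem", "chem", "chem", "dis", "dis", "dis", "dis", "chem"],
    "id": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
})

indexed = df.merge(df2, on="id").set_index(["id", "txt"])
idx = indexed.index

# Level 1 is 'txt': the distinct strings are held once here ...
print(idx.levels[1])
# ... and each of the ten rows refers to them through an integer code
print(idx.codes[1])
```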

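Is there a way to store this information in a single DataFrame?

Best answer

Here is a memory-efficient solution using pandas categories. Each value in the 'txt' column of the result now costs only one integer, which is much cheaper than storing the text string.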

import pandas as pd

# five ids and five texts, matching the data shown in the question
ids = range(0, 5)
texts = ['first sent', 'another one', 'I think you', 'will like this', 'will work']

df = pd.DataFrame({'txt':texts, 'id':ids})
df2 = pd.DataFrame({'attr':['chem', 'dis', 'chem', 'chem', 'chem', 'dis', 'dis', 'dis', 'dis', 'chem'] ,'id':[0,0,1,1,2,2,3,3,4,4]})

# convert to category codes and store mapping
df['txt'] = df['txt'].astype('category')
df_txt_cats = dict(enumerate(df['txt'].cat.categories))
df['txt'] = df['txt'].cat.codes

# perform merge - memory efficient since the result only uses integers
df_merged = df.merge(df2, on='id')

# rename categories from integer codes back to text strings using the stored mapping
# (rename_categories is used because assigning to .cat.categories was removed in pandas 2.0)
df_merged['txt'] = df_merged['txt'].astype('category')
df_merged['txt'] = df_merged['txt'].cat.rename_categories(df_txt_cats)

df_merged.dtypes
# id         int32
# txt     category
# attr      object
# dtype: object
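A shorter variant of the same idea, offered as a sketch and not as the answer's own method: cast 'txt' to a categorical before merging and let the merge carry the dtype through. This assumes that pandas preserves the categorical dtype of a non-key column in `merge`, which is my understanding of current versions; if that holds, the manual code-and-mapping round trip is not needed.

```python
import pandas as pd

texts = ["first sent", "another one", "I think you", "will like this", "will work"]
df = pd.DataFrame({"id": range(5), "txt": texts})
df2 = pd.DataFrame({
    "attr": ["chem", "dis", "chem", "chem", "chem", "dis", "dis", "dis", "dis", "chem"],
    "id": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
})

# Store each distinct text once; rows hold small integer codes
df["txt"] = df["txt"].astype("category")

merged = df.merge(df2, on="id")

print(merged.dtypes)
# The distinct strings live in the categories ...
print(merged["txt"].cat.categories.tolist())
# ... while each merged row carries only an integer code
print(merged["txt"].cat.codes.tolist())
```

Checking `merged.dtypes` after the merge is a cheap way to confirm the assumption on the pandas version in use.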

Regarding python - Joining a hierarchy of relational tables in a Pandas DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48497795/
