gpt4 book ai didi

python - 当索引不对齐时,pandas DataFrame 更新/合并

转载 作者:行者123 更新时间:2023-12-01 04:35:10 25 4
gpt4 key购买 nike

考虑两个数据帧,它们存储相同观察的相同特征的信息,但时间段不同:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"obs":["a","a","b","b"],
"year":[1,2,1,2],
"val":[3, np.NaN, 3, np.NaN]})

df1

Out:
obs val year
0 a 3 1
1 a NaN 2
2 b 3 1
3 b NaN 2

df2 = pd.DataFrame({"obs":["a","a","b","b"],
"val":[np.NaN, 4, np.NaN, 4],
"year":[1,2,1,2]})
df2.index = (range(5,9))

df2

Out:
obs val year
5 a NaN 1
6 a 4 2
7 b NaN 1
8 b 4 2

现在我想合并或组合这两个数据框,以便将值收集在单个列中,并将 df1 中的 NaN 替换为相应的观察结果-来自 df2 的年份值。我可以通过这样做来实现这一目标:

merged = pd.merge(df1, df2, on=["obs", "year"], how="left")
merged.loc[~np.isfinite(merged.val_x), 'val_x'] = merged[~np.isfinite(merged.val_x)].val_y

即本质上是进行常规合并,然后手动将一列中的 NaN 替换为另一列的值。

有更好/更简洁的方法吗?我觉得某种 df.combinedf.combine_firstdf.update 会做我想要的事情,但它们似乎都对齐索引。

最佳答案

我假设您的目标是获取 merged['val_x'] 并且您确实不关心 merged 中的其他列。

以下是一些选项:

def using_merge(df1, df2):
merged = pd.merge(df1, df2, on=["obs", "year"], how="left")
mask = ~np.isfinite(merged.val_x)
merged.loc[mask, 'val_x'] = merged.loc[mask, 'val_y']
return merged['val_x']

def using_update(df1, d2):
merged = pd.merge(df1, df2, on=["obs", "year"], how="left")
merged['val_y'].update(merged['val_x'])
return merged['val_y']

def using_set_index(df1, df2):
df1 = df1.set_index(['obs','year'])
df2 = df2.set_index(['obs','year'])
return df1['val'].combine_first(df2['val'])

没有一个比其他的更简洁。但有一点性能差异:

import numpy as np
import pandas as pd
import itertools as IT

# generate a large-ish example
np.random.seed(2015)
N, M = 200, 200
df1 = pd.DataFrame(list(IT.product(np.arange(N), np.arange(M))),
columns=['obs','year'])
df1['val'] = np.random.choice([1,2,np.nan], size=len(df1))

df2 = pd.DataFrame(list(IT.product(np.arange(N), np.arange(M))),
columns=['obs','year'])
df2['val'] = np.random.choice([1,2,np.nan], size=len(df1))
df2.index = np.arange(len(df2)) + len(df1)

m1 = using_merge(df1, df2)
m2 = using_update(df1, df2)
m3 = using_set_index(df1, df2)
assert m3.reset_index(drop=True).equals(m1)
assert m1.equals(m2)
<小时/>
In [158]: %timeit using_merge(df1, df2)
100 loops, best of 3: 13.6 ms per loop

In [159]: %timeit using_update(df1, df2)
100 loops, best of 3: 12.3 ms per loop

In [160]: %timeit using_set_index(df1, df2)
100 loops, best of 3: 8 ms per loop

因此,对于较大的 DataFrame,设置索引,然后使用 combine_first 是值得的。

关于python - 当索引不对齐时,pandas DataFrame 更新/合并,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/31831519/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com