
python - pandas join DF - different semantics of merge and join

Reposted. Author: 太空宇宙. Updated: 2023-11-04 00:37:20

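I want to join 2 DFs in pandas. Some columns are int or float, the others are categories. (The same cat codes/indices are not enforced for the categories in df_a and df_b.) Their common columns are a list of 8 float and category columns.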

Joining via

df_a.merge(df_b, how='inner', on=join_columns)

returns no rows at all, while joining via

df_a.join(df_b, lsuffix='_l', rsuffix='_r')

seems to work.

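But I am a bit confused about why the merge fails, and whether I should convert all columns to object to prevent joining by cat codes, which could be wrong.

That is, if left is chosen as the join method for merge, the joined columns contain only NaN values. Unfortunately, I am not quite sure how to construct a useful minimal example.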

Edit

Here is an example

import pandas as pd

raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'name': ['A', 'B', 'C', 'D', 'E'],
'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
'age_group' : [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a


raw_data = {
'subject_id': ['1', '2', '3' ],
'name': ['Billy', 'Brian', 'Bran'],
'nationality': ['DE', 'US', 'US'],
'age_group' : [1, 1, 3],
'average_return_per_group' : [1.5, 2.3, 1.4]}
df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'age_group', 'average_return_per_group'])
df_b.nationality = df_b.nationality.astype('category')
df_b


# some result is joined
df_a.join(df_b, lsuffix='_l', rsuffix='_r')

# this *fails*: only NULL values are joined, or there is no result for an inner join
df_a.merge(df_b, how='left', on=['nationality', 'age_group'])

Best answer

By default, join joins along the index, while merge joins along the columns that have the same name.
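To make the distinction concrete, here is a minimal sketch. The frames `left` and `right` and their column names are hypothetical toy data, not the frames from the question.

```python
import pandas as pd

# Hypothetical toy frames (assumed for illustration only).
left = pd.DataFrame({"key": ["x", "y", "z"], "a": [1, 2, 3]})
right = pd.DataFrame({"key": ["z", "x"], "b": [10.0, 20.0]})

# join aligns on the index labels (0, 1, 2 against 0, 1) and ignores
# the values stored in the "key" columns.
joined = left.join(right, lsuffix="_l", rsuffix="_r")

# merge aligns on the values of the shared column "key".
merged = left.merge(right, how="left", on="key")

print(joined)
print(merged)
```

In `joined`, row 0 pairs `x` with `z` simply because both sit at index 0. In `merged`, `x` is paired with the `right` row whose `key` is also `x`.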

Check this:

In [115]: df_a.join(df_b, lsuffix='_l', rsuffix='_r')
Out[115]:
   subject_id_l  name_l  nationality_l  age_group_l  subject_id_r  name_r  nationality_r  age_group_r  average_returns_per_group
0             1       A             DE            1             1   Billy             DE          1.0                        NaN
1             2       B            AUT            2             2   Brian             US          1.0                        NaN
2             3       C             US            1             3    Bran             US          3.0                        NaN
3             4       D             US            3           NaN     NaN            NaN          NaN                        NaN
4             5       E             US            1           NaN     NaN            NaN          NaN                        NaN

Let's set ['a','b','c'] as the index of df_b and try to join again. You will see only NaN in all *_r columns:

In [116]: df_a.join(df_b.set_index(pd.Index(['a','b','c'])), lsuffix='_l', rsuffix='_r')
Out[116]:
   subject_id_l  name_l  nationality_l  age_group_l  subject_id_r  name_r  nationality_r  age_group_r  average_returns_per_group
0             1       A             DE            1           NaN     NaN            NaN          NaN                        NaN
1             2       B            AUT            2           NaN     NaN            NaN          NaN                        NaN
2             3       C             US            1           NaN     NaN            NaN          NaN                        NaN
3             4       D             US            3           NaN     NaN            NaN          NaN                        NaN
4             5       E             US            1           NaN     NaN            NaN          NaN                        NaN

In [117]: df_b.set_index(pd.Index(['a','b','c']))
Out[117]:
   subject_id   name  nationality  age_group  average_returns_per_group
a           1  Billy           DE          1                        NaN
b           2  Brian           US          1                        NaN
c           3   Bran           US          3                        NaN
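If the goal is to have join match on key columns rather than on the row index, one option is to move the keys of the right frame into its index and pass `on=` for the left frame. The sketch below uses hypothetical toy frames (`left`, `right`, and the column `avg_return` are assumptions) and plain string keys.

```python
import pandas as pd

# Hypothetical toy frames (assumed for illustration only).
left = pd.DataFrame({
    "nationality": ["DE", "AUT", "US"],
    "age_group": [1, 2, 1],
    "name": ["A", "B", "C"],
})
right = pd.DataFrame({
    "nationality": ["DE", "US"],
    "age_group": [1, 1],
    "avg_return": [1.5, 2.3],
})
keys = ["nationality", "age_group"]

# Put the key columns of the right frame into its index, then tell join
# which columns of the left frame to match against that index.
via_join = left.join(right.set_index(keys), on=keys)

# The same lookup expressed as a column-based left merge.
via_merge = left.merge(right, how="left", on=keys)

print(via_join)
print(via_merge)
```

Both calls should yield the same `avg_return` values here; which form to use is mostly a matter of taste once the keys are explicit.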

UPDATE: IMO merge works as expected (as described in the docs)

In [151]: df_a.merge(df_b, on=['nationality', 'age_group'], how='left', suffixes=['_l','_r'])
Out[151]:
   subject_id_l  name_l  nationality  age_group  subject_id_r  name_r  average_return_per_group
0             1       A           DE          1             1   Billy                       1.5
1             2       B          AUT          2           NaN     NaN                       NaN
2             3       C           US          1             2   Brian                       2.3
3             4       D           US          3             3    Bran                       1.4
4             5       E           US          1             2   Brian                       2.3
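On the worry about category codes: as far as I can tell, merge matches categorical keys by their labels, not by their integer codes, so converting to object should not be required for correctness. A minimal sketch on hypothetical toy frames (`left`, `right`, `avg_return` are assumptions), using `indicator=True` to see which rows found a partner:

```python
import pandas as pd

# Hypothetical toy frames (assumed for illustration only).
left = pd.DataFrame({"nationality": ["DE", "AUT", "US"], "name": ["A", "B", "C"]})
right = pd.DataFrame({"nationality": ["DE", "US"], "avg_return": [1.5, 2.3]})

left["nationality"] = left["nationality"].astype("category")
right["nationality"] = right["nationality"].astype("category")

# The integer codes differ between the frames: "US" has code 2 on the left
# (categories AUT, DE, US) but code 1 on the right (categories DE, US).
left_codes = left["nationality"].cat.codes.tolist()    # [1, 0, 2]
right_codes = right["nationality"].cat.codes.tolist()  # [0, 1]

# indicator=True adds a "_merge" column with "both" or "left_only" per row.
merged = left.merge(right, how="left", on="nationality", indicator=True)
print(merged)
```

If `_merge` shows `left_only` for rows that were expected to match, comparing `df_a[join_columns].dtypes` with `df_b[join_columns].dtypes` and checking the float keys for tiny differences would be a reasonable next step.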

Regarding python - pandas join DF - different semantics of merge and join, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43453542/
