
python - Simultaneous iteration/list comprehension issue (getting a merge report in Pandas in a UDF)

Repost. Author: 行者123. Updated: 2023-11-28 19:23:24

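Stata users will know that merging datasets generates a _merge variable that marks each observation as successfully matched, present only in the master dataset, or present only in the using dataset. I am trying to recreate this in Pandas by writing my own function, which labels these three cases merge_1, merge_2 and merge_3 respectively. I have the following working: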

import pandas as pd

def MergeReport(DF1, DF2, keys, HOW = 'outer'):

    if len(keys) == 1:
        # Single key column: compare the sets of key values directly.
        KeysDF1 = set(DF1[keys[0]])
        KeysDF2 = set(DF2[keys[0]])
        MasterError = list(KeysDF1.difference(KeysDF2))  # keys found only in DF1 (master)
        UsingError = list(KeysDF2.difference(KeysDF1))   # keys found only in DF2 (using)
        MG = pd.merge(DF1, DF2, on = keys, how = HOW)
        MG['_merge'] = None

        for row in MG.index:
            # .loc writes into MG itself; MG['_merge'][row] = ... is chained assignment.
            if MG[keys[0]][row] in MasterError:
                MG.loc[row, '_merge'] = "merge_2"
            elif MG[keys[0]][row] in UsingError:
                MG.loc[row, '_merge'] = "merge_3"
            else:
                MG.loc[row, '_merge'] = "merge_1"

        return MG

    else:
        # Two key columns: each key is a (keys[0], keys[1]) tuple.
        KeysDF1 = set(zip(DF1[keys[0]], DF1[keys[1]]))
        KeysDF2 = set(zip(DF2[keys[0]], DF2[keys[1]]))
        MasterError = list(KeysDF1.difference(KeysDF2))
        UsingError = list(KeysDF2.difference(KeysDF1))
        MG = pd.merge(DF1, DF2, on = keys, how = HOW)
        MG['_merge'] = None

        for row in MG.index:
            if tuple([MG[keys[0]][row], MG[keys[1]][row]]) in MasterError:
                MG.loc[row, '_merge'] = "merge_2"
            elif tuple([MG[keys[0]][row], MG[keys[1]][row]]) in UsingError:
                MG.loc[row, '_merge'] = "merge_3"
            else:
                MG.loc[row, '_merge'] = "merge_1"

        return MG

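The arguments are DataFrame1, DataFrame2, a list of "keys" (i.e. the columns to merge on), and an optional argument HOW that is passed to pd.merge as how = HOW. Eventually the arguments will be extended to cover all of the parameters of pd.merge.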

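My problem is straightforward: I do not know how to write the code so that it accepts a list of keys of arbitrary length. The problem occurs here: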

KeysDF1 = set(zip(DF1[keys[0]], DF1[keys[1]]))      
KeysDF2 = set(zip(DF2[keys[0]], DF2[keys[1]]))

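I cannot figure out how to write this so that it iterates over a list of keys of arbitrary length. In particular, I tried a list comprehension: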

KeysDF1 =   set(zip(tuple([DF1[keys[x]] for x in range(len(keys))])))   

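but this did not work, because "Series objects are mutable and cannot be hashed". I imagine I will then run into a similar problem at this point in the code: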

if tuple([MG[keys[0]][row], MG[keys[1]][row]]) in MasterError:
    MG.loc[row, '_merge'] = "merge_2"
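
The hashing error described above can be reproduced in isolation. The following is a minimal sketch, assuming a small made-up DataFrame (the column names and values are hypothetical and not taken from the question). It shows that without unpacking, zip receives a single iterable, so each item it yields wraps a whole Series, and building a set from those items fails.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
DF1 = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"], "v": [10, 20, 30]})
keys = ["a", "b"]

columns = tuple([DF1[keys[x]] for x in range(len(keys))])

# zip() is given ONE iterable (the tuple of columns), so it yields one
# 1-tuple per key column, each wrapping an entire Series.
items = list(zip(columns))
print(len(items))          # number of key columns, not number of rows
print(type(items[0][0]))   # a pandas Series

try:
    set(items)             # hashing a tuple hashes the Series inside it
    hashing_failed = False
except TypeError as err:
    hashing_failed = True
    print("TypeError:", err)
```

The exact wording of the error message may differ between pandas versions; the cause is the same in each case.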

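Edit: At another user's suggestion, I am posting an alternative way of achieving the same goal. I am not proposing this as a solution to the question itself, only as a way to avoid the problem: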

import numpy as np
import pandas as pd

def MergeReport(DF1, DF2, how = 'inner', on = None, left_on = None, right_on = None,
                left_index = False, right_index = False,
                sort = False, suffixes = ('_x', '_y'), copy = True):
    # Flag the origin of each row; assign() returns copies, so the caller's frames are left unchanged.
    DF1 = DF1.assign(Master = 1)
    DF2 = DF2.assign(Using = 2)

    MDF = pd.merge(DF1, DF2, how = how, on = on, left_on = left_on, right_on = right_on,
                   left_index = left_index, right_index = right_index,
                   sort = sort, suffixes = suffixes, copy = copy)

    # A row missing from one side has NaN in that side's flag; count it as 0.
    MDF['Master'] = MDF['Master'].fillna(0)
    MDF['Using'] = MDF['Using'].fillna(0)
    # The sum is 1 for master only, 2 for using only, 3 for matched.
    MDF['_Merge'] = MDF['Master'] + MDF['Using']
    del MDF['Master']
    del MDF['Using']
    List = ['1_MasterOnly', '2_UsingOnly', '3_Matched']
    LIST = [List[int(MDF['_Merge'][row] - 1)] for row in MDF.index]
    MDF['_Merge'] = np.array(LIST)
    return MDF
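
For comparison, pd.merge itself accepts an indicator argument that adds a _merge column with the values left_only, right_only and both, which is close to what the flag columns above compute by hand. The sketch below uses made-up frames (master, using and their contents are hypothetical) and maps pandas' labels onto the label strings used in the function above. It is one possible approach, not necessarily what the original poster ended up using.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
master = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
using = pd.DataFrame({"id": [2, 3, 4], "y": ["p", "q", "r"]})

# indicator=True adds a categorical "_merge" column to the result.
report = pd.merge(master, using, on="id", how="outer", indicator=True)

# Translate pandas' labels into the labels used by the function above.
stata_style = {
    "left_only": "1_MasterOnly",
    "right_only": "2_UsingOnly",
    "both": "3_Matched",
}
report["_Merge"] = report["_merge"].astype(str).map(stata_style)

labels = dict(zip(report["id"], report["_Merge"]))
print(report)
```

Because the merge keys are passed straight through on (or left_on and right_on), this works for any number of key columns without building key tuples.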

Best answer

I am not sure I have understood correctly, but I think the generalization of

KeysDF1 = set(zip(DF1[keys[0]], DF1[keys[1]]))

is not

KeysDF1 = set(zip(tuple([DF1[keys[x]] for x in range(len(keys))])))

but rather

KeysDF1 = set(zip(*tuple([DF1[keys[x]] for x in range(len(keys))])))

or simply

KeysDF1 = set(zip(*[DF1[keys[x]] for x in range(len(keys))]))
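
To show the unpacking form at work, here is a minimal sketch that applies it to three key columns and then labels the merged rows with the merge_1, merge_2 and merge_3 scheme from the question. The helper key_tuples and the sample frames are hypothetical and introduced only for illustration.

```python
import pandas as pd

def key_tuples(df, keys):
    """Return the set of key tuples in df, one tuple per row (hypothetical helper)."""
    # zip(*columns) walks the key columns in parallel, yielding one tuple per row.
    return set(zip(*[df[k] for k in keys]))

# Hypothetical sample data, for illustration only.
DF1 = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "x"], "c": [0, 0, 1], "v1": [10, 20, 30]})
DF2 = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "x", "z"], "c": [0, 1, 1], "v2": [7, 8, 9]})
keys = ["a", "b", "c"]

KeysDF1 = key_tuples(DF1, keys)
KeysDF2 = key_tuples(DF2, keys)
MasterError = KeysDF1 - KeysDF2   # key tuples found only in DF1
UsingError = KeysDF2 - KeysDF1    # key tuples found only in DF2

MG = pd.merge(DF1, DF2, on=keys, how="outer")

# The same unpacking gives one key tuple per merged row.
row_keys = list(zip(*[MG[k] for k in keys]))
MG["_merge"] = [
    "merge_2" if k in MasterError else "merge_3" if k in UsingError else "merge_1"
    for k in row_keys
]
print(MG)
```

With a single key the same code yields 1-tuples, so the len(keys) == 1 branch no longer needs separate handling. Keeping MasterError and UsingError as sets rather than lists also makes each membership test a hash lookup.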

Regarding python - Simultaneous iteration/list comprehension issue (getting a merge report in Pandas in a UDF), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19155625/
