I want to impute missing values using grouped summary statistics computed from a different dataframe. For instance, I would like to fill the missing values in numvar_original in df1 so that it looks like numvar_ideal, where each missing value is replaced by the group-level mean from df2:
df1
catvar numvar_original numvar_ideal
1 10 10
1 NaN 5.5
2 30 30
2 NaN 6.5
df2
catvar numvar_original
1 5
1 6
2 6
2 7
# I tried the following:
df1['numvar'].fillna(df2.groupby('catvar')['numvar'].transform('mean'), inplace=True)
# The missing values weren't replaced
df1['numvar'].fillna(df1.groupby('catvar')['numvar'].transform('mean'), inplace=True)
# I checked that this fills the missing values, but it uses group-level means from df1, whereas I need the means from df2
Is there a way to do this without resorting to a combination of apply or map with a dictionary (since I've read that apply/map can be slower with larger datasets)?
Comments
Could you please edit your question to include sample input and expected output (in text form)?
I've edited to include the sample input and intended output
Recommended answer
Try:
means = df2.groupby("catvar")["numvar_original"].mean().to_dict()
df1 = df1.groupby("catvar", group_keys=False).apply(
    # fill every NaN in the group with the df2 mean for that group's catvar
    lambda x: x.fillna(means[x["catvar"].iloc[0]])
)
print(df1)
Prints:
catvar numvar_original numvar_ideal
0 1 10.0 10.0
1 1 5.5 5.5
2 2 30.0 30.0
3 2 6.5 6.5
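If df1 has other columns that should be left alone, one possible variant is to restrict the fill to a single column by grouping only that column and using transform. The sketch below is an illustration under assumptions, not part of the original answer: the column named other is hypothetical, and it assumes every catvar value in df1 also appears in df2 (otherwise the lookup raises a KeyError).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "catvar": [1, 1, 2, 2],
    "numvar_original": [10, np.nan, 30, np.nan],
    "other": [np.nan, 1.0, np.nan, 2.0],  # hypothetical extra column that must stay untouched
})
df2 = pd.DataFrame({"catvar": [1, 1, 2, 2], "numvar_original": [5, 6, 6, 7]})

# Group-level means from df2, indexed by catvar
means = df2.groupby("catvar")["numvar_original"].mean()

# Inside transform, each group's Series carries the group key as s.name,
# so only numvar_original is filled, using the df2 mean for that key
df1["numvar_original"] = df1.groupby("catvar")["numvar_original"].transform(
    lambda s: s.fillna(means[s.name])
)
print(df1)
```

This still calls a Python function once per group, so it is likely no faster than the apply version; it only narrows which column gets filled.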
Or, using .merge:
means = df2.groupby("catvar")["numvar_original"].mean()
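df1["numvar_original"] = df1["numvar_original"].fillna(
    # how="left" keeps every row of df1 in its original order, so the merged
    # column lines up with df1 (assuming df1 has a default RangeIndex)
    df1.merge(means, on="catvar", how="left")["numvar_original_y"]
)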
print(df1)
Prints:
catvar numvar_original numvar_ideal
0 1 10.0 10.0
1 1 5.5 5.5
2 2 30.0 30.0
3 2 6.5 6.5
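Another option worth trying is to look up each row's group mean with Series.map, passing the Series of means rather than a dictionary or a function. This is a minimal sketch on the sample data from the question, not a benchmarked recommendation; whether it is fast enough on a large dataset is something to measure. Because the mapped Series keeps df1's index, fillna aligns row by row, and only numvar_original is touched.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "catvar": [1, 1, 2, 2],
    "numvar_original": [10, np.nan, 30, np.nan],
})
df2 = pd.DataFrame({"catvar": [1, 1, 2, 2], "numvar_original": [5, 6, 6, 7]})

# Group-level means from df2, indexed by catvar
means = df2.groupby("catvar")["numvar_original"].mean()

# Map each row's catvar to its df2 mean; the result shares df1's index,
# so fillna fills each NaN with the mean of that row's own group
df1["numvar_original"] = df1["numvar_original"].fillna(df1["catvar"].map(means))
print(df1)
```

A catvar value that is absent from df2 maps to NaN, so the corresponding missing value simply stays missing instead of raising an error.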
Comments
Thanks! Both work well with the sample dataset. However, for the first approach, how can I make it more targeted (assuming there are other variables in df1)? Currently, it seems to replace missing values in all other columns too.
@MarshaT Could you please edit your question to include sample input and expected output for the new dataset?