
Impute missing values using group-level summary statistics from a different DataFrame

Repost. Author: bug小助手. Updated: 2023-10-28 11:51:44



I want to impute missing values in one DataFrame using grouped summary statistics computed from a different DataFrame. For instance, I would like to fill the missing values in numvar_original in df1 so that the column matches numvar_ideal, where each missing value is replaced by the group-level mean from df2:



df1
catvar  numvar_original  numvar_ideal
1       10               10
1       NaN              5.5
2       30               30
2       NaN              6.5


df2
catvar  numvar_original
1       5
1       6
2       6
2       7

# I tried the following:
df1['numvar_original'].fillna(df2.groupby('catvar')['numvar_original'].transform('mean'), inplace=True)
# The missing values weren't replaced

df1['numvar_original'].fillna(df1.groupby('catvar')['numvar_original'].transform('mean'), inplace=True)
# This does fill the missing values, but it uses the group-level means from df1, whereas I need the means from df2
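One possible reason the first attempt has no effect (an assumption, since the real data isn't shown): fillna with a Series aligns on index labels, not on catvar. The result of df2.groupby(...).transform('mean') carries df2's index, so rows of df1 whose index labels don't occur in df2 stay NaN. A minimal sketch, using a hypothetical non-default index on df1 to make the effect visible:

```python
import numpy as np
import pandas as pd

# Hypothetical: df1 has index labels 10..13 instead of the default 0..3
df1 = pd.DataFrame(
    {"catvar": [1, 1, 2, 2], "numvar_original": [10, np.nan, 30, np.nan]},
    index=[10, 11, 12, 13],
)
df2 = pd.DataFrame({"catvar": [1, 1, 2, 2], "numvar_original": [5, 6, 6, 7]})

# Group means broadcast back to df2's rows; the result is indexed 0..3
group_means = df2.groupby("catvar")["numvar_original"].transform("mean")

# fillna looks up fill values by index label; labels 11 and 13 are not in
# group_means, so those rows remain NaN
filled = df1["numvar_original"].fillna(group_means)
print(filled)
```

With the sample frames exactly as posted, both indexes are 0..3, so the labels happen to line up. With real data the two frames rarely share index labels row for row.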


Is there a way to do this without resorting to a combination of apply or map with a dictionary (since I've read that apply/map can be slower with larger datasets)?



Comments

Can you please edit your question to include sample input and expected output (in text form)?

I've edited the question to include the sample input and intended output.

Recommended answer

Try:



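# Group-level means from df2 as a dict: {1: 5.5, 2: 6.5}
means = df2.groupby("catvar")["numvar_original"].mean().to_dict()

# Fill the NaNs of each df1 group with that group's mean from df2.
# Note: this fills NaNs in every column of the group, not only numvar_original.
df1 = df1.groupby("catvar", group_keys=False).apply(
    lambda x: x.fillna(means[x["catvar"].iloc[0]])
)
print(df1)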

Prints:



   catvar  numvar_original  numvar_ideal
0       1             10.0          10.0
1       1              5.5           5.5
2       2             30.0          30.0
3       2              6.5           6.5



OR: Using .merge:



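# Group-level means from df2 as a Series indexed by catvar
means = df2.groupby("catvar")["numvar_original"].mean()

# Left-merge the means onto df1 so every df1 row is kept in order; the merged
# mean column gets the "_y" suffix. fillna aligns on index labels, so this
# relies on df1 having the default 0..n-1 index.
df1["numvar_original"] = df1["numvar_original"].fillna(
    df1.merge(means, on="catvar", how="left")["numvar_original_y"]
)
print(df1)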

Prints:



   catvar  numvar_original  numvar_ideal
0       1             10.0          10.0
1       1              5.5           5.5
2       2             30.0          30.0
3       2              6.5           6.5
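A further option, sketched here as an assumption rather than as part of the answer above: look up each row's group mean with Series.map and pass the result to fillna. The lookup result carries df1's own index, so it does not depend on df1 having a default index, and only numvar_original is touched. The `other` column is hypothetical, added to show that unrelated columns keep their NaNs.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "catvar": [1, 1, 2, 2],
    "numvar_original": [10, np.nan, 30, np.nan],
    "other": [np.nan, 1.0, np.nan, 2.0],  # hypothetical extra column
})
df2 = pd.DataFrame({"catvar": [1, 1, 2, 2], "numvar_original": [5, 6, 6, 7]})

# Group-level means from df2, indexed by catvar
means = df2.groupby("catvar")["numvar_original"].mean()

# Map each df1 row's catvar to its df2 mean; the result shares df1's index,
# so fillna lines up row for row
df1["numvar_original"] = df1["numvar_original"].fillna(df1["catvar"].map(means))
print(df1)
```

Passing a Series to map performs a label lookup rather than calling a Python function per row, so it is not the same as a row-wise apply; whether it is fast enough on a large dataset is worth timing on the real data. A catvar value in df1 that is absent from df2 maps to NaN, so those rows stay missing.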

Comments

Thanks! Both work well with the sample dataset. However, for the first approach, how can I make it more targeted (assuming there are other variables in df1)? Currently, it seems to replace missing values in all other columns too.

@MarshaT Can you please edit your question to include sample input and expected output for the new dataset?
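For the follow-up about other columns: one way to keep the per-group idea of the first approach but restrict it to a single column is to select that column before transforming. A minimal sketch, assuming a hypothetical extra column `other` in df1; it relies on pandas setting each group's `.name` to its group key inside transform.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "catvar": [1, 1, 2, 2],
    "numvar_original": [10, np.nan, 30, np.nan],
    "other": [np.nan, 1.0, np.nan, 2.0],  # hypothetical extra column
})
df2 = pd.DataFrame({"catvar": [1, 1, 2, 2], "numvar_original": [5, 6, 6, 7]})

# Group-level means from df2, indexed by catvar
means = df2.groupby("catvar")["numvar_original"].mean()

# Transform only numvar_original: within each catvar group, s.name is the
# group key, which is used to look up that group's mean from df2
df1["numvar_original"] = df1.groupby("catvar")["numvar_original"].transform(
    lambda s: s.fillna(means.loc[s.name])
)
print(df1)
```

The lambda runs once per group, and a catvar value in df1 that is missing from df2 raises a KeyError in this sketch, so it suits data where every group in df1 also appears in df2.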
