I am trying to update the missing values of a dataframe in pandas with a smaller subset but cannot seem to get pd.merge, df.loc or pd.join to work.
The scenario is like this: I have a DataFrame df such that:
df = pd.DataFrame({"EmpId":[1,2,3,...,99,100],
                   "Name":['Fred','Barney','Wilma',...,'Bam-Bam','Pebbles'],
                   "Age":[40,35,np.nan,...,5,np.nan]})
And I get a new dataframe df1 like:
df1 = pd.DataFrame({"EmpId":[3,...,100],
                    "Age":[30,...,6]})
The "EmpId"s in df1 are a non-sequential set of ids which exist in df, with "Age" values which are NaN in df. I am trying to fill the missing entries in df without duplicating or otherwise affecting the existing values.
I have tried pd.merge, which tries to add df1 as new columns in df (even when using suffixes=(False,False)); pd.join has a similar effect.
I have tried using df.loc[df.EmpId == df1.EmpId, 'Age'] = df1.loc[df1.EmpId == df.EmpId, 'Age'], but whilst I can parse the information I require, it won't seem to update df, which continues to have the NaN values.
I have tried df.update(df1) but get a ValueError.
I've even tried a for...if... construct with df.loc, but none of these seem to work as I intend.
df and df1 have different shapes.
If anyone has any ideas where I'm going wrong, I would appreciate your input.
Comments
Your minimal example should be complete; please do not use ... and do provide the exact matching expected output.
I think if you use set_index to make "EmpId" the index, then df.update(df1) should work, but I'm not sure why it's giving a ValueError now.
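The suggestion in that comment can be sketched roughly as follows. This is only an illustration, not the commenter's tested code: it assumes the small five-row example data used elsewhere in this thread, and it passes overwrite=False (an assumption on my part) so that only values that are NaN in df get filled.

```python
import numpy as np
import pandas as pd

# Example data assumed for illustration (a complete, small version of the question's frames).
df = pd.DataFrame({
    "EmpId": [1, 2, 3, 99, 100],
    "Name": ["Fred", "Barney", "Wilma", "Bam-Bam", "Pebbles"],
    "Age": [40, 35, np.nan, 5, np.nan],
})
df1 = pd.DataFrame({"EmpId": [3, 100], "Age": [30, 6]})

# Index both frames by EmpId so update() aligns rows by id, not by position.
df = df.set_index("EmpId")
# overwrite=False leaves existing non-NaN values in df untouched.
df.update(df1.set_index("EmpId"), overwrite=False)
# Turn EmpId back into an ordinary column.
df = df.reset_index()
print(df)
```

Because update() works in place and aligns on the index, the different shapes of df and df1 should not matter once both are indexed by "EmpId".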
Recommended answer
If I understand the problem correctly, you can use merge plus some additional operations on the result columns, while making sure you don't change the original values by using an ifnull helper function:
import numpy as np
import pandas as pd

df = pd.DataFrame({"EmpId":[1,2,3,99,100],
                   "Name":['Fred','Barney','Wilma','Bam-Bam','Pebbles'],
                   "Age":[40,35,np.nan,5,np.nan]})
df1 = pd.DataFrame({"EmpId":[3,100],
                    "Age":[30,6]})

# Left merge keeps every row of df; df1's Age arrives as "Age_filled".
df = df.merge(df1, on="EmpId", how="left", suffixes=("", "_filled"))

def ifnull(val, replace):
    # Return the replacement only when the original value is missing.
    if val is None or pd.isna(val):
        return replace
    return val

df["Age"] = df[["Age", "Age_filled"]] \
    .apply(lambda row: ifnull(row["Age"], row["Age_filled"]), axis=1)
df.drop("Age_filled", axis=1, inplace=True)
print(df)
Output:
   EmpId     Name   Age
0      1     Fred  40.0
1      2   Barney  35.0
2      3    Wilma  30.0
3     99  Bam-Bam   5.0
4    100  Pebbles   6.0
This will only work if the EmpId values are in fact unique, as in your example.
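As a possible alternative to the row-wise apply, the same fill can be sketched with vectorized operations. This is not the approach from the answer itself, just a variant under the same assumptions: the five-row example data, and unique EmpId values in df1 (checked explicitly here, since Series.map needs a uniquely indexed lookup). The name age_by_id is made up for this illustration.

```python
import numpy as np
import pandas as pd

# Same example data as in the answer (assumed for illustration).
df = pd.DataFrame({
    "EmpId": [1, 2, 3, 99, 100],
    "Name": ["Fred", "Barney", "Wilma", "Bam-Bam", "Pebbles"],
    "Age": [40, 35, np.nan, 5, np.nan],
})
df1 = pd.DataFrame({"EmpId": [3, 100], "Age": [30, 6]})

# The lookup below relies on each EmpId appearing at most once in df1.
assert df1["EmpId"].is_unique

# Build an EmpId -> Age lookup Series (age_by_id is a hypothetical name).
age_by_id = df1.set_index("EmpId")["Age"]

# Look up a candidate age for every row, then use it only where Age is NaN.
df["Age"] = df["Age"].fillna(df["EmpId"].map(age_by_id))
print(df)
```

Since fillna only touches missing entries, ages that already exist in df are left as they are, and no extra column has to be created and dropped.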