
python - pandas merge fails to match after parsing a date column

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:01:30

I have the following two data frames:

df1 = pd.DataFrame({'date': ['2012-12-31', '2013-12-31', '9999-12-31'],
                    'value': [4, 5, 6]})

df2 = pd.DataFrame({'date': ['2013-12-31', '2012-12-31', '2010-12-31'],
                    'value': [14, 55, 36]})

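The problem with df1 is that its ['date'] column contains a value that cannot be parsed directly into a timestamp, so I used the following function: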

def to_datetime(x):
    # Try to parse the value; fall back to the raw input if parsing fails.
    try:
        res = pd.to_datetime(x)
    except Exception:
        res = x
    return res

My new columns are then created as follows:

df1['date_new'] = df1['date'].apply(to_datetime)
df2['date_new'] = df2['date'].apply(to_datetime)

I want to merge the two data frames on ['date_new'], but no values match.

df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])

However,

df1['date_new'][0] == df2['date_new'][1]

returns True. The full code is as follows:

import pandas as pd

def to_datetime(x):
    # Try to parse the value; fall back to the raw input if parsing fails.
    try:
        res = pd.to_datetime(x)
    except Exception:
        res = x
    return res

df1 = pd.DataFrame({'date': ['2012-12-31', '2013-12-31', '9999-12-31'],
                    'value': [4, 5, 6]})

df2 = pd.DataFrame({'date': ['2013-12-31', '2012-12-31', '2010-12-31'],
                    'value': [14, 55, 36]})

df1['date_new'] = df1['date'].apply(to_datetime)
df2['date_new'] = df2['date'].apply(to_datetime)

df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])

Please tell me why this happens. Thanks!

Best answer

pd.to_datetime has a convenient errors parameter that you can set to 'coerce'. With it, your code appears to work:

df1['date_new'] = pd.to_datetime(df1['date'], errors='coerce')
df2['date_new'] = pd.to_datetime(df2['date'], errors='coerce')

df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])


>>> df3
       date_x  value_x   date_new      date_y  value_y
0  2012-12-31        4 2012-12-31  2012-12-31       55
1  2013-12-31        5 2013-12-31  2013-12-31       14

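Note that because the dates are coerced, any value that does not fit a date format, or that lies outside the range datetime64[ns] can represent (such as 9999-12-31), becomes NaT. These coerced NaT values will match each other in the merge. For example: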

df1 = pd.DataFrame({'date': ['2012-12-31', '2013-12-31', '9999-12-31', 'xyz'],
                    'value': [4, 5, 6, 14]})

df2 = pd.DataFrame({'date': ['2013-12-31', '2012-12-31', '2010-12-31', 'sss'],
                    'value': [14, 55, 36, 12]})

df1['date_new'] = pd.to_datetime(df1['date'], errors='coerce')
df2['date_new'] = pd.to_datetime(df2['date'], errors='coerce')

df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])

This results in:

>>> df3
       date_x  value_x   date_new      date_y  value_y
0  2012-12-31        4 2012-12-31  2012-12-31       55
1  2013-12-31        5 2013-12-31  2013-12-31       14
2  9999-12-31        6        NaT         sss       12
3         xyz       14        NaT         sss       12
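
The number of NaT rows in a merged result can be anticipated before merging, because every NaT key on the left pairs with every NaT key on the right. The following is a minimal sketch rather than part of the original answer; the sample values and the names left, right, n_left, n_right, and merged are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: two unparseable dates on the left, one on the right.
left = pd.DataFrame({"date": ["2012-12-31", "xyz", "abc"], "value": [4, 5, 6]})
right = pd.DataFrame({"date": ["2012-12-31", "sss"], "value": [55, 12]})

left["date_new"] = pd.to_datetime(left["date"], errors="coerce")
right["date_new"] = pd.to_datetime(right["date"], errors="coerce")

# Count the keys that were coerced to NaT on each side.
n_left = int(left["date_new"].isna().sum())
n_right = int(right["date_new"].isna().sum())

merged = pd.merge(left, right, how="inner", on="date_new")

# NaT keys match each other, so the merge contains n_left * n_right NaT rows.
n_merged_nat = int(merged["date_new"].isna().sum())
print(n_left, n_right, n_merged_nat)
```

If either count is zero, no NaT rows appear in an inner merge, which is why the first example above showed only the two valid dates.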

To avoid these problems, you can merge only the subsets of the data frames where date_new is not null:

df3 = pd.merge(df1.loc[df1.date_new.notnull()], df2.loc[df2.date_new.notnull()], how = 'inner', on = ['date_new'])
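
The same filtering can also be written with DataFrame.dropna. This is a sketch under the assumption that rows with unparseable dates can simply be discarded; the sample data is illustrative and deliberately uses only clearly invalid strings.

```python
import pandas as pd

df1 = pd.DataFrame({"date": ["2012-12-31", "2013-12-31", "xyz"],
                    "value": [4, 5, 6]})
df2 = pd.DataFrame({"date": ["2013-12-31", "2012-12-31", "sss"],
                    "value": [14, 55, 36]})

df1["date_new"] = pd.to_datetime(df1["date"], errors="coerce")
df2["date_new"] = pd.to_datetime(df2["date"], errors="coerce")

# Drop rows whose key could not be parsed, then merge only the valid keys.
df3 = pd.merge(
    df1.dropna(subset=["date_new"]),
    df2.dropna(subset=["date_new"]),
    how="inner",
    on="date_new",
)
print(df3)
```

Whether to use dropna(subset=...) or the .loc[...notnull()] form is a matter of taste; both pass only rows with a valid key to the merge.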

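As for why this happens in your code: when your function encounters a date it cannot parse, it returns the original string, so the resulting series has dtype: object: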

df1['date_new'] = df1['date'].apply(to_datetime)
>>> df1['date_new']
0    2012-12-31 00:00:00
1    2013-12-31 00:00:00
2             9999-12-31
Name: date_new, dtype: object

But when all the dates parse cleanly, the series has dtype: datetime64[ns]:

df2['date_new'] = df2['date'].apply(to_datetime)
>>> df2['date_new']
0   2013-12-31
1   2012-12-31
2   2010-12-31
Name: date_new, dtype: datetime64[ns]

So the two key columns have different dtypes and cannot be merged correctly.
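
A dtype mismatch like this can be caught before merging. The sketch below is not from the original answer: keys_are_datetime is a hypothetical helper, and the two series are built by hand to mimic the columns shown above. It relies on pandas.api.types.is_datetime64_any_dtype to test each key column.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Mimic the two key columns from the question: one keeps a leftover string and
# is therefore object dtype; the other was parsed completely.
mixed = pd.Series([pd.Timestamp("2012-12-31"), "9999-12-31"], dtype=object)
parsed = pd.to_datetime(pd.Series(["2013-12-31", "2012-12-31"]))

def keys_are_datetime(*columns):
    """Return True only if every given column has a datetime64 dtype."""
    return all(is_datetime64_any_dtype(col) for col in columns)

print(keys_are_datetime(mixed, parsed))   # False
print(keys_are_datetime(parsed, parsed))  # True
```

Running such a check on both key columns before calling pd.merge makes the problem visible immediately instead of surfacing as an unexpectedly empty result.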

Regarding "python - pandas merge fails to match after parsing a date column", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49216237/
