
python - Pandas merge data frames on the nearest timestamp

Reposted. Author: 行者123. Updated: 2023-12-05 00:55:06

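I want to merge two data frames on three columns: email, subject, and timestamp.
The timestamps differ between the data frames, so I need to identify the closest matching timestamp for each combination of email and subject.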

Below is a reproducible example that uses a closest-match function suggested for this question.

import numpy as np
import pandas as pd
from io import StringIO  # pandas.io.parsers does not expose StringIO in current pandas

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series with dates in it
    # calcs the delta between `timepoint` and each date in `time_series`
    # returns the closest date and optionally the time delta to it
    deltas = np.abs(time_series - timepoint)
    # idxmin gives the index label of the smallest delta
    idx_closest_date = deltas.idxmin()
    # .ix has been removed from pandas; use label-based .loc instead
    res = {"closest_date": time_series.loc[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas.loc[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)


a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject3
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject2
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject3
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
"""

b = """timestamp,email,subject,clicks,var1
2016-07-01 02:01:14,a@gmail.com,welcome,1,1
2016-07-01 08:15:48,a@gmail.com,subject2,2,2
2016-07-01 10:17:39,a@gmail.com,subject3,1,7
2016-07-01 14:46:01,a@gmail.com,subject3,1,2
2016-07-01 16:27:28,a@gmail.com,subject4,1,2
2016-07-01 10:17:05,b@gmail.com,subject3,0,0
2016-07-01 02:01:03,b@gmail.com,welcome,0,0
2016-07-01 14:45:05,b@gmail.com,subject3,0,0
2016-07-01 08:16:00,b@gmail.com,subject2,0,0
2016-07-01 17:00:00,b@gmail.com,subject4,0,0
"""

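Note that for a@gmail.com the closest matching timestamp is 10:17:39, whereas for b@gmail.com it is 10:17:05. The excerpt below isolates those rows for illustration only; the code that follows it runs on the full a and b strings defined above.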
a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject3
2016-07-01 10:17:00,b@gmail.com,subject3
"""

b = """timestamp,email,subject,clicks,var1
2016-07-01 10:17:39,a@gmail.com,subject3,1,7
2016-07-01 10:17:05,b@gmail.com,subject3,0,0
"""
df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp'])

df1[['closest', 'time_bt_x_and_y']] = df1.timestamp.apply(find_closest_date, args=[df2.timestamp])
df1

df3 = pd.merge(df1, df2, left_on=['email','subject','closest'], right_on=['email','subject','timestamp'],how='left')

df3
        timestamp_x        email   subject              closest  time_bt_x_and_y          timestamp_y  clicks  var1
2016-07-01 10:17:00  a@gmail.com  subject3  2016-07-01 10:17:05         00:00:05                  NaT     NaN   NaN
2016-07-01 02:01:02  a@gmail.com   welcome  2016-07-01 02:01:03         00:00:01                  NaT     NaN   NaN
2016-07-01 14:45:04  a@gmail.com  subject3  2016-07-01 14:45:05         00:00:01                  NaT     NaN   NaN
2016-07-01 08:14:02  a@gmail.com  subject2  2016-07-01 08:15:48         00:01:46  2016-07-01 08:15:48     2.0   2.0
2016-07-01 16:26:35  a@gmail.com  subject4  2016-07-01 16:27:28         00:00:53  2016-07-01 16:27:28     1.0   2.0
2016-07-01 10:17:00  b@gmail.com  subject3  2016-07-01 10:17:05         00:00:05  2016-07-01 10:17:05     0.0   0.0
2016-07-01 02:01:02  b@gmail.com   welcome  2016-07-01 02:01:03         00:00:01  2016-07-01 02:01:03     0.0   0.0
2016-07-01 14:45:04  b@gmail.com  subject3  2016-07-01 14:45:05         00:00:01  2016-07-01 14:45:05     0.0   0.0
2016-07-01 08:14:02  b@gmail.com  subject2  2016-07-01 08:15:48         00:01:46                  NaT     NaN   NaN
2016-07-01 16:26:35  b@gmail.com  subject4  2016-07-01 16:27:28         00:00:53                  NaT     NaN   NaN

The result is wrong, mainly because the closest date is incorrect: the lookup does not take email and subject into account.

The expected result is:

[image: expected merged result]

It would help to modify the function so that it returns the closest timestamp for a given email and subject.
df1.groupby(['email','subject'])['timestamp'].apply(find_closest_date, args=[df1.timestamp])

But this raises an error, because the function is not defined for group objects.
What is the best way to do this?

Best answer

Notice that if you merge df1 and df2 on email and subject, the result
contains every possible pairing of the relevant timestamps:

In [108]: result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y']); result
Out[108]:
             timestamp        email   subject          timestamp_y  clicks  var1
0  2016-07-01 10:17:00  a@gmail.com  subject3  2016-07-01 10:17:39       1     7
1  2016-07-01 10:17:00  a@gmail.com  subject3  2016-07-01 14:46:01       1     2
2  2016-07-01 02:01:02  a@gmail.com   welcome  2016-07-01 02:01:14       1     1
3  2016-07-01 14:45:04  a@gmail.com  subject3  2016-07-01 10:17:39       1     7
4  2016-07-01 14:45:04  a@gmail.com  subject3  2016-07-01 14:46:01       1     2
5  2016-07-01 08:14:02  a@gmail.com  subject2  2016-07-01 08:15:48       2     2
6  2016-07-01 16:26:35  a@gmail.com  subject4  2016-07-01 16:27:28       1     2
7  2016-07-01 10:17:00  b@gmail.com  subject3  2016-07-01 10:17:05       0     0
8  2016-07-01 10:17:00  b@gmail.com  subject3  2016-07-01 14:45:05       0     0
9  2016-07-01 02:01:02  b@gmail.com   welcome  2016-07-01 02:01:03       0     0
10 2016-07-01 14:45:04  b@gmail.com  subject3  2016-07-01 10:17:05       0     0
11 2016-07-01 14:45:04  b@gmail.com  subject3  2016-07-01 14:45:05       0     0
12 2016-07-01 08:14:02  b@gmail.com  subject2  2016-07-01 08:16:00       0     0
13 2016-07-01 16:26:35  b@gmail.com  subject4  2016-07-01 17:00:00       0     0

You can now take the absolute value of the timestamp difference for each row:
result['diff'] = (result['timestamp_y'] - result['timestamp']).abs()

and then use
idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin()
result = result.loc[idx]

to find, for each group keyed by ['timestamp','email','subject'], the row with the smallest difference.
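
To make the idxmin step concrete, here is a minimal sketch on a made-up three-row frame. The column name key and the values are hypothetical and stand in for the ['timestamp','email','subject'] group and the diff column; they are not taken from the question's data.

```python
import pandas as pd

# Hypothetical toy data: two candidate matches for key "x", one for key "y".
pairs = pd.DataFrame({
    "key": ["x", "x", "y"],
    "diff": pd.to_timedelta(["39s", "4h", "12s"]),
})

# For each group, idxmin returns the index label of the row with the smallest diff.
idx = pairs.groupby("key")["diff"].idxmin()

# Selecting those labels with .loc keeps exactly one row per group.
closest = pairs.loc[idx]
print(closest)
```

The full script that follows applies this same selection to the question's data.
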
import numpy as np
import pandas as pd
from io import StringIO  # pandas.io.parsers does not expose StringIO in current pandas

a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject3
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject2
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject3
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
"""

b = """timestamp,email,subject,clicks,var1
2016-07-01 02:01:14,a@gmail.com,welcome,1,1
2016-07-01 08:15:48,a@gmail.com,subject2,2,2
2016-07-01 10:17:39,a@gmail.com,subject3,1,7
2016-07-01 14:46:01,a@gmail.com,subject3,1,2
2016-07-01 16:27:28,a@gmail.com,subject4,1,2
2016-07-01 10:17:05,b@gmail.com,subject3,0,0
2016-07-01 02:01:03,b@gmail.com,welcome,0,0
2016-07-01 14:45:05,b@gmail.com,subject3,0,0
2016-07-01 08:16:00,b@gmail.com,subject2,0,0
2016-07-01 17:00:00,b@gmail.com,subject4,0,0
"""

df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp'])

result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y'])
result['diff'] = (result['timestamp_y'] - result['timestamp']).abs()
idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin()
result = result.loc[idx].drop(['timestamp_y','diff'], axis=1)
result = result.sort_index()
print(result)

yields
             timestamp        email   subject  clicks  var1
0  2016-07-01 10:17:00  a@gmail.com  subject3       1     7
2  2016-07-01 02:01:02  a@gmail.com   welcome       1     1
4  2016-07-01 14:45:04  a@gmail.com  subject3       1     2
5  2016-07-01 08:14:02  a@gmail.com  subject2       2     2
6  2016-07-01 16:26:35  a@gmail.com  subject4       1     2
7  2016-07-01 10:17:00  b@gmail.com  subject3       0     0
9  2016-07-01 02:01:02  b@gmail.com   welcome       0     0
11 2016-07-01 14:45:04  b@gmail.com  subject3       0     0
12 2016-07-01 08:14:02  b@gmail.com  subject2       0     0
13 2016-07-01 16:26:35  b@gmail.com  subject4       0     0
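
As a possible alternative to the merge-then-idxmin approach above (this is not part of the original answer), pandas also provides pd.merge_asof, which can match on the nearest key within groups. The sketch below assumes a pandas version that supports direction="nearest" and uses only the subject3 rows from the question's data to stay short. Both frames are sorted by timestamp first because merge_asof requires sorted keys.

```python
from io import StringIO

import pandas as pd

# A subset of the question's data (subject3 rows only), to keep the sketch short.
a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject3
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 10:17:00,b@gmail.com,subject3
"""
b = """timestamp,email,subject,clicks,var1
2016-07-01 10:17:39,a@gmail.com,subject3,1,7
2016-07-01 14:46:01,a@gmail.com,subject3,1,2
2016-07-01 10:17:05,b@gmail.com,subject3,0,0
"""
df1 = pd.read_csv(StringIO(a), parse_dates=["timestamp"])
df2 = pd.read_csv(StringIO(b), parse_dates=["timestamp"])

# merge_asof requires both frames to be sorted by the `on` column.
left = df1.sort_values("timestamp")
right = df2.sort_values("timestamp")

# Match each left row to the right row with the nearest timestamp,
# restricted to rows that share the same email and subject.
nearest = pd.merge_asof(
    left,
    right,
    on="timestamp",
    by=["email", "subject"],
    direction="nearest",
)
print(nearest)
```

The result keeps the timestamps from df1 and the row order of the sorted left frame. If matches that are too far apart should be rejected, merge_asof also accepts a tolerance argument (for example a pd.Timedelta); whether that is appropriate depends on the data.
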

Regarding "python - Pandas merge data frames on the nearest timestamp", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38807890/
