
python - How to left join 2 DataFrames in Python, joining with the first row when several rows of the second DataFrame match after filtering

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:14:23

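I have 2 DataFrames, each with a column of datetime dtype. I want to join them with the second DataFrame under the following conditions:

  1. Find the rows of the first DataFrame whose datetime value lies between the second DataFrame's datetime value and 10 minutes before it.

  2. If there are several such rows, take the first one.

  3. If there is no such row, fill with empty or null.

  4. A row can be merged only once.

This is how I am doing it now. I would like to know whether there is a better way that reduces the total run time.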

from datetime import datetime
import datetime as dt
import pandas as pd


df1 = pd.DataFrame(columns = ['Enter_Time', 'Unique_Id'])
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:29:00','%Y-%m-%d %H:%M:%S'), 'A']
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:30:00','%Y-%m-%d %H:%M:%S'), 'B']
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:31:00','%Y-%m-%d %H:%M:%S'), 'C']
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:32:00','%Y-%m-%d %H:%M:%S'), 'D']
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 06:33:00','%Y-%m-%d %H:%M:%S'), 'E']
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:29:00','%Y-%m-%d %H:%M:%S'), 'F']
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:30:00','%Y-%m-%d %H:%M:%S'), 'G']
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:31:00','%Y-%m-%d %H:%M:%S'), 'H']
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:32:00','%Y-%m-%d %H:%M:%S'), 'I']
df1.loc[len(df1)] = [datetime.strptime('2018-10-01 08:33:00','%Y-%m-%d %H:%M:%S'), 'j']


df2 = pd.DataFrame(columns = ['Transaction_Time', 'Amount'])
df2.loc[len(df2)] = [datetime.strptime('2018-10-01 06:40:00','%Y-%m-%d %H:%M:%S'), 10.25]
df2.loc[len(df2)] = [datetime.strptime('2018-10-01 07:40:00','%Y-%m-%d %H:%M:%S'), 3.96]
df2.loc[len(df2)] = [datetime.strptime('2018-10-01 08:31:00','%Y-%m-%d %H:%M:%S'), 9.65]
df2.loc[len(df2)] = [datetime.strptime('2018-10-01 08:32:00','%Y-%m-%d %H:%M:%S'), 2.84]

df3 = pd.DataFrame(columns = ['Transaction_Time', 'Amount', 'Enter_Time', 'Unique_Id'])

# For each transaction, look for df1 rows in the 10 minutes before it.
for id, row in df2.iterrows():
    Transaction_Time = row['Transaction_Time']
    Transaction_Time_Before = Transaction_Time - dt.timedelta(seconds = 600)
    Result_Row = {
        'Transaction_Time' : row['Transaction_Time'],
        'Amount' : row['Amount'],
        'Enter_Time' : '',
        'Unique_Id' : ''
    }

    dfFiletered = df1[(df1["Enter_Time"] < Transaction_Time) & (df1["Enter_Time"] >= Transaction_Time_Before)].sort_values(by= ['Enter_Time'],ascending=True)
    if len(dfFiletered) > 0:
        # Take the earliest candidate, then drop it from df1 so it cannot be merged again.
        firstRow = dfFiletered.iloc[0]
        Result_Row['Enter_Time'] = firstRow['Enter_Time']
        Result_Row['Unique_Id'] = firstRow['Unique_Id']
        df1.drop(df1[df1["Unique_Id"] == firstRow['Unique_Id']].index, inplace=True)
    df3.loc[len(df3)] = Result_Row
print(df3)

Best answer

You can use merge_asof():

pd.merge_asof(df1,
              df2,
              left_on='Enter_Time',
              right_on='Transaction_Time',
              tolerance=pd.Timedelta('10m'),
              direction='forward')

which produces:

#           Enter_Time Unique_Id    Transaction_Time  Amount
#0 2018-10-01 06:29:00         A                 NaT     NaN
#1 2018-10-01 06:30:00         B 2018-10-01 06:40:00   10.25
#2 2018-10-01 06:31:00         C 2018-10-01 06:40:00   10.25
#3 2018-10-01 06:32:00         D 2018-10-01 06:40:00   10.25
#4 2018-10-01 06:33:00         E 2018-10-01 06:40:00   10.25
#5 2018-10-01 08:29:00         F 2018-10-01 08:31:00    9.65
#6 2018-10-01 08:30:00         G 2018-10-01 08:31:00    9.65
#7 2018-10-01 08:31:00         H 2018-10-01 08:31:00    9.65
#8 2018-10-01 08:32:00         I 2018-10-01 08:32:00    2.84
#9 2018-10-01 08:33:00         j                 NaT     NaN
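
The question's first condition uses a strict upper bound (Enter_Time < Transaction_Time), while merge_asof accepts exact matches by default, which is why row H pairs with the 08:31:00 transaction above. A minimal sketch, assuming the strict bound is what is wanted: passing allow_exact_matches=False excludes equal timestamps. The frames are rebuilt compactly with pd.to_datetime only so the block runs on its own, and the name strict is made up for this illustration.

```python
import pandas as pd

# Same sample data as in the question, built in one step.
df1 = pd.DataFrame({
    "Enter_Time": pd.to_datetime([
        "2018-10-01 06:29:00", "2018-10-01 06:30:00", "2018-10-01 06:31:00",
        "2018-10-01 06:32:00", "2018-10-01 06:33:00", "2018-10-01 08:29:00",
        "2018-10-01 08:30:00", "2018-10-01 08:31:00", "2018-10-01 08:32:00",
        "2018-10-01 08:33:00",
    ]),
    "Unique_Id": list("ABCDEFGHIj"),
})
df2 = pd.DataFrame({
    "Transaction_Time": pd.to_datetime([
        "2018-10-01 06:40:00", "2018-10-01 07:40:00",
        "2018-10-01 08:31:00", "2018-10-01 08:32:00",
    ]),
    "Amount": [10.25, 3.96, 9.65, 2.84],
})

# Forward search within 10 minutes, but never pair identical timestamps.
strict = pd.merge_asof(
    df1,
    df2,
    left_on="Enter_Time",
    right_on="Transaction_Time",
    tolerance=pd.Timedelta("10min"),
    direction="forward",
    allow_exact_matches=False,
)
print(strict)
```

With this setting H (08:31:00) pairs with the 08:32:00 transaction and I (08:32:00) is left unmatched. B (06:30:00) still matches the 06:40:00 transaction because the tolerance bound itself is inclusive.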

To keep only the first match, use:

import numpy as np

df = pd.merge_asof(df1,
                   df2,
                   left_on='Enter_Time',
                   right_on='Transaction_Time',
                   tolerance=pd.Timedelta('10m'),
                   direction='forward')

# Blank out the transaction columns on every repeat of an already matched transaction.
df.loc[df.duplicated(['Transaction_Time', 'Amount']), ['Transaction_Time', 'Amount']] = (np.nan, np.nan)
df
#           Enter_Time Unique_Id    Transaction_Time  Amount
#0 2018-10-01 06:29:00         A                 NaT     NaN
#1 2018-10-01 06:30:00         B 2018-10-01 06:40:00   10.25
#2 2018-10-01 06:31:00         C                 NaT     NaN
#3 2018-10-01 06:32:00         D                 NaT     NaN
#4 2018-10-01 06:33:00         E                 NaT     NaN
#5 2018-10-01 08:29:00         F 2018-10-01 08:31:00    9.65
#6 2018-10-01 08:30:00         G                 NaT     NaN
#7 2018-10-01 08:31:00         H                 NaT     NaN
#8 2018-10-01 08:32:00         I 2018-10-01 08:32:00    2.84
#9 2018-10-01 08:33:00         j                 NaT     NaN
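
If the result should be shaped like df2 (one row per transaction, as in the question's df3), one possible follow-up is to keep the first df1 row per transaction and left-join that back onto df2. This is only a sketch under that assumption: the names asof, first_match and result are illustrative, Transaction_Time is assumed to be unique in df2, and the frames are rebuilt so the block runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Enter_Time": pd.to_datetime([
        "2018-10-01 06:29:00", "2018-10-01 06:30:00", "2018-10-01 06:31:00",
        "2018-10-01 06:32:00", "2018-10-01 06:33:00", "2018-10-01 08:29:00",
        "2018-10-01 08:30:00", "2018-10-01 08:31:00", "2018-10-01 08:32:00",
        "2018-10-01 08:33:00",
    ]),
    "Unique_Id": list("ABCDEFGHIj"),
})
df2 = pd.DataFrame({
    "Transaction_Time": pd.to_datetime([
        "2018-10-01 06:40:00", "2018-10-01 07:40:00",
        "2018-10-01 08:31:00", "2018-10-01 08:32:00",
    ]),
    "Amount": [10.25, 3.96, 9.65, 2.84],
})

# Step 1: pair every df1 row with the next transaction within 10 minutes.
asof = pd.merge_asof(
    df1,
    df2,
    left_on="Enter_Time",
    right_on="Transaction_Time",
    tolerance=pd.Timedelta("10min"),
    direction="forward",
)

# Step 2: drop unmatched rows and keep the earliest df1 row per transaction.
first_match = (
    asof.dropna(subset=["Transaction_Time"])
        .drop_duplicates(subset="Transaction_Time", keep="first")
        [["Transaction_Time", "Enter_Time", "Unique_Id"]]
)

# Step 3: left join onto df2 so transactions without a match stay in the result.
result = df2.merge(first_match, on="Transaction_Time", how="left")
print(result)
```

On the sample data this pairs the four transactions with B, no row, F and I. Each df1 row is only ever offered to its nearest following transaction, so the last pairing differs from the loop in the question, which hands G to the 08:32:00 transaction.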

Edit

To merge df2 with df1 instead, I think you need to keep the default direction ('backward'):

df = pd.merge_asof(df2,
                   df1,
                   left_on='Transaction_Time',
                   right_on='Enter_Time',
                   tolerance=pd.Timedelta('10m'))

df.loc[df.duplicated(['Transaction_Time', 'Amount']), ['Transaction_Time', 'Amount']] = (np.nan, np.nan)
#     Transaction_Time  Amount          Enter_Time Unique_Id
#0 2018-10-01 06:40:00   10.25 2018-10-01 06:33:00         E
#1 2018-10-01 07:40:00    3.96                 NaT       NaN
#2 2018-10-01 08:31:00    9.65 2018-10-01 08:31:00         H
#3 2018-10-01 08:32:00    2.84 2018-10-01 08:32:00         I

The duplicated step does not change anything for your example, but it would solve the problem described in the question.
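
merge_asof pairs every left row with at most one nearest right row, so it has no notion of an Enter_Time row being used up: in the output above E, H and I are chosen, whereas the loop in the question picks B, F and G. If the four rules in the question must hold exactly, one option is a single pass over both frames in sorted order. The following is a sketch of a swapped-in two-pointer technique, not part of the answer; match_first_unused and its window parameter are hypothetical names, and it assumes transactions are processed in ascending time order.

```python
import pandas as pd


def match_first_unused(df1, df2, window=pd.Timedelta("10min")):
    """For each transaction, take the earliest unused df1 row with
    Transaction_Time - window <= Enter_Time < Transaction_Time."""
    left = df2.sort_values("Transaction_Time").reset_index(drop=True)
    right = df1.sort_values("Enter_Time").reset_index(drop=True)
    enter = right["Enter_Time"].tolist()

    picks = []  # position in `right` for each transaction, -1 if none
    i = 0       # first df1 row that is neither used nor too old
    for t in left["Transaction_Time"]:
        lo = t - window
        # Rows older than this window can never match a later transaction.
        while i < len(enter) and enter[i] < lo:
            i += 1
        if i < len(enter) and enter[i] < t:
            picks.append(i)
            i += 1  # consume the row so it is merged only once
        else:
            picks.append(-1)

    # -1 is not a valid label, so reindex yields an all-null row there.
    matched = right.reindex(picks).reset_index(drop=True)
    return pd.concat([left, matched], axis=1)


df1 = pd.DataFrame({
    "Enter_Time": pd.to_datetime([
        "2018-10-01 06:29:00", "2018-10-01 06:30:00", "2018-10-01 06:31:00",
        "2018-10-01 06:32:00", "2018-10-01 06:33:00", "2018-10-01 08:29:00",
        "2018-10-01 08:30:00", "2018-10-01 08:31:00", "2018-10-01 08:32:00",
        "2018-10-01 08:33:00",
    ]),
    "Unique_Id": list("ABCDEFGHIj"),
})
df2 = pd.DataFrame({
    "Transaction_Time": pd.to_datetime([
        "2018-10-01 06:40:00", "2018-10-01 07:40:00",
        "2018-10-01 08:31:00", "2018-10-01 08:32:00",
    ]),
    "Amount": [10.25, 3.96, 9.65, 2.84],
})

out = match_first_unused(df1, df2)
print(out)
```

Because both frames are walked once after sorting, the matching step is linear in the number of rows, and on the sample data it reproduces the pairings of the original loop (B, no match, F, G).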

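Regarding "python - How to left join 2 DataFrames in Python, joining with the first row when several rows of the second DataFrame match after filtering", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52972172/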
