
python - Compare two DataFrames and loop through them (to test a condition)

Reposted. Author: 行者123. Updated: 2023-12-05 09:11:07

I am trying to "join" two DataFrames based on a condition.

Condition:

if df1.Year == df2.Year and
   df1.Date >= df2.BeginDate and df1.Date <= df2.EndDate and
   df1.ID == df2.ID
# If the condition is True, I would like to add an extra (binary) column to df1, something like
# df1.condition = Yes or No.

My data looks like this:

df1: 

Year Week ID Date
2020 1 123 2020-01-01 00:00:00
2020 1 345 2020-01-01 00:00:00
2020 2 123 2020-01-07 00:00:00
2020 1 123 2020-01-01 00:00:00


df2:

Year BeginDate EndDate ID
2020 2020-01-01 00:00:00 2020-01-02 00:00:00 123
2020 2020-01-01 00:00:00 2020-01-02 00:00:00 123
2020 2020-01-01 00:00:00 2020-01-02 00:00:00 978
2020 2020-09-21 00:00:00 2020-01-02 00:00:00 978


end_df: #Expected output

Year Week ID Condition
2020 1 123 True #Year is matching, week1 is between the dates, ID is matching too
2019 1 345 False #Year is not matching
2020 2 187 False # ID is not matching
2020 1 123 True # Same as first row.

I thought of solving this by looping over both DataFrames:

for row in df1.iterrows():
    for row2 in df2.iterrows():
        if row['Year'] == row2['Year']:
            if row['ID'] == row2['ID']:
                .....
                .....
                row['Condition'] = True
            else:
                row['Condition'] = False

But... this leads to one error after another.

I am very curious how you would solve this. Thanks in advance!

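Update 1

I wrote a loop. However, it takes a very long time to run (and I am not sure how to add the values to a new column).

Note that I created a "Date" column in df1 (in the same format as the begin and end dates in df2).

Now the key question: how do I add the True value (at the end of the loop...) to my df1 (in an extra column)?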

for index, row in df1.iterrows():
    row['Year'] = str(row['Year'])

    for index1, row1 in df2.iterrows():
        row1['Year'] = str(row1['Year'])

        if row['Year'] == row1['Year']:
            row['ID'] = str(row['ID'])
            row1['ID'] = str(row1['ID'])

            if row['ID'] == row1['ID']:

                if row['Date'] >= row1['BeginDate'] and row['Date'] <= row1['EndDate']:
                    print("I would like to add this YES to df1 in an extra column")

Edit 2

Trying @davidbilla's solution: the 'condition' column does not look right. As you can see, it matches even when df1.Year != df2.Year. Note that df2 is sorted by ID (so all rows with the same unique number should be there).

Best answer

I guess you are expecting something like this, if you are trying to match the DataFrames row by row (i.e. compare row 1 of df1 with row 1 of df2):

import numpy as np

# `or` cannot combine two Series; use & and test both ends of the date range
df1['condition'] = np.where((df1['Year']==df2['Year'])&(df1['ID']==df2['ID'])&(df1['Date']>=df2['BeginDate'])&(df1['Date']<=df2['EndDate']), True, False)

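np.where takes the condition as its first argument; the second argument is the value used where the condition holds, and the third is the value used where it does not.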
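To make the row-by-row behaviour concrete, here is a minimal sketch on two small hypothetical frames (the names left and right and all values are made up for illustration; they are not the asker's data). It assumes both frames have the same length and the same default index, because comparing two Series element-wise requires identically labeled indexes.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
left = pd.DataFrame({
    "Year": [2020, 2020, 2019],
    "ID": [123, 345, 123],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2019-06-01"]),
})
right = pd.DataFrame({
    "Year": [2020, 2020, 2020],
    "ID": [123, 123, 123],
    "BeginDate": pd.to_datetime(["2020-01-01", "2020-01-01", "2019-01-01"]),
    "EndDate": pd.to_datetime(["2020-01-02", "2020-01-02", "2019-12-31"]),
})

# Row i of `left` is compared only with row i of `right`.
mask = (
    (left["Year"] == right["Year"])
    & (left["ID"] == right["ID"])
    & (left["Date"] >= right["BeginDate"])
    & (left["Date"] <= right["EndDate"])
)
left["condition"] = np.where(mask, True, False)
print(left["condition"].tolist())  # [True, False, False]
```

Since mask is already boolean, np.where(mask, True, False) yields the same values as mask itself; np.where earns its place when the two outcomes are other labels, for example np.where(mask, "Yes", "No").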

Edit 1: based on your sample dataset

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[2020,1,123],[2020,1,345],[2020,2,123],[2020,1,123]],
                   columns=['Year','Week','ID'])
df2 = pd.DataFrame([[2020,'2020-01-01 00:00:00','2020-01-02 00:00:00',123],
                    [2020,'2020-01-01 00:00:00','2020-01-02 00:00:00',123],
                    [2020,'2020-01-01 00:00:00','2020-01-02 00:00:00',978],
                    [2020,'2020-09-21 00:00:00','2020-01-02 00:00:00',978]],
                   columns=['Year','BeginDate','EndDate','ID'])
df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['condition'] = np.where((df1['Year']==df2['Year'])&(df1['ID']==df2['ID']),True, False)
# &(df1['Date']>=df2['BeginDate'])&(df1['Date']<=df2['EndDate']) - removed this condition as the df has no Date field
print(df1)

Output:

   Year  Week   ID  condition
0 2020 1 123 True
1 2020 1 345 False
2 2020 2 123 False
3 2020 1 123 False

Edit 2: compare each row of df1 against all rows of df2

df1['condition'] = (df1['Year'].isin(df2['Year']))&(df1['ID'].isin(df2['ID']))

This takes each value of df1['Year'] and checks whether it occurs anywhere in df2['Year'] (and likewise for ID).

Based on the sample dataset:
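A small sketch of what isin does here, using hypothetical frames a and b (values made up for illustration): each column is tested for membership independently, so a row can be flagged even when its Year and its ID come from different rows of the other frame.

```python
import pandas as pd

# Hypothetical data for illustration only.
a = pd.DataFrame({"Year": [2020, 2019, 2021], "ID": [123, 978, 123]})
b = pd.DataFrame({"Year": [2020, 2019], "ID": [978, 123]})

# Each test asks: does this value occur anywhere in the other column?
year_ok = a["Year"].isin(b["Year"])
id_ok = a["ID"].isin(b["ID"])
a["condition"] = year_ok & id_ok
print(a["condition"].tolist())  # [True, True, False]
```

Note that b contains neither the pair (2020, 123) nor the pair (2019, 978), yet the first two rows of a are flagged True. This approach therefore only fits when a match on each column separately is what is wanted.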

df1:

   Year       Date   ID  
0 2020 2020-01-01 123
1 2020 2020-01-01 345
2 2020 2020-10-01 123
3 2020 2020-11-13 123

df2:

   Year  BeginDate    EndDate   ID
0 2020 2020-01-01 2020-02-01 123
1 2020 2020-01-01 2020-01-02 123
2 2020 2020-03-01 2020-05-01 978
3 2020 2020-09-21 2020-10-01 978

Code changes:

date_range = list(zip(df2['BeginDate'], df2['EndDate']))

def check_date(date):
    # True if the date falls inside any (BeginDate, EndDate) interval of df2
    for (s, e) in date_range:
        if date >= s and date <= e:
            return True
    return False

df1['condition'] = (df1['Year'].isin(df2['Year'])) & (df1['ID'].isin(df2['ID']))
df1['date_compare'] = df1['Date'].apply(check_date)  # you can directly store this in df1['condition']. I just wanted to print the values so have used a new field
df1['condition'] = df1['condition'] & df1['date_compare']

Output:

   Year       Date   ID  condition  date_compare
0 2020 2020-01-01 123 True True # Year match, ID match and Date is within the range of df2 row 1
1 2020 2020-01-01 345 False True # Year match, ID no match
2 2020 2020-10-01 123 True True # Year match, ID match, Date is within range of df2 row 4
3 2020 2020-11-13 123 False False # Year match, ID match, but Date is not in range of any row in df2

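Edit 3: based on the updated question (previously I assumed it was fine if the three values Year, ID, and Date matched df2 in any rows, not necessarily the same row). I think I now have a better understanding of your requirement.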

df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['Date'] = pd.to_datetime(df1['Date'])

df1['condition'] = False
for idx1, row1 in df1.iterrows():
    match = False
    for idx2, row2 in df2.iterrows():
        # all tests must hold against the same row of df2
        if (row1['Year'] == row2['Year']) and \
           (row1['ID'] == row2['ID']) and \
           (row1['Date'] >= row2['BeginDate']) and \
           (row1['Date'] <= row2['EndDate']):
            match = True
            break  # one matching row is enough
    df1.at[idx1, 'condition'] = match
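The nested loop visits every pair of rows, which can be slow on larger frames. One possible vectorized alternative, which is not part of the original answer, is sketched below: merge on Year and ID, keep the pairs whose Date falls inside the interval, and flag the df1 rows that have at least one such pair. The function name flag_matches and the helper column _row are hypothetical; the sample frames reuse the "set 1" data from this answer.

```python
import pandas as pd

def flag_matches(df1, df2):
    """Return a boolean Series (indexed like df1) marking rows of df1
    that match some row of df2 on Year, ID and the date interval."""
    left = df1[["Year", "ID", "Date"]].copy()
    left["_row"] = df1.index  # remember which df1 row each pair came from
    pairs = left.merge(
        df2[["Year", "ID", "BeginDate", "EndDate"]],
        on=["Year", "ID"],
        how="inner",
    )
    # between() is inclusive on both ends by default
    in_range = pairs["Date"].between(pairs["BeginDate"], pairs["EndDate"])
    matched_rows = pairs.loc[in_range, "_row"].unique()
    return pd.Series(df1.index.isin(matched_rows), index=df1.index)

df1 = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020, 2020],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01",
                            "2020-01-10", "2020-11-13"]),
    "ID": [123, 123, 345, 123, 123],
})
df2 = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020],
    "BeginDate": pd.to_datetime(["2020-01-15", "2020-01-01",
                                 "2020-03-01", "2020-09-21"]),
    "EndDate": pd.to_datetime(["2020-02-01", "2020-01-02",
                               "2020-05-01", "2020-10-01"]),
    "ID": [123, 123, 978, 978],
})

df1["condition"] = flag_matches(df1, df2)
print(df1["condition"].tolist())  # [True, True, False, False, False]
```

The merge can create many intermediate rows when the same Year and ID combination appears often in both frames, so this sketch trades memory for speed.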

Output - set 1:

DF1:

   Year       Date   ID
0 2020 2020-01-01 123
1 2020 2020-01-01 123
2 2020 2020-01-01 345
3 2020 2020-01-10 123
4 2020 2020-11-13 123

DF2:

   Year  BeginDate    EndDate   ID
0 2020 2020-01-15 2020-02-01 123
1 2020 2020-01-01 2020-01-02 123
2 2020 2020-03-01 2020-05-01 978
3 2020 2020-09-21 2020-10-01 978

DF1 result:

   Year       Date   ID  condition
0 2020 2020-01-01 123 True
1 2020 2020-01-01 123 True
2 2020 2020-01-01 345 False
3 2020 2020-01-10 123 False
4 2020 2020-11-13 123 False

Output - set 2:

DF1:

   Year       Date        ID
0 2019 2019-01-01 s904112
1 2019 2019-01-01 s911243
2 2019 2019-01-01 s917131
3 2019 2019-01-01 sp986214
4 2019 2019-01-01 s510006
5 2020 2020-01-10 s540006

DF2:

   Year  BeginDate    EndDate       ID
0 2020 2020-01-27 2020-09-02 s904112
1 2020 2020-01-27 2020-09-02 s904112
2 2020 2020-01-03 2020-03-15 s904112
3 2020 2020-04-15 2020-01-05 s904112
4 2020 2020-01-05 2020-05-15 s540006
5 2019 2019-01-05 2019-05-15 s904112

DF1 result:

   Year       Date        ID  condition
0 2019 2019-01-01 s904112 False
1 2019 2019-01-01 s911243 False
2 2019 2019-01-01 s917131 False
3 2019 2019-01-01 sp986214 False
4 2019 2019-01-01 s510006 False
5 2020 2020-01-10 s540006 True

Regarding "python - Compare two DataFrames and loop through them (to test a condition)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60551236/
