python - Query rows of one DataFrame based on rows in another DataFrame and compare values

Reposted. Author: 行者123. Updated: 2023-12-05 05:36:08
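So I have two DataFrames. The first contains numeric data used to "score" the second, which contains simulation data.

df1 = base records

df2 = simulation records

Part 1: I want to query df1 (base records) for the row whose timestamp is nearest to the timestamp of a row in df2 (simulation records), where the "Name" and "Time" columns match exactly.

Part 2: I then want to use an if/then function to determine whether the value in the simulation record row falls within a range built from two values in the base record row, and return a bool.

low range = df1['Po'] - df1['Ref']

high range = df1['Po'] + df1['Ref']

If df2['Sim'] falls between the low range and high range of its nearest df1 base record, I want to return True in a new column "Sim Score"; otherwise False.

Part 3: I want to repeat Parts 1 and 2 for every row in the simulation records.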
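The range check in Part 2 can be sketched for a single pair of rows as follows. The function name in_range is hypothetical, and treating both ends of the range as inclusive is an assumption, since the question does not say whether the bounds count.

```python
def in_range(sim, po, ref):
    """Return True if sim lies within [po - ref, po + ref].

    Assumption: both bounds are inclusive.
    """
    low = po - ref   # low range
    high = po + ref  # high range
    return low <= sim <= high


# Sim = 7 against a base record with Po = 4, Ref = 4 (range 0 to 8)
first = in_range(7, po=4, ref=4)
# Sim = 8 against a base record with Po = 5, Ref = 2 (range 3 to 7)
second = in_range(8, po=5, ref=2)
print(first, second)
```

Applied row by row this would be slow on large frames, which is why a vectorized comparison over whole columns is preferable once the nearest rows have been lined up.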

Useful information:

  • df1 (base records) may have more or fewer rows than df2 (simulation records)
  • df1 has more columns than df2
  • Some columns in df1 have the same name as columns in df2 but hold different values
  • Ideally I would like to slice both DataFrames so that the if/then function only sees the two rows being compared
  • Only the nearest df1 base record needs to be compared with each df2 simulation record
  • Previously I did this in Google Sheets with a combined IF and QUERY formula dragged down the whole sheet (I want to replace that with Python and pandas)

df1 base records example (columns that matter)

Timestamp           Name   Time     Po  Ref
7/11/2022 11:30:00  trial  20 mins  5   2
7/10/2022 04:00:00  trial  20 mins  4   4
7/09/2022 02:45:00  trial  20 mins  2   2
6/28/2022 03:45:00  trial  20 mins  3   6

df2 simulation records example (columns that matter)

Timestamp           Name   Time     Sim
7/10/2022 05:15:00  trial  20 mins  7
7/11/2022 12:45:00  trial  20 mins  4
7/12/2022 03:30:00  trial  20 mins  8

desired result of new column added to df2

Timestamp           Name   Time     Sim  Sim Score
7/10/2022 05:15:00  trial  20 mins  7    True
7/11/2022 12:45:00  trial  20 mins  4    True
7/12/2022 03:30:00  trial  20 mins  8    False
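To make the example reproducible, the two sample tables above can be built as DataFrames roughly like this. The values are copied from the tables; only the columns shown there are included, whereas the real df1 reportedly has more.

```python
import pandas as pd

# Base records (df1), values taken from the example table
df1 = pd.DataFrame({
    'Timestamp': ['7/11/2022 11:30:00', '7/10/2022 04:00:00',
                  '7/09/2022 02:45:00', '6/28/2022 03:45:00'],
    'Name': ['trial'] * 4,
    'Time': ['20 mins'] * 4,
    'Po': [5, 4, 2, 3],
    'Ref': [2, 4, 2, 6],
})

# Simulation records (df2), values taken from the example table
df2 = pd.DataFrame({
    'Timestamp': ['7/10/2022 05:15:00', '7/11/2022 12:45:00',
                  '7/12/2022 03:30:00'],
    'Name': ['trial'] * 3,
    'Time': ['20 mins'] * 3,
    'Sim': [7, 4, 8],
})

print(df1)
print(df2)
```

Timestamp is left as a string column here, so it still has to be converted with pd.to_datetime before any nearest-timestamp lookup.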

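Best answer

Use pandas.DataFrame.reindex with method='nearest'. This needs an index on which distance can be computed (for example, distance cannot be computed between strings).

Or use merge_asof with direction='nearest'.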





Method 1:

reindex() with method='nearest'

import pandas as pd

# df1 = base records, df2 = simulation records (as in the question)
# Parse the timestamps and use them as the index so reindex() can measure distance
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
# Low and high ends of the acceptable range
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
print(df1)
###
#                       Name     Time  Po  Ref  l_r  h_r
# Timestamp
# 2022-07-11 11:30:00  trial  20 mins   5    2    3    7
# 2022-07-10 04:00:00  trial  20 mins   4    4    0    8
# 2022-07-09 02:45:00  trial  20 mins   2    2    0    4
# 2022-06-28 03:45:00  trial  20 mins   3    6   -3    9

df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
#                       Name     Time  Sim
# Timestamp
# 2022-07-10 05:15:00  trial  20 mins    7
# 2022-07-11 12:45:00  trial  20 mins    4
# 2022-07-12 03:30:00  trial  20 mins    8

# For each df2 timestamp, take the df1 row with the nearest timestamp,
# then join the two frames side by side on the index
temp = df2.join(df1.reindex(df2.index, method='nearest'), lsuffix='_left', rsuffix='_right')
print(temp)

This is essentially df2.join(df1), which the documentation describes as a way to

join multiple DataFrame objects by index at once.

With method='nearest', each df2 row is joined to the df1 row whose Timestamp index is nearest.


# True where Sim lies within [l_r, h_r]; between() includes both bounds by default
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
#             Timestamp   Name     Time  Sim  Sim Score
# 0 2022-07-10 05:15:00  trial  20 mins    7       True
# 1 2022-07-11 12:45:00  trial  20 mins    4       True
# 2 2022-07-12 03:30:00  trial  20 mins    8      False
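To see what method='nearest' does in isolation, here is a minimal sketch on toy data. The Series s, the targets and the 3-day tolerance are all made up for illustration; tolerance is an optional reindex parameter that the answer does not use, shown here only as one way to avoid matching a base record that is very far away in time.

```python
import pandas as pd

# Toy base series indexed by date (hypothetical values)
s = pd.Series(
    [10, 20, 30],
    index=pd.to_datetime(['2022-01-01', '2022-01-05', '2022-01-10']),
)

# Timestamps we want to look up; none of them exist in s.index
targets = pd.to_datetime(['2022-01-02', '2022-01-08', '2022-01-20'])

# Each target takes the value at the closest existing index label
nearest = s.reindex(targets, method='nearest')
print(nearest)

# Same lookup, but labels further than 3 days away are left as NaN
limited = s.reindex(targets, method='nearest', tolerance=pd.Timedelta('3 days'))
print(limited)
```

Note that reindex with a fill method expects the index being searched to be monotonic (sorted ascending or descending), which holds for the df1 example above.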




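Method 2:

merge_asof() with direction='nearest'. This approach does not work on index values, so we do not have to set the Timestamp column as the index. It does, however, require both DataFrames to be sorted by the merge key (here, the Timestamp column).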

import pandas as pd

# Start again from the original df1 and df2, with Timestamp as a regular column
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
# merge_asof requires the key column to be sorted
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
#             Timestamp   Name     Time  Po  Ref  l_r  h_r
# 3 2022-06-28 03:45:00  trial  20 mins   3    6   -3    9
# 2 2022-07-09 02:45:00  trial  20 mins   2    2    0    4
# 1 2022-07-10 04:00:00  trial  20 mins   4    4    0    8
# 0 2022-07-11 11:30:00  trial  20 mins   5    2    3    7

df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
#             Timestamp   Name     Time  Sim
# 0 2022-07-10 05:15:00  trial  20 mins    7
# 1 2022-07-11 12:45:00  trial  20 mins    4
# 2 2022-07-12 03:30:00  trial  20 mins    8

# Only the key and the range columns are needed from df1
temp = pd.merge_asof(df2, df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
print(temp)

This is pd.merge_asof(df2, df1). Quoting the documentation:

This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.

For each row in the left DataFrame:

A "nearest" search selects the row in the right DataFrame whose 'on' key is closest in absolute distance to the left's key.

# temp has the same row order as the sorted df2, so the values line up by position
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
#             Timestamp   Name     Time  Sim  Sim Score
# 0 2022-07-10 05:15:00  trial  20 mins    7       True
# 1 2022-07-11 12:45:00  trial  20 mins    4       True
# 2 2022-07-12 03:30:00  trial  20 mins    8      False

Frankly, if you have a large dataset, working with the index will be faster.
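To illustrate how direction='nearest' differs from the other direction values, and what the sorting requirement means in practice, here is a small sketch on toy data. The frames left and right and the columns t, x and y are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({
    't': pd.to_datetime(['2022-01-04', '2022-01-08']),
    'x': [1, 2],
})
right = pd.DataFrame({
    't': pd.to_datetime(['2022-01-01', '2022-01-05', '2022-01-10']),
    'y': [10, 20, 30],
})

# Compare the y value picked for each left row under each direction
results = {}
for direction in ('backward', 'forward', 'nearest'):
    merged = pd.merge_asof(left, right, on='t', direction=direction)
    results[direction] = merged['y'].tolist()
    print(direction, results[direction])

# Passing an unsorted right frame is rejected rather than silently mis-matched
try:
    pd.merge_asof(left, right.iloc[::-1], on='t', direction='nearest')
    unsorted_error = None
except ValueError as exc:
    unsorted_error = str(exc)
print(unsorted_error)
```

'backward' only looks at earlier-or-equal keys and 'forward' only at later-or-equal keys, so 'nearest' is the one that matches the question's requirement of the closest base record in either direction.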




Method 2 (multiple keys)

I modified df1 to add different Name and Time values.

import pandas as pd

# Base records with more than one Name/Time combination
df1 = pd.DataFrame({
    'Timestamp': ['7/11/2022 11:30:00', '7/11/2022 11:30:00', '7/10/2022 04:00:00',
                  '7/10/2022 04:00:00', '7/09/2022 02:45:00', '6/28/2022 03:45:00'],
    'Name': ['trial', 'trial', 'trial', 'non-trial', 'trial', 'trial'],
    'Time': ['20 mins', '30 mins', '20 mins', '20 mins', '20 mins', '20 mins'],
    'Po': [5, 6, 4, 1, 2, 3],
    'Ref': [2, 2, 4, 3, 2, 6],
})
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
#             Timestamp       Name     Time  Po  Ref  l_r  h_r
# 5 2022-06-28 03:45:00      trial  20 mins   3    6   -3    9
# 4 2022-07-09 02:45:00      trial  20 mins   2    2    0    4
# 2 2022-07-10 04:00:00      trial  20 mins   4    4    0    8
# 3 2022-07-10 04:00:00  non-trial  20 mins   1    3   -2    4
# 0 2022-07-11 11:30:00      trial  20 mins   5    2    3    7
# 1 2022-07-11 11:30:00      trial  30 mins   6    2    4    8

# df2 is unchanged: Timestamp already parsed and sorted
print(df2)
###
#             Timestamp   Name     Time  Sim
# 0 2022-07-10 05:15:00  trial  20 mins    7
# 1 2022-07-11 12:45:00  trial  20 mins    4
# 2 2022-07-12 03:30:00  trial  20 mins    8

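Important:

merge_asof accepts only a single key in on=, so the remaining keys are passed through by=. Rows are first matched exactly on the by= columns, and the nearest search on Timestamp then runs within each group.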

# Name and Time must match exactly; Timestamp is matched to the nearest value
temp = pd.merge_asof(df2, df1[['Timestamp', 'Name', 'Time', 'l_r', 'h_r']], on='Timestamp', by=['Name', 'Time'], direction='nearest')
print(temp)

df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
#             Timestamp   Name     Time  Sim  Sim Score
# 0 2022-07-10 05:15:00  trial  20 mins    7       True
# 1 2022-07-11 12:45:00  trial  20 mins    4       True
# 2 2022-07-12 03:30:00  trial  20 mins    8      False
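The multi-key steps above can be wrapped into one function that leaves its inputs untouched and keeps the original row order of the simulation records. This is only a sketch: add_sim_score and the helper column _pos are hypothetical names, and the non-trial and other simulation rows are invented to show exact matching on Name and Time. A simulation row with no base record in its Name/Time group gets NaN bounds, so between() scores it False.

```python
import pandas as pd


def add_sim_score(base, sim, by=('Name', 'Time')):
    """Score each sim row against the nearest base row in the same by-group."""
    by = list(by)

    base = base.copy()
    base['Timestamp'] = pd.to_datetime(base['Timestamp'])
    base['l_r'] = base['Po'] - base['Ref']
    base['h_r'] = base['Po'] + base['Ref']
    base = base.sort_values('Timestamp')  # merge_asof needs sorted keys

    out = sim.copy()
    out['Timestamp'] = pd.to_datetime(out['Timestamp'])
    # Remember the original position so the order can be restored after sorting
    left = out.assign(_pos=list(range(len(out)))).sort_values('Timestamp')

    merged = pd.merge_asof(
        left,
        base[['Timestamp'] + by + ['l_r', 'h_r']],
        on='Timestamp',
        by=by,
        direction='nearest',
    ).sort_values('_pos')

    # NaN bounds (no base record in the group) compare as False
    out['Sim Score'] = merged['Sim'].between(merged['l_r'], merged['h_r']).to_numpy()
    return out


base = pd.DataFrame({
    'Timestamp': ['7/11/2022 11:30:00', '7/11/2022 11:30:00', '7/10/2022 04:00:00',
                  '7/10/2022 04:00:00', '7/09/2022 02:45:00', '6/28/2022 03:45:00'],
    'Name': ['trial', 'trial', 'trial', 'non-trial', 'trial', 'trial'],
    'Time': ['20 mins', '30 mins', '20 mins', '20 mins', '20 mins', '20 mins'],
    'Po': [5, 6, 4, 1, 2, 3],
    'Ref': [2, 2, 4, 3, 2, 6],
})
# Deliberately unsorted; the last two rows are invented for illustration
sim = pd.DataFrame({
    'Timestamp': ['7/12/2022 03:30:00', '7/10/2022 05:15:00',
                  '7/10/2022 05:00:00', '7/10/2022 05:00:00'],
    'Name': ['trial', 'trial', 'non-trial', 'other'],
    'Time': ['20 mins'] * 4,
    'Sim': [8, 7, 3, 3],
})

scored = add_sim_score(base, sim)
print(scored)
```

Working on copies avoids the inplace=True mutations used earlier, so the function can be called repeatedly on the same inputs.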

References:
pandas.DataFrame.join
pandas.merge_asof
merging/join concept

Regarding "python - Query rows of one DataFrame based on rows in another DataFrame and compare values", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/73346402/
