
python-3.x - How can I efficiently iterate over a pandas DataFrame row by row while running code on each row that looks at values in previous rows?

Reposted. Author: 行者123. Updated: 2023-12-04 17:49:46

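I am looking for an efficient way to iterate over a DataFrame and, for each row, run code that acts on values in past or future rows.

I iterate row by row (with a for loop) over a datetime-indexed DataFrame that can have more than 200,000 rows. Depending on the value in one of two columns (Bi and Icats), I set a value in a third column (To_set). The code run on each row contains a condition that uses the current index and a timedelta to look up a value in earlier rows (in column Bi).

Currently, looping over the DataFrame takes a very long time, and I would like to know whether a faster or more elegant approach exists.

The DataFrame the code loops over has three columns (Bi, Icats, To_set); an excerpt of df is shown below.

Note: my code has already looped over df and set the values in the 'To_set' column. Rows without a value show None because I initially initialized the column with None rather than pd.np.nan.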

                        Bi     Icats     To_set
2014-11-28 10:17:00 NaN NaN None
2014-11-28 10:30:00 NaN 0.040220 0.04022
2014-11-28 10:32:00 NaN NaN None
2014-11-28 10:35:00 0.217 NaN 0.217
2014-11-28 10:38:00 0.365 NaN 0.365
2014-11-28 10:44:00 0.227 NaN 0.227
2014-11-28 10:45:00 NaN 0.040220 None
2014-11-28 10:47:00 0.149 NaN 0.149
2014-11-28 10:50:00 0.109 NaN 0.109
2014-11-28 10:56:00 NaN NaN None
2014-11-28 10:59:00 0.065 NaN 0.065
2014-11-28 11:00:00 NaN 0.063687 None
2014-11-28 11:14:00 NaN NaN None
2014-11-28 11:15:00 NaN 0.047007 0.0470067
2014-11-28 11:30:00 NaN 0.041165 0.041165
2014-11-28 11:35:00 NaN NaN None
2014-11-28 11:45:00 NaN 0.040600 0.0406
2014-11-28 12:00:00 NaN 0.039667 0.0396667
2014-11-28 12:15:00 NaN 0.039460 0.03946
2014-11-28 12:30:00 NaN 0.038955 0.038955

The code that currently performs the loop looks like this:

Note: the column index of 'Bi' is 3, 'Icats' is 4, and 'To_set' is 5.

import datetime

import numpy as np
import pandas as pd

df['New'] = np.nan  # pd.np is no longer available in recent pandas; use numpy directly

for i in range(len(df)):
    if pd.notnull(df.iloc[i, 3]):
        # if there is a value in Bi, always take it
        df.iloc[i, 5] = df.iloc[i, 3]
        continue
    if pd.notnull(df.iloc[i, 4]):
        # take the Icats value only if there was no Bi value in the
        # past 10 mins
        # --> find the index of the last Bi value with last_valid_index() (a),
        # and if the timedelta between (a) and i > 10 mins, take the Icats value
        try:
            if df.iloc[:i, 3].last_valid_index() < (df.index[i].to_pydatetime() -
                                                    datetime.timedelta(minutes=10)):
                # the last Bi value is older than the 10-minute cutoff:
                # take the current Icats value
                df.iloc[i, 5] = df.iloc[i, 4]
        except TypeError:
            # the try statement is needed because, until the loop reaches the
            # first real value in Bi, last_valid_index() returns None and the
            # comparison above raises an error
            df.iloc[i, 5] = df.iloc[i, 4]

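Is there a better or more elegant way to iterate over a DataFrame row by row so that I can access values in previous or following rows? I know there is something like df.itertuples(), but I don't think it lets me look at previous rows.

Edit:

I rewrote the code so that it no longer needs to look at previous rows and instead keeps all the information it needs from previous rows in variables. Obviously, this runs much faster. This way I can probably use df.itertuples() to speed the code up further. However, my original question remains: is there an elegant way to iterate over a DataFrame while using values from previous rows in conditional statements?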

firstBiValueMet = False
for i in range(len(df)):
    if pd.notnull(df.iloc[i, 3]):
        # save time for future calculations
        firstBiValueMet = True
        lastTime = df.index[i].to_pydatetime()
        # if there is a value in Bi, always take it
        df.iloc[i, 5] = df.iloc[i, 3]
        continue
    if pd.notnull(df.iloc[i, 4]) and not firstBiValueMet:
        # no Bi value seen yet: take the Icats value anyway
        df.iloc[i, 5] = df.iloc[i, 4]
    if (pd.notnull(df.iloc[i, 4]) and firstBiValueMet
            and df.index[i] - lastTime > datetime.timedelta(minutes=10)):
        # take the Icats value only if there was no Bi value in the
        # past 10 mins
        df.iloc[i, 5] = df.iloc[i, 4]
    if i % 15000 == 0:
        # progress indicator
        print(i)
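
The bookkeeping described in the edit above could be sketched with df.itertuples() roughly as follows. This is only an illustration under assumptions: the function name fill_to_set, the window parameter, and the small sample frame are hypothetical and not part of the original post, and missing results are represented as NaN rather than None.

```python
import numpy as np
import pandas as pd


def fill_to_set(df, window=pd.Timedelta(minutes=10)):
    """Hypothetical helper: same rule as the loop above, via itertuples()."""
    last_bi_time = None  # timestamp of the most recent row that had a Bi value
    out = []
    for row in df[['Bi', 'Icats']].itertuples():
        if pd.notna(row.Bi):
            # a Bi value always wins; remember when it was seen
            last_bi_time = row.Index
            out.append(row.Bi)
        elif pd.notna(row.Icats) and (
                last_bi_time is None or row.Index - last_bi_time > window):
            # no Bi value yet, or the last one is older than the window
            out.append(row.Icats)
        else:
            out.append(np.nan)
    return pd.Series(out, index=df.index, name='To_set')


# small sample taken from the rows shown in the question
idx = pd.to_datetime(['2014-11-28 10:30', '2014-11-28 10:44',
                      '2014-11-28 10:45', '2014-11-28 11:15'])
sample = pd.DataFrame({'Bi': [np.nan, 0.227, np.nan, np.nan],
                       'Icats': [0.04022, np.nan, 0.04022, 0.047007]},
                      index=idx)
to_set = fill_to_set(sample)
```

Collecting the results in a list and building the column once avoids the repeated df.iloc writes, which are likely a large part of the loop's cost.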

Best answer

How important is it to you that None appears in the To_set column?

This problem is hard to solve without a for loop, because the decision about what To_set is set to depends on time-related conditions in previous rows.

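Here is an "outside the box" approach that does not rely on a loop. It also has no notion of None as a value of To_set; it simply keeps a running record of the current To_set value.

Recreating the DataFrame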

import pandas as pd
import numpy as np

timestamps = [pd.Timestamp('2014-11-28 10:17:00'), pd.Timestamp('2014-11-28 10:30:00'), pd.Timestamp('2014-11-28 10:32:00'), pd.Timestamp('2014-11-28 10:35:00'), pd.Timestamp('2014-11-28 10:38:00'), pd.Timestamp('2014-11-28 10:44:00'), pd.Timestamp('2014-11-28 10:45:00'), pd.Timestamp('2014-11-28 10:47:00'), pd.Timestamp('2014-11-28 10:50:00'), pd.Timestamp('2014-11-28 10:56:00'), pd.Timestamp('2014-11-28 10:59:00'), pd.Timestamp('2014-11-28 11:00:00'), pd.Timestamp('2014-11-28 11:14:00'), pd.Timestamp('2014-11-28 11:15:00'), pd.Timestamp('2014-11-28 11:30:00'), pd.Timestamp('2014-11-28 11:35:00'), pd.Timestamp('2014-11-28 11:45:00'), pd.Timestamp('2014-11-28 12:00:00'), pd.Timestamp('2014-11-28 12:15:00'), pd.Timestamp('2014-11-28 12:30:00')]

data = {'Bi': [np.nan, np.nan, np.nan, 0.217, 0.365, 0.22699999999999998, np.nan, 0.149, 0.109,
np.nan, 0.065, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Dummy1': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Dummy2': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Dummy3': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,
np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Icats': [np.nan, 0.04022, np.nan, np.nan, np.nan, np.nan, 0.04022, np.nan, np.nan, np.nan, np.nan,
0.063687, np.nan, 0.047007, 0.041165, np.nan, 0.0406, 0.039667, 0.03946,
0.038955000000000004],
'To_set': ['None', 0.040219999999999999, 'None', '0.217', '0.365', '0.227',
'None', '0.149', '0.109', 'None', '0.065', 'None', 'None',
'0.0470067', '0.041165', 'None', '0.0406', '0.0396667', '0.03946',
'0.038955']}

columns = ['Dummy1', 'Dummy2', 'Dummy3', 'Bi', 'Icats', 'To_set']

original_df = pd.DataFrame(data, index=timestamps, columns=columns)

original_df looks like this:

                     Dummy1  Dummy2  Dummy3     Bi     Icats     To_set
2014-11-28 10:17:00 NaN NaN NaN NaN NaN None
2014-11-28 10:30:00 NaN NaN NaN NaN 0.040220 0.04022
2014-11-28 10:32:00 NaN NaN NaN NaN NaN None
2014-11-28 10:35:00 NaN NaN NaN 0.217 NaN 0.217
2014-11-28 10:38:00 NaN NaN NaN 0.365 NaN 0.365
2014-11-28 10:44:00 NaN NaN NaN 0.227 NaN 0.227
2014-11-28 10:45:00 NaN NaN NaN NaN 0.040220 None
2014-11-28 10:47:00 NaN NaN NaN 0.149 NaN 0.149
2014-11-28 10:50:00 NaN NaN NaN 0.109 NaN 0.109
2014-11-28 10:56:00 NaN NaN NaN NaN NaN None
2014-11-28 10:59:00 NaN NaN NaN 0.065 NaN 0.065
2014-11-28 11:00:00 NaN NaN NaN NaN 0.063687 None
2014-11-28 11:14:00 NaN NaN NaN NaN NaN None
2014-11-28 11:15:00 NaN NaN NaN NaN 0.047007 0.0470067
2014-11-28 11:30:00 NaN NaN NaN NaN 0.041165 0.041165
2014-11-28 11:35:00 NaN NaN NaN NaN NaN None
2014-11-28 11:45:00 NaN NaN NaN NaN 0.040600 0.0406
2014-11-28 12:00:00 NaN NaN NaN NaN 0.039667 0.0396667
2014-11-28 12:15:00 NaN NaN NaN NaN 0.039460 0.03946
2014-11-28 12:30:00 NaN NaN NaN NaN 0.038955 0.038955

Here is the code for the next part; I explain it afterwards:

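df = original_df.copy()
df.drop('To_set', axis=1, inplace=True)

# pd.date_range replaces the DatetimeIndex(start=..., end=..., freq=...)
# constructor, which recent pandas versions no longer accept
new_index = pd.date_range(start=df.index.min(), end=df.index.max(), freq='1min')
df = df.reindex(new_index)
df['Bi'] = df['Bi'].ffill(limit=10)
df['To_set_NEW'] = df['Bi'].combine_first(df['Icats']).ffill()
compare_df = df.loc[original_df.index]

  1. Copy the original DataFrame and call it df.
  2. Drop To_set from df.
  3. Reindex df with a new index at 1-minute frequency to fill in the missing time slots. If your df spans a long period, this approach can get ugly :) because it creates a row for every minute of every day. If you don't hit a memory error, carry on...
  4. Forward-fill column Bi, limited to at most 10 fills.
  5. Use combine_first to combine Bi and Icats. This works because if Bi has not been forward-filled within the last 10 minutes and Icats has a value, the Icats value is chosen. The trailing ffill() then carries the latest value forward into otherwise empty rows.
  6. Select the rows at original_df's index into compare_df so it can be compared with original_df, to assess whether it does what you want.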
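
Steps 4 and 5 can be seen in isolation on a tiny series. The sketch below uses made-up values and a limit of 2 instead of 10 so the effect is visible in six rows; the names bi, icats, bi_filled and combined are illustrative only.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2014-11-28 10:00', periods=6, freq='1min')
bi = pd.Series([0.2, np.nan, np.nan, np.nan, np.nan, np.nan], index=idx)
icats = pd.Series([np.nan, 0.04, np.nan, 0.05, np.nan, 0.06], index=idx)

# step 4: a Bi value stays "current" for at most 2 further rows (minutes)
bi_filled = bi.ffill(limit=2)

# step 5: take Bi where it is present, otherwise fall back to Icats
combined = bi_filled.combine_first(icats)
```

Here the Icats value at 10:01 is ignored because the Bi value from 10:00 is still within the fill limit, while the Icats values at 10:03 and 10:05 are taken because Bi has expired by then.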

You can compare the output with this:

output = pd.DataFrame({'To_set': original_df['To_set'], 'To_set_NEW': compare_df['To_set_NEW']})

The output looks like this:

                        To_set  To_set_NEW
2014-11-28 10:17:00 None NaN
2014-11-28 10:30:00 0.04022 0.040220
2014-11-28 10:32:00 None 0.040220
2014-11-28 10:35:00 0.217 0.217000
2014-11-28 10:38:00 0.365 0.365000
2014-11-28 10:44:00 0.227 0.227000
2014-11-28 10:45:00 None 0.227000
2014-11-28 10:47:00 0.149 0.149000
2014-11-28 10:50:00 0.109 0.109000
2014-11-28 10:56:00 None 0.109000
2014-11-28 10:59:00 0.065 0.065000
2014-11-28 11:00:00 None 0.065000
2014-11-28 11:14:00 None 0.065000
2014-11-28 11:15:00 0.0470067 0.047007
2014-11-28 11:30:00 0.041165 0.041165
2014-11-28 11:35:00 None 0.041165
2014-11-28 11:45:00 0.0406 0.040600
2014-11-28 12:00:00 0.0396667 0.039667
2014-11-28 12:15:00 0.03946 0.039460
2014-11-28 12:30:00 0.038955 0.038955

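Is all of this best practice?

Probably not, but it is another way of looking at the problem. np.where(cond, value_if_true, value_if_false) can also come in handy here. The difficulty is that you limit a rolling window to 10 minutes based on the timestamp of the current row. Maybe someone else has a better idea!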
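
One possible way to use np.where without reindexing to every minute is to track the timestamp of the most recent Bi value and compare it with each row's own timestamp. This is a sketch under assumptions rather than part of the original answer: the function name pick_bi_or_icats, the window parameter, and the sample frame are hypothetical, and it yields NaN where the question's loop leaves None.

```python
import numpy as np
import pandas as pd


def pick_bi_or_icats(df, window=pd.Timedelta(minutes=10)):
    """Hypothetical helper: Bi if present, else Icats unless Bi is recent."""
    has_bi = df['Bi'].notna()
    times = df.index.to_series()
    # timestamp of the latest row with a Bi value (NaT before the first one)
    last_bi_time = times.where(has_bi.to_numpy()).ffill()
    gap = times - last_bi_time
    # comparisons against NaT are False, so the "no Bi yet" case is handled
    # explicitly with isna()
    no_recent_bi = last_bi_time.isna() | (gap > window)
    values = np.where(has_bi, df['Bi'],
                      np.where(no_recent_bi, df['Icats'], np.nan))
    return pd.Series(values, index=df.index, name='To_set')


# a few rows taken from the question's data
idx = pd.to_datetime(['2014-11-28 10:30', '2014-11-28 10:44',
                      '2014-11-28 10:45', '2014-11-28 10:59',
                      '2014-11-28 11:00', '2014-11-28 11:15'])
sample = pd.DataFrame(
    {'Bi': [np.nan, 0.227, np.nan, 0.065, np.nan, np.nan],
     'Icats': [0.04022, np.nan, 0.04022, np.nan, 0.063687, 0.047007]},
    index=idx)
result = pick_bi_or_icats(sample)
```

Because no extra rows are created, memory use stays proportional to the original DataFrame, at the cost of not keeping the running record of the current value that the reindexing approach produces.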

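About "python-3.x - How can I efficiently iterate over a pandas DataFrame row by row while running code on each row that looks at values in previous rows?": a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45993156/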
