
python - Regrouping a dataframe in pandas using a key. A faster way than iterating over rows?

Reposted. Author: 行者123. Updated: 2023-12-01 00:08:34

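Summary:

I want to arrange time-series codes (a large dataset) that mark the start and end of actions into a Gantt chart, so I need to regroup them into Task (name), Start (time), and Finish (time) columns. So far, however, I can only do this very slowly by iterating over every row with for loops :(

(I have been experimenting with groupby and pivot, but I don't have a good enough grasp of these functions yet to make them do what I want.)

I have a "key" dictionary/df that contains the start_code, the end_code, and the action label. Simplified example: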

import pandas as pd
code_key_cols = ["start_code", "end_code", "label"]
code_key = [[1, 2, "a"],
[3, 4, "b"],
[5, 6, "c"],
[7, 8, "d"]]
code_df = pd.DataFrame(code_key, columns=code_key_cols)

Out[]:
   start_code  end_code label
0           1         2     a
1           3         4     b
2           5         6     c
3           7         8     d

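Data

Then I have a bunch of data that is just a time series of these codes. I want to organize it so that I can plot a Gantt chart. For the plot, that means having Task, Start, and Finish columns.

(This just creates fake data for the sake of the example, mimicking the behavior of the actual data, where the same type of action cannot run twice in parallel, although different actions can overlap.)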

from random import shuffle
data = []
for i in range(3000):
    start_codes = [x for x in code_df.iloc[:, 0]]
    end_codes = [x for x in code_df.iloc[:, 1]]
    shuffle(start_codes)
    shuffle(end_codes)
    data.extend(start_codes)
    data.extend(end_codes)

data_cols = ["code", "time"]
data_df = pd.DataFrame()
data_df['code'] = data
data_df['time'] = pd.date_range(start="19700101", periods=len(data))

print(data_df.head())
   code       time
0     3 1970-01-01
1     1 1970-01-02
2     7 1970-01-03
3     5 1970-01-04
4     2 1970-01-05
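
The assumption stated above (the same action never overlaps itself) is what allows the n-th start of an action to be paired with its n-th end. A minimal sketch of how that assumption could be checked before relying on it follows; the function name `codes_alternate` and the tiny `good`/`bad` frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

def codes_alternate(data_df, code_df):
    """Return True if, for every action, events strictly alternate
    start, end, start, end, ... in time order."""
    for row in code_df.itertuples(index=False):
        # Keep only this action's events, in chronological order.
        events = data_df[data_df["code"].isin([row.start_code, row.end_code])]
        codes = events.sort_values("time")["code"].tolist()
        # A well-formed sequence is the (start, end) pair repeated.
        expected = [row.start_code, row.end_code] * (len(codes) // 2)
        if codes != expected:
            return False
    return True

# Hypothetical miniature data, not the question's random data.
code_df = pd.DataFrame({"start_code": [1, 3], "end_code": [2, 4], "label": ["a", "b"]})
good = pd.DataFrame({"code": [3, 1, 2, 4, 1, 3, 4, 2],
                     "time": pd.date_range("1970-01-01", periods=8)})
bad = pd.DataFrame({"code": [1, 1, 2, 2],  # action "a" starts twice before ending
                    "time": pd.date_range("1970-01-01", periods=4)})

print(codes_alternate(good, code_df))
print(codes_alternate(bad, code_df))
```

If the check fails on real data, pairing by position would silently match the wrong start and end times, so it may be worth running once before building the chart.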

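My attempt:

I can do it, but only in a very slow way, iterating row by row! I am sure pandas has a more efficient way to do this. How would you do it? This is how I did it, but for a df of 12K rows it takes more than 13 seconds :(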

import numpy as np
lst = []
for _, code_row in code_df.iterrows():
    begin = True
    task = np.nan
    start = np.nan
    finish = np.nan
    for _, data_row in data_df.iterrows():
        if begin:
            if code_row['start_code'] == data_row['code']:
                task = code_row['label']
                start = data_row['time']
                begin = False
        else:
            if code_row['end_code'] == data_row['code']:
                finish = data_row['time']
                begin = True
                lst.append([task, start, finish])

df3 = pd.DataFrame(data=lst, columns=["Task", 'Start', 'Finish'])

Output

For context, I will show the goal: a Gantt chart plotted with the following code (to keep it simple, change the for i in range above from 3000 to 10).

import plotly.figure_factory as ff
import plotly.io as pio
pio.renderers.default = "browser"

fig = ff.create_gantt(df3, group_tasks=True)
fig.show()

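[Image: Gantt chart example with 10 iterations]

By the way, if you have read this far, thank you very much for your time! :)

Best answer

Hope this helps. This should give you the same output: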

# we'll create a new dataframe out of two slices on data_df (resulting in two new dataframes), namely those rows belonging to start_code and those belonging to end_code.
# next, sort the slices on code and time such that our slices match in order (this builds on the concurrent assumption you stated)
# drop unwanted columns and rename others as desired
# reset indices as otherwise pd.concat tries to adhere to the old indices
# merge the labels from code_df

df3_new = pd.concat([
    data_df[data_df.code.isin(code_df.start_code)]
        .sort_values(['code', 'time'])
        .reset_index(drop=True)
        .rename(columns={'time': 'Start'}),
    data_df[data_df.code.isin(code_df.end_code)]
        .sort_values(['code', 'time'])
        .reset_index(drop=True)
        .rename(columns={'time': 'Finish'})
        .drop('code', axis=1)
], axis=1) \
    .merge(code_df, how='left', left_on='code', right_on='start_code') \
    .drop(['code', 'start_code', 'end_code'], axis=1) \
    .rename(columns={'label': 'Task'})

# which yields the same outcome (for the given set at least)
df3.equals(df3_new.loc[:, ['Task','Start', 'Finish']])
True
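
Since the question mentions groupby, here is a hedged sketch of an alternative that does not depend on the two slices lining up row by row. It numbers each start and end per action with `groupby(...).cumcount()` and then spreads them into columns with `unstack`. This is not the answer's method; the names `lookup`, `kind`, `occurrence`, and `gantt` and the miniature data are hypothetical, and it still assumes the n-th start of an action belongs with its n-th end.

```python
import pandas as pd

# Hypothetical miniature data with the same shape as the question's frames.
code_df = pd.DataFrame({"start_code": [1, 3], "end_code": [2, 4], "label": ["a", "b"]})
data_df = pd.DataFrame({"code": [3, 1, 2, 4, 1, 3, 4, 2],
                        "time": pd.date_range("1970-01-01", periods=8)})

# Map every code to its label and to the column it should end up in.
lookup = pd.concat([
    code_df[["start_code", "label"]].rename(columns={"start_code": "code"}).assign(kind="Start"),
    code_df[["end_code", "label"]].rename(columns={"end_code": "code"}).assign(kind="Finish"),
])

events = data_df.merge(lookup, on="code", how="inner").sort_values("time")
# Number the occurrences of each (label, kind) in time order, so the
# n-th start and the n-th finish of an action share the same number.
events["occurrence"] = events.groupby(["label", "kind"]).cumcount()

# Spread Start/Finish into columns, one row per (label, occurrence).
wide = events.set_index(["label", "occurrence", "kind"])["time"].unstack("kind")
gantt = wide.reset_index().rename(columns={"label": "Task"})[["Task", "Start", "Finish"]]
gantt.columns.name = None  # drop the leftover "kind" axis name
print(gantt)
```

The rows come out ordered by task and then by occurrence, which matches the Task, Start, Finish layout that the plotting code in the question expects.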

Average performance for the given set:

12.5 ms ± 435 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
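
The figure above is in the format printed by IPython's `%timeit` magic. Outside IPython, a comparable measurement can be sketched with the standard `timeit` module. The `candidate` function below is a hypothetical stand-in workload on made-up data; swap in whichever approach is being measured.

```python
import timeit
import pandas as pd

# Hypothetical data, only large enough to give a measurable runtime.
data_df = pd.DataFrame({"code": [3, 1, 2, 4] * 1000,
                        "time": pd.date_range("1970-01-01", periods=4000)})

def candidate():
    # Stand-in workload (an assumption): replace with the code to be timed.
    return data_df.sort_values(["code", "time"])

# Five repeats of 20 calls each; the minimum is the least noisy estimate.
runs = timeit.repeat(candidate, number=20, repeat=5)
best_ms = min(runs) / 20 * 1000
print(f"best of 5: {best_ms:.3f} ms per call")
```

Absolute numbers depend on the machine and the pandas version, so they are mainly useful for comparing two approaches on the same data.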

Regarding python - Regrouping a dataframe in pandas using a key. A faster way than iterating over rows?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59792375/
