python Pandas : transform start and end datetime range (stored as 2 columns) to individual rows (eqpt utilisation)


Hello, I have a dataset like the df below. I am providing both an image and a sample dataframe.

I want to transform the original dataframe (df) into a transformed dataframe (dft), so that I can see the utilisation of each piece of equipment over 24 hours (or even longer periods, up to 9 days) at 5-minute intervals. The dft can then be used for plotting, tooltips showing the desc, and so on.

Of course, if you have any alternative, simpler solution instead of my outline below, that is also fine.

Original dataframe (df)


Here is the dataframe (df) above, which you can copy-paste into jupyter to create it:

from io import StringIO
import pandas as pd

dfstr = StringIO(u"""
eqpt;starttm;endtm;use_count;desc
AT1;2017-04-01 10:35;2017-04-01 11:05;2;test asdf1
AT2;2017-04-01 11:00;2017-04-01 11:30;5;test asdf2
AT1;2017-04-01 11:00;2017-04-01 11:30;4;test asdf3
AT3;2017-04-01 10:45;2017-04-01 11:45;3;test asdf4
CBL1;2017-04-01 11:10;2017-04-01 11:40;4;test asdf5
""")
df = pd.read_csv(dfstr, sep=";")
df

I want to transform df into individual rows per equipment, say with start and end times running from 2017-04-01 00:00 to 23:55, so that I know each equipment's utilisation on a 5-minute grid, and can then plot and resample (say, hourly max) for summarising, etc.
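The heart of this transform is blowing one (starttm, endtm) interval up into one row per 5-minute slot. A minimal sketch of that idea, using the column names from the sample df (note that pd.date_range with both endpoints is end-inclusive):

```python
import pandas as pd

# One booking row: AT1 in use from 10:35 to 11:05 with use_count 2.
row = pd.Series({
    "eqpt": "AT1",
    "starttm": pd.Timestamp("2017-04-01 10:35"),
    "endtm": pd.Timestamp("2017-04-01 11:05"),
    "use_count": 2,
})

# Expand the interval into one row per 5-minute slot.
grid = pd.date_range(row["starttm"], row["endtm"], freq="5min")
expanded = pd.DataFrame(
    {"eqpt": row["eqpt"], "use_count": row["use_count"]}, index=grid
)
print(expanded)
```

The scalar eqpt and use_count values are broadcast down the DatetimeIndex, giving 7 rows (10:35 through 11:05 inclusive).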

Transformed dataframe (dft)

Here is the image after transformation... and the sample resulting dataframe (dft) below:

The columns of this dataframe come from the 'eqpt' column of the original dataframe.

I just realised that if the use_counts need to stay aggregable as numbers, the desc columns cannot live in the same dataframe dft. So please suggest any alternative solution that achieves the same result but keeps the count columns as float only, with the desc text aggregated elsewhere; it can be merged or looked up later.


Here is the dataframe (dft) above:

dftstr = StringIO(u"""
datetime;Item;AT1;AT2;AT3;CBL1;AT_n
2017-04-01 10:30;use_count;;;;;
2017-04-01 10:35;use_count;2;;;;
2017-04-01 10:40;use_count;2;;;;
2017-04-01 10:45;use_count;2;;3;;
2017-04-01 10:50;use_count;2;;3;;
2017-04-01 10:55;use_count;2;;3;;
2017-04-01 11:00;use_count;6;5;3;;
2017-04-01 11:05;use_count;4;5;3;;
2017-04-01 11:10;use_count;4;5;3;4;
2017-04-01 11:15;use_count;4;5;3;4;
2017-04-01 11:20;use_count;4;5;3;4;
2017-04-01 11:25;use_count;4;5;3;4;
2017-04-01 11:30;use_count;;;3;4;
2017-04-01 11:35;use_count;;;3;4;
2017-04-01 11:40;use_count;;;3;;
2017-04-01 11:45;use_count;;;;;
2017-04-01 11:50;use_count;;;;;
2017-04-01 11:55;use_count;;;;;
2017-04-01 12:00;use_count;;;;;
2017-04-01 10:30;desc;;;;;
2017-04-01 10:35;desc;2: test_adf1;similar desc;;;
2017-04-01 10:40;desc;2: test_adf1;for;;;
2017-04-01 10:45;desc;2: test_adf1;the;;;
2017-04-01 10:50;desc;2: test_adf1;rest;;;
2017-04-01 10:55;desc;2: test_adf1;of;;;
2017-04-01 11:00;desc;"2: test_asdf1
4: test_asdf3";the;;;
2017-04-01 11:05;desc;4: test_asdf3;columns;;;
2017-04-01 11:10;desc;4: test_asdf3;;;;
2017-04-01 11:15;desc;4: test_asdf3;;;;
2017-04-01 11:20;desc;4: test_asdf3;;;;
2017-04-01 11:25;desc;4: test_asdf3;;;;
2017-04-01 11:30;desc;;;;;
2017-04-01 11:35;desc;;;;;
2017-04-01 11:40;desc;;;;;
2017-04-01 11:45;desc;;;;;
2017-04-01 11:50;desc;;;;;
2017-04-01 11:55;desc;;;;;
2017-04-01 12:00;desc;;;;;
;;and so on from 00:00 to 23:55;;;;
""")
dft = pd.read_csv(dftstr, sep=";")
dft

Best Answer

A few steps are needed here. I used your setup, but immediately converted the timestamps into pandas datetime objects via parse_dates:

from io import StringIO
import pandas as pd

dfstr = StringIO(u"""
eqpt;starttm;endtm;use_count;desc
AT1;2017-04-01 10:35;2017-04-01 11:05;2;test asdf1
AT2;2017-04-01 11:00;2017-04-01 11:30;5;test asdf2
AT1;2017-04-01 11:00;2017-04-01 11:30;4;test asdf3
AT3;2017-04-01 10:45;2017-04-01 11:45;3;test asdf4
CBL1;2017-04-01 11:10;2017-04-01 11:40;4;test asdf5
""")

df = pd.read_csv(dfstr, sep=";", parse_dates=["starttm", "endtm"])
print(df)

  eqpt             starttm               endtm  use_count        desc
0  AT1 2017-04-01 10:35:00 2017-04-01 11:05:00          2  test asdf1
1  AT2 2017-04-01 11:00:00 2017-04-01 11:30:00          5  test asdf2
2  AT1 2017-04-01 11:00:00 2017-04-01 11:30:00          4  test asdf3
3  AT3 2017-04-01 10:45:00 2017-04-01 11:45:00          3  test asdf4
4 CBL1 2017-04-01 11:10:00 2017-04-01 11:40:00          4  test asdf5

Now, here are 3 functions that do the job:

  • expand takes a single row of the input df and creates a dataframe with a DatetimeIndex ranging from starttm to endtm in 5-minute intervals. In addition, the actual use_count and desc values are attached.
  • summarize handles overlaps, combining the desc strings and summing the use_counts when equipment is used multiple times at once. It has to type-check because the input may be a pandas Series or DataFrame: if only a single row is given for an equipment, a Series is passed; otherwise, a DataFrame is passed.
  • aggregate combines expand and summarize. First, all entries (rows) of a single equipment are expanded and concatenated. Then, the expanded columns are summarized.

That's it. Finally, you group the equipment with groupby and apply the aggregate function:

def expand(row):
    # One row per 5-minute slot between starttm and endtm (end-inclusive).
    index = pd.date_range(row["starttm"], row["endtm"], freq="5min")
    use_count = row["use_count"]
    desc = "{}:{}".format(use_count, row["desc"])

    return pd.DataFrame(index=index).assign(use_count=use_count, desc=desc)


def summarize(index, use_count, desc):
    # Overlapping bookings: sum the counts, join the descriptions.
    if isinstance(use_count, pd.DataFrame):
        use_count = use_count.sum(axis=1)

    if isinstance(desc, pd.DataFrame):
        desc = desc.apply(lambda x: ", ".join(x.dropna()), axis=1)

    return pd.DataFrame({"use_count": use_count, "desc": desc}, index=index)


def aggregate(sub_df):
    # Expand every booking of one equipment, then resolve the overlaps.
    dfs = pd.concat([expand(series) for idx, series in sub_df.iterrows()], axis=1)
    return summarize(dfs.index, dfs["use_count"], dfs["desc"])


transformed = df.groupby("eqpt").apply(aggregate).unstack("eqpt")

The resulting dataframe has MultiIndex columns to distinguish desc and use_count, allowing correct data types:

print(transformed["use_count"])

eqpt                 AT1  AT2  AT3  CBL1
2017-04-01 10:35:00  2.0  NaN  NaN   NaN
2017-04-01 10:40:00  2.0  NaN  NaN   NaN
2017-04-01 10:45:00  2.0  NaN  3.0   NaN
2017-04-01 10:50:00  2.0  NaN  3.0   NaN
2017-04-01 10:55:00  2.0  NaN  3.0   NaN
2017-04-01 11:00:00  6.0  5.0  3.0   NaN
2017-04-01 11:05:00  6.0  5.0  3.0   NaN
2017-04-01 11:10:00  4.0  5.0  3.0   4.0
2017-04-01 11:15:00  4.0  5.0  3.0   4.0
2017-04-01 11:20:00  4.0  5.0  3.0   4.0
2017-04-01 11:25:00  4.0  5.0  3.0   4.0
2017-04-01 11:30:00  4.0  5.0  3.0   4.0
2017-04-01 11:35:00  NaN  NaN  3.0   4.0
2017-04-01 11:40:00  NaN  NaN  3.0   4.0
2017-04-01 11:45:00  NaN  NaN  3.0   NaN


print(transformed)

desc use_count
eqpt AT1 AT2 AT3 CBL1 AT1 AT2 AT3 CBL1
2017-04-01 10:35:00 2:test asdf1 None None None 2.0 NaN NaN NaN
2017-04-01 10:40:00 2:test asdf1 None None None 2.0 NaN NaN NaN
2017-04-01 10:45:00 2:test asdf1 None 3:test asdf4 None 2.0 NaN 3.0 NaN
2017-04-01 10:50:00 2:test asdf1 None 3:test asdf4 None 2.0 NaN 3.0 NaN
2017-04-01 10:55:00 2:test asdf1 None 3:test asdf4 None 2.0 NaN 3.0 NaN
2017-04-01 11:00:00 2:test asdf1, 4:test asdf3 5:test asdf2 3:test asdf4 None 6.0 5.0 3.0 NaN
2017-04-01 11:05:00 2:test asdf1, 4:test asdf3 5:test asdf2 3:test asdf4 None 6.0 5.0 3.0 NaN
2017-04-01 11:10:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:15:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:20:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:25:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:30:00 4:test asdf3 5:test asdf2 3:test asdf4 4:test asdf5 4.0 5.0 3.0 4.0
2017-04-01 11:35:00 None None 3:test asdf4 4:test asdf5 NaN NaN 3.0 4.0
2017-04-01 11:40:00 None None 3:test asdf4 4:test asdf5 NaN NaN 3.0 4.0
2017-04-01 11:45:00 None None 3:test asdf4 None NaN NaN 3.0 NaN

To span the datetime index across a whole day, you can use reindex:

transformed.reindex(pd.date_range("2017-04-01 00:00", "2017-04-01 23:55", freq="5min"))
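The question also mentions resampling to hourly maxima for summarising; the numeric use_count block lends itself to resample. A minimal self-contained sketch (the counts frame below is a hand-built stand-in for transformed["use_count"]):

```python
import pandas as pd

# Stand-in for transformed["use_count"]: 5-minute counts for two machines.
idx = pd.date_range("2017-04-01 10:35", "2017-04-01 11:45", freq="5min")
counts = pd.DataFrame({
    "AT1": [2, 2, 2, 2, 2, 6, 6, 4, 4, 4, 4, 4, None, None, None],
    "AT3": [None, None, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
}, index=idx)

# Hourly peak utilisation per equipment; NaN slots are ignored by max().
hourly_max = counts.resample("1h").max()
print(hourly_max)
```

Each output row is one hour bin (10:00, 11:00), holding the largest 5-minute count that fell inside it, which is exactly the "max per 1 hour for summarising" the question asks for.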

Regarding python Pandas: transform start and end datetime range (stored as 2 columns) to individual rows (eqpt utilisation), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43154462/
