Python: using pandas.pivot_table to flatten an event log and show the time spent on events


I'm looking at using Python and pandas to flatten our VLE (Blackboard Inc.) event tables. I'm trying to total the time spent each day accessing courses, as opposed to the other non-course events in the event log/table.

I've created some fake data and code (Python) below to simulate the problem and where I'm stuck. It's the flattened_v2 part I'm struggling with, as that is closest to my real situation.

The log data typically looks like this (the activity dataframe created in the code below):

          DAY    event somethingelse  timespent             logtime
0  2013-01-02     null           foo   0.274139 2013-01-02 00:00:00
0  2013-01-02  course1           foo   1.791061 2013-01-02 01:00:00
1  2013-01-02  course1           foo   0.824152 2013-01-02 02:00:00
2  2013-01-02  course1           foo   1.626477 2013-01-02 03:00:00

In the real data I have a field called logtime. It is an actual datetime, not a time-spent field (the fake data also contains a timespent column from when I was experimenting).

How do I record the total time spent on event = course (many courses), using logtime?

Each record contains a logtime showing the datetime the page was accessed; the next record's logtime shows the datetime a new page was accessed, and therefore the old page was left (close enough). How do I get the total time where event is not null? If I simply take max - min, the gaps between course accesses (where event = null) are included too, which overestimates the time. I've simplified the fake data so each record increments by one hour, but that is not the real situation.
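To make the overestimate concrete, here is a minimal sketch with made-up session times: taking max - min counts the null gap as course time.

import pandas as pd

# Made-up day: course pages hit from 01:00-03:00 and again 06:00-07:00,
# with a null (non-course) gap from 03:00 to 06:00.
hits = pd.to_datetime(['2013-01-02 01:00', '2013-01-02 02:00',
                       '2013-01-02 03:00', '2013-01-02 06:00',
                       '2013-01-02 07:00'])

print(hits.max() - hits.min())  # 6 hours, but true course time is only 3 hours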

Thanks for any tips, Jason

The code is:

# dataframe example
# How do I record total time spent on event = course (many courses)?
# Each record contains logtime which shows datetime to access page
# Next record logtime shows the datetime accessing new page and
# therefore leaving old page (close enough)
#
#

import pandas as pd
import numpy as np
import datetime


# Creating fake data with string null and course1, course2
df = pd.DataFrame({
    'DAY': pd.Timestamp('20130102'),
    'timespent': abs(np.random.randn(5)),
    'event': "course1",
    'somethingelse': 'foo'})

df2 = pd.DataFrame({
    'DAY': pd.Timestamp('20130102'),
    'timespent': abs(np.random.randn(5)),
    'event': "course2",
    'somethingelse': 'foo'})

dfN = pd.DataFrame({
    'DAY': pd.Timestamp('20130102'),
    'timespent': abs(np.random.randn(1)),
    'event': "null",
    'somethingelse': 'foo'})


dfLog = [dfN, df, df2, dfN, dfN, dfN, df2, dfN, dfN, df, dfN, df2, dfN, df, df2, dfN]
activity = pd.concat(dfLog)
# add time column
times = pd.date_range('20130102', periods=activity.shape[0], freq='H')
activity['logtime'] = times

# activity contains a DAY field (probably not required)
# timespent - fake time spent on each event; not in my real
#   data but I started this way when faking data
# event - either a course or null (not a course)
# somethingelse - just there to indicate other data
#

print(activity)  # This is quite close to real data.

# Fake activity data created above to demo the question.

# *********************************************
# Actual code to extract time spent on courses
# *********************************************

# Helper functions to aggregate data (max and min)

# Where the time difference is explicit (timespent values)
def agg_timespent(a, b):
    c = abs(b - a)
    return c

# Where the time difference is not explicit but is derived
# from the time recorded when accessing a page (course event)
def agg_logtime(a, b):
    # In real data b and a are strings:
    # b = datetime.datetime.strptime(b, '%Y-%m-%d %H:%M:%S')
    # a = datetime.datetime.strptime(a, '%Y-%m-%d %H:%M:%S')
    # use total_seconds(), since .seconds discards whole days
    c = abs(b - a).total_seconds()
    return c



# Remove 'null' data as that's not of interest here.
# null means non course activity e.g. checking email
# or timetable -non course stuff.
activity = activity[activity.event != 'null']

print(activity)  # This shows *just* course activity info

# pivot by Day (only 1 day in fake data but 1 year in real data)
# Don't need DAY field but helped me fake-up data
flattened_v1 = activity.pivot_table(index=['DAY'], values=['timespent'],
                                    aggfunc=[min, max], fill_value=0)
flattened_v1['time_diff'] = flattened_v1.apply(
    lambda row: agg_timespent(row[0], row[1]), axis=1)


# How to achieve this?
# Where NULL has been removed I think this is wrong as NULL records could
# indicate several hours gap between course accesses but as
# I'm using MAX and MIN then I'm ignoring the periods of null
# This is overestimating time on courses
# I need to subtract/remove/ignore?? the hours spent on null times

flattened_v2 = activity.pivot_table(index=['DAY'], values=['logtime'],
                                    aggfunc=[min, max], fill_value=0)
flattened_v2['time_diff'] = flattened_v2.apply(
    lambda row: agg_logtime(row[0], row[1]), axis=1)

print()
print('*****Wrong!**********')
print('This is not what I have, just showing how I thought it might work.')
print(flattened_v1)
print()
print('******Not sure how to do this*********')
print('This is wrong as the nulls/gaps are included too')
print(flattened_v2)

Best Answer

You're right (as you said in your comment): you need DataFrame.shift.

If I understand your question correctly, you want to record the time elapsed since the previous timestamp: a timestamp marks the start of an event, and when the previous event was null we should not record any elapsed time. Assuming all that is right, use shift to add a time-difference column:

activity['timelog_diff'] = activity['logtime'] - activity['logtime'].shift()

Now the first row will hold the special "not a time" value NaT, but that's fine, since we can't compute an elapsed time there. Next, we can fill in further NaT values wherever the elapsed time immediately follows a null event:

mask = activity.event == 'null'
activity.loc[mask.shift(1).fillna(False), 'timelog_diff'] = pd.NaT
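A quick aside on what that mask shift does: shift(1) moves each null flag down one row, so it flags the record immediately after a null event. A minimal illustration with made-up values:

s = pd.Series(['null', 'course1', 'course1'])
mask = s == 'null'                       # [True, False, False]
print(mask.shift(1).fillna(False).tolist())
# [False, True, False] -> the row right after the null event is flagged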

When we want to know how much time was spent on course1, we have to shift again, because indexing on the course1 rows themselves gives the rows where course1 is starting; we need the rows where course1 is finishing / has just finished:

activity[(activity.event == 'course1').shift().fillna(False)]['timelog_diff'].sum()

With your example data this gives 15 hours for course1 and 20 hours for course2.
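Putting the steps together, here is a minimal end-to-end sketch, assuming the activity frame from the question still contains its null rows (i.e. before the activity = activity[activity.event != 'null'] line); the final groupby is my own generalisation that totals every course at once instead of naming each one:

# elapsed time since the previous page hit
activity['timelog_diff'] = activity['logtime'] - activity['logtime'].shift()

# time elapsed right after a null event is not course time
mask = activity.event == 'null'
activity.loc[mask.shift(1).fillna(False), 'timelog_diff'] = pd.NaT

# each elapsed interval belongs to the event on the *previous* row,
# so group by the shifted event column and total per course
totals = activity.groupby(activity.event.shift())['timelog_diff'].sum()
print(totals.drop('null', errors='ignore'))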

The original question, "Python: Flattening an event log and showing time spent on events using pandas.pivot_table", can be found on Stack Overflow: https://stackoverflow.com/questions/37977881/
