
python - How to group a dataframe by 4 time periods and a key


I have a dataset that looks like this:

               date  area_key  total_units            timeatend  starthour  timedifference                  vps
2020-01-15 08:22:39         0         9603  2020-01-15 16:32:39          8         29400.0  0.32663265306122446
2020-01-13 08:22:07         0        10273  2020-01-13 16:25:08          8         28981.0  0.35447362064801075
2020-01-23 07:16:55         3         5175  2020-01-23 14:32:44          7         26149.0  0.19790431756472524
2020-01-15 07:00:06         1          838  2020-01-15 07:46:29          7          2783.0   0.3011139058569889
2020-01-15 08:16:01         1         5840  2020-01-15 12:41:16          8         15915.0   0.3669494187873076
This is then fed into the following function to create the k-means clusters.
import time
import sklearn.cluster as clstr

def cluster_Volume(inputData):
    start_tot = time.time()
    Volume = inputData.groupby(['Startdtm'])['vehiclespersec'].sum().unstack()

    ## 4 Clusters
    model = clstr.MiniBatchKMeans(n_clusters=5)
    model.fit(Volume.fillna(0))
    Volume['kmeans_4'] = model.predict(Volume.fillna(0))
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    ## 8 Clusters
    start_tot = time.time()
    model = clstr.KMeans(n_clusters=8)
    model.fit(Volume.fillna(0))
    Volume['kmeans_8'] = model.predict(Volume.fillna(0))
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    ## Looking at hourly distribution.
    start_tot = time.time()
    Volume_Hourly = Volume.reset_index().set_index(['Startdtm'])
    Volume_Hourly['hour'] = Volume_Hourly.index.hour
    end_tot = time.time()
    print("Completed in " + str(end_tot - start_tot))

    return Volume, Volume_Hourly
What I want to do is relate these clusters to time periods and to the key.

With the time periods being:
7 am to 10 am, 4 pm to 6 pm, 12 pm to 2 pm, 6 pm to 12 am, 12 am to 7 am, 10 am to 12 pm, and 2 pm to 4 pm as the remaining periods.

And, using the key, to show programmatically how each cluster differs.

Desired outcome

The desired result would be a table along the lines of the one below, but feel free to develop it in whatever way you think best. By time period I mean, say, 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, and so on. But feel free to change that as you see fit - it's just my thinking.

I've tried a few groupby-based approaches to this, but it doesn't seem to work well - hoping to get some guidance here.
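Something along these lines is roughly what I have in mind for the bucketing (a rough sketch only; the bin edges and labels are just placeholders, using the starthour column from my sample data):

import pandas as pd

def add_time_period(df):
    # Rough sketch: bucket starthour into labelled periods.
    # Edges/labels are placeholders, not a final choice.
    bins = [0, 7, 10, 12, 14, 16, 18, 24]
    labels = ['00-07', '07-10', '10-12', '12-14', '14-16', '16-18', '18-24']
    df['time_period'] = pd.cut(df['starthour'], bins=bins, labels=labels, right=False)
    return df

# e.g. add_time_period(df).groupby(['time_period', 'area_key'])['vps'].mean()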
Amazing response, wow, cheers. It made me realise I was approaching this the wrong way, but it is still extremely valuable for fixing my approach.

This data is at the level of individual events, for example:
  DateTimeStamp  VS_ID  VS_Summary_Id   Hostname  Vehicle_speed  Lane  Length
11/01/2019 8:22      1              1  place_uno             65     2      71
11/01/2019 8:22      2              1  place_uno             59     1     375
11/01/2019 8:22      3              1  place_uno             59     1     389
11/01/2019 8:22      4              1  place_duo             59     1     832
11/01/2019 8:22      5              1  place_duo             52     1     409
To get volumes, I need to aggregate over time into smaller volume chunks (15 seconds or 15 minutes; I'll post the code below).

Then it's basically the same idea. An additional, and greedy, question would be - how would I interpolate speed into this measurement? I.e., large amounts of volume but low speeds would also be good to cater for.

Amazing, amazing stuff.

For these per-15-second volume calculations, I want to cluster them, since the summary table is too broad a base. I've realised that the summary table is too broad, so clustering over time won't work unless I use this: k-means over time fits the new data best, and the average speed of each cluster would allow speed to be taken into account, if that makes sense.

Thanks again for the amazing help. I'll adapt this to the code below - I forgot to link it earlier, but it should help make this more concrete and more valuable.

Thank you all!

Best Answer

First data (note: the later sections relate to the updates)

The data is very limited, probably because simplifying it was complex, so I will make some assumptions and write things as generically as possible, so that you can quickly customise it to your needs.

Assumptions:

  • You want to group the data by hourly windows ("hour_code") (so the column used to divide the set is parameterised as group_divide_set_by_column)
  • For each hourly window ("hour_code"), you want to cluster by location using the K-means algorithm

  • Doing so lets you investigate the vehicle clusters of each hourly window separately, and see which cluster areas are more active and need attention.
    Notes:
  • The location column (although mentioned) is missing, and it is required for the K-means algorithm (I used HostName_key, but it is only a dummy value so the code can run; it is not necessarily meaningful).
    In general, the K-means algorithm is used in spaces with a Euclidean distance (mathematically speaking, this means partitioning the observations according to the Voronoi diagram generated by the means.)
    Here are some sources of k-means Python examples that can be used to customise it further: 1 2 3 4.

  • Code:
  • Let's define a function that, given a dataframe, divides it by a given column, group_divide_set_by_column.

  • This will let us divide into "hour_code" groups and then cluster by location.
def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=[2, 3]):
    # Divide the set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d
                        for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide into the desired numbers of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Set the column types
        set_colum_types(group_df)
    return lst_df_by_groups
  • Function #1 uses another function to convert hour into hour codes, along the lines of your wording:

  • Time period meaning, say 1 would be before 6 am, 2 - 6 am to 9 am, 3 - 9 to 11, 4 - 11 to 14, etc..

def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours into groups:
         1-5  => 1
         6-8  => 2
         9-11 => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if h < start_threshold:
            return 1
        elif h >= end_threshold:
            return end_threshold // windows
        return h // windows
    df['hour_code'] = df['starthour'].apply(lambda h: get_hour_code(h))
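As a quick sanity check of the bucketing (a toy sketch; the frame below is made up):

import pandas as pd

toy = pd.DataFrame({'starthour': [5, 6, 8, 9, 11, 12, 14, 15, 17, 18, 20, 21, 23]})
divide_df_by_hours(toy)
print(toy['hour_code'].tolist())
# Per the docstring: [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]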
  • In addition, function #1 uses the set_colum_types function to convert each column to its matching type:
def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)
  • A dedicated timeit decorator is used to measure the time of each clustering run, which cuts down on boilerplate code.

  • Full code:
import functools

import pandas as pd

from timeit import default_timer as timer

from sklearn.cluster import KMeans

def timeit(func):
    @functools.wraps(func)
    def newfunc(*args, **kwargs):
        startTime = timer()
        result = func(*args, **kwargs)
        elapsedTime = timer() - startTime
        print('function [{}] finished in {} ms'.format(
            func.__name__, int(elapsedTime * 1000)))
        return result
    return newfunc

def set_colum_types(df):
    types_dict = {
        'Startdtm': 'datetime64[ns, Australia/Melbourne]',
        'HostName_key': 'category',
        'Totalvehicles': 'int32',
        'Enddtm': 'datetime64[ns, Australia/Melbourne]',
        'starthour': 'int32',
        'timedelta': 'float',
        'vehiclespersec': 'float',
    }
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)

@timeit
def create_cluster(df, clusters_number):
    # Create the K-means model
    model = KMeans(n_clusters=clusters_number, max_iter=600, random_state=9)
    # Fetch the location
    # NOTE: should be a *real* location; another column is used here as a dummy
    location_df = df[['HostName_key']]
    kmeans = model.fit(location_df)
    # Assign the cluster labels
    df[f'kmeans_{clusters_number}'] = kmeans.labels_

def divide_df_by_hours(df):
    def get_hour_code(h, start_threshold=6, end_threshold=21, windows=3):
        """
        Divide hours into groups:
         1-5  => 1
         6-8  => 2
         9-11 => 3
        12-14 => 4
        15-17 => 5
        18-20 => 6
        21+   => 7
        """
        if h < start_threshold:
            return 1
        elif h >= end_threshold:
            return end_threshold // windows
        return h // windows
    df['hour_code'] = df['starthour'].apply(lambda h: get_hour_code(h))

def create_clusters_by_group(df, group_divide_set_by_column='hour_code', clusters_number_list=[2, 3]):
    # Divide the set by hours
    divide_df_by_hours(df)
    lst_df_by_groups = {f'{group_divide_set_by_column}_{i}': d
                        for i, (g, d) in enumerate(df.groupby(group_divide_set_by_column))}
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        # Divide into the desired numbers of clusters
        for clusters_number in clusters_number_list:
            create_cluster(group_df, clusters_number)
        # Set the column types
        set_colum_types(group_df)
    return lst_df_by_groups

# Load data
df = pd.read_csv('data.csv')

# Print data
print(df)

# Create clusters
lst_df_by_groups = create_clusters_by_group(df)

# For each hour-code group dataframe
for group_df_name, group_df in lst_df_by_groups.items():
    print(f'Group {group_df_name} dataframe:')
    print(group_df)
Example output:
              Startdtm  HostName_key  ...  timedelta  vehiclespersec
0  2020-01-15 08:22:39             0  ...    29400.0        0.326633
1  2020-01-13 08:22:07             2  ...    28981.0        0.354474
2  2020-01-23 07:16:55             3  ...    26149.0        0.197904
3  2020-01-15 07:00:06             4  ...     2783.0        0.301114
4  2020-01-15 08:16:01             1  ...    15915.0        0.366949
5  2020-01-16 08:22:39             2  ...    29400.0        0.326633
6  2020-01-14 08:22:07             2  ...    28981.0        0.354479
7  2020-01-25 07:16:55             4  ...    26149.0        0.197904
8  2020-01-17 07:00:06             1  ...     2783.0        0.301114
9  2020-01-18 08:16:01             1  ...    15915.0        0.366949

[10 rows x 7 columns]
    function [create_cluster] finished in 10 ms
    function [create_cluster] finished in 11 ms
    function [create_cluster] finished in 10 ms
    function [create_cluster] finished in 11 ms
    function [create_cluster] finished in 10 ms
    function [create_cluster] finished in 11 ms
Group hour_code_0 dataframe:
                    Startdtm  HostName_key  ...  kmeans_2  kmeans_3
0  2020-01-15 08:22:39+11:00             0  ...         1         1
1  2020-01-13 08:22:07+11:00             2  ...         0         0
2  2020-01-23 07:16:55+11:00             3  ...         0         2

[3 rows x 10 columns]
Group hour_code_1 dataframe:
                    Startdtm  HostName_key  ...  kmeans_2  kmeans_3
3  2020-01-15 07:00:06+11:00             4  ...         1         1
4  2020-01-15 08:16:01+11:00             1  ...         0         0
5  2020-01-16 08:22:39+11:00             2  ...         0         2

[3 rows x 10 columns]
Group hour_code_2 dataframe:
                    Startdtm  HostName_key  ...  kmeans_2  kmeans_3
6  2020-01-14 08:22:07+11:00             2  ...         1         2
7  2020-01-25 07:16:55+11:00             4  ...         0         0
8  2020-01-17 07:00:06+11:00             1  ...         1         1
9  2020-01-18 08:16:01+11:00             1  ...         1         1

[4 rows x 10 columns]

Update: second data

This time things are done a little differently, because the goal of the update is to understand, for each place, how many vehicles passed and at what speed.

Again, everything is written very generically, to make it easy to adapt.
  • First, we divide the dataset into groups by the location inferred from the hostname (parameterised as dividing_colum).
def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for g, d in df.groupby(dividing_colum)}
    return df_by_groups
  • Now, we arrange the data for each hostname (dividing_colum) group:
def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group
2.1. We group by 15-minute intervals; after each hostname area's data has been divided into time intervals, we aggregate the vehicle count into the volume column and the mean speed into the average_speed column.
def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby(pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW))
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df
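As an aside, the same aggregation can also be written with resample (a sketch; both forms bin each record into the interval it falls in):

per_interval = (df.set_index(DATE_COLUMN_NAME)
                  .resample(INTERVAL_WINDOW)
                  .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                  .rename(columns={'Vehicle_speed': 'average_speed',
                                   'Hostname': 'volume'}))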
The net result of stage 2 is that each hostname's data is divided into 15-minute windows, and we know how many vehicles passed in each window and what their average speed was.

In this way we address the goal:

    An additional, and greedy question, would be - how would i interpolate speed into this measurement? i.e., large amounts of volumes, but low speeds, would be good to also cater for.
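One simple way to fold speed into the volume measurement (a sketch only; the congestion column and the volume-to-speed ratio are my own illustration, not a fixed recipe):

def add_congestion_indicator(intervaled_df):
    # Illustrative heuristic: high volume combined with low average speed
    # yields a high "congestion" score. The ratio is an assumption; replace
    # it with whatever weighting suits the data.
    intervaled_df['congestion'] = (
        intervaled_df['volume'] / intervaled_df['average_speed'].clip(lower=1))
    return intervaled_df.sort_values('congestion', ascending=False)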


Again, everything is configurable via [TIME_INTERVAL_COLUMN_NAME, DATE_COLUMN_NAME, INTERVAL_WINDOW].
Full code:
import pandas as pd

TIME_INTERVAL_COLUMN_NAME = 'time_interval'

DATE_COLUMN_NAME = 'DateTimeStamp'

INTERVAL_WINDOW = '15Min'

def round_time(df):
    # Make sure the date column is of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round to the interval
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def group_by_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    intervaled_df = (df.groupby(pd.Grouper(key=DATE_COLUMN_NAME, freq=INTERVAL_WINDOW))
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_interval(df)
    return df

def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for g, d in df.groupby(dividing_colum)}
    return df_by_groups

def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')

# Print data
print(df)

# Divide by column
df_by_groups = divide_df_by_column(df)

# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)

# For each hostname group dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)
Example output:
We can now obtain valuable results by measuring, for each individual hostname area, the volume (number of vehicles) and the average speed.
  DateTimeStamp  VS_ID  VS_Summary_Id   Hostname  Vehicle_speed  Lane  Length
0  11/01/2019 8:22      1              1  place_uno             65     2      71
1  11/01/2019 8:23      2              1  place_uno             59     1     375
2  11/01/2019 8:25      3              1  place_uno             59     1     389
3  11/01/2019 8:26      4              1  place_duo             59     1     832
4  11/01/2019 8:40      5              1  place_duo             52     1     409
Group Hostname_place_duo dataframe:
                     average_speed  volume
DateTimeStamp
2019-11-01 08:15:00             59       1
2019-11-01 08:30:00             52       1
Group Hostname_place_uno dataframe:
                     average_speed  volume
DateTimeStamp
2019-11-01 08:15:00             61       3
Appendix
A round_time function was also created, which allows rounding to a time interval without grouping:
def round_time(df):
    # Make sure the date column is of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round to the interval
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)
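For example (a sketch, reusing the constants above), an ordinary groupby on the rounded column yields a per-interval aggregation much like the pd.Grouper version:

round_time(df)
per_interval = (df.groupby(TIME_INTERVAL_COLUMN_NAME)
                  .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                  .rename(columns={'Vehicle_speed': 'average_speed',
                                   'Hostname': 'volume'}))
print(per_interval)

Note the design difference: dt.round snaps each timestamp to the nearest interval boundary, whereas pd.Grouper(freq=...) bins by the interval a timestamp falls into, so records near a boundary can land in different windows under the two approaches.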
Third update
So this time we want to reduce the number of rows in the result.
  • We change the way the data is grouped: not only by time interval, but also by day of the week. The result lets us investigate the traffic behaviour of each day of the week at a 15-minute interval resolution. The group_by_interval function now groups by the concise interval, so it is renamed group_by_concised_interval.

  • We call the combination of [day-in-week, hour-minute] a "concise interval", which again is configurable via CONCISE_INTERVAL_FORMAT.
def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round the times
    round_time(df)
    # Add the concise interval
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby(TIME_INTERVAL_CONCISE_COLUMN_NAME)
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df
1.1. group_by_concised_interval first rounds times to the given 15-minute interval (configurable via INTERVAL_WINDOW) using the round_time method.
1.2. After the time interval has been created for each date, we apply the add_consice_interval_columns function, which extracts the concise form from the rounded timestamps.
def add_consice_interval_columns(df):
    # Add a column for the time interval at day-in-week / hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(
        lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))
The full code is:
import pandas as pd

TIME_INTERVAL_COLUMN_NAME = 'time_interval'

TIME_INTERVAL_CONCISE_COLUMN_NAME = 'time_interval_concise'

DATE_COLUMN_NAME = 'DateTimeStamp'

INTERVAL_WINDOW = '15Min'

CONCISE_INTERVAL_FORMAT = '%A %H:%M'

def round_time(df):
    # Make sure the date column is of datetime type
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round to the interval
    df[TIME_INTERVAL_COLUMN_NAME] = df[DATE_COLUMN_NAME].dt.round(INTERVAL_WINDOW)

def add_consice_interval_columns(df):
    # Add a column for the time interval at day-in-week / hour-minute resolution
    df[TIME_INTERVAL_CONCISE_COLUMN_NAME] = df[TIME_INTERVAL_COLUMN_NAME].apply(
        lambda x: x.strftime(CONCISE_INTERVAL_FORMAT))

def group_by_concised_interval(df):
    df[DATE_COLUMN_NAME] = pd.to_datetime(df[DATE_COLUMN_NAME])
    # Round the times
    round_time(df)
    # Add the concise interval
    add_consice_interval_columns(df)
    intervaled_df = (df.groupby(TIME_INTERVAL_CONCISE_COLUMN_NAME)
                       .agg({'Vehicle_speed': 'mean', 'Hostname': 'count'})
                       .rename(columns={'Vehicle_speed': 'average_speed',
                                        'Hostname': 'volume'}))
    return intervaled_df

def arrange_data(df):
    df = group_by_concised_interval(df)
    return df

def divide_df_by_column(df, dividing_colum='Hostname'):
    df_by_groups = {f'{dividing_colum}_{g}': d for g, d in df.groupby(dividing_colum)}
    return df_by_groups

def arrange_groups_df(lst_df_by_groups):
    df_by_intervaled_group = dict()
    # For each group dataframe
    for group_df_name, group_df in lst_df_by_groups.items():
        df_by_intervaled_group[group_df_name] = arrange_data(group_df)
    return df_by_intervaled_group

# Load data
df = pd.read_csv('data2.csv')

# Print data
print(df)

# Divide by column
df_by_groups = divide_df_by_column(df)

# Arrange data for each group
df_by_intervaled_group = arrange_groups_df(df_by_groups)

# For each hostname group dataframe
for group_df_name, intervaled_group_df in df_by_intervaled_group.items():
    print(f'Group {group_df_name} dataframe:')
    print(intervaled_group_df)
Output:
Group Hostname_place_duo dataframe:
                       average_speed  volume
time_interval_concise
Friday 08:30                      59       1
Friday 08:45                      52       1
Group Hostname_place_uno dataframe:
                       average_speed  volume
time_interval_concise
Friday 08:15                      65       1
Friday 08:30                      59       2
So now we can easily find out the traffic behaviour of each day of the week across all the available time intervals.
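Finally, to tie this back to the clustering question: one could run k-means directly on the (volume, average_speed) pairs produced above. This is a sketch under my own assumptions (feature scaling, cluster count), not part of the pipeline above:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_intervals(intervaled_df, n_clusters=4):
    # Scale the features so that volume and speed contribute comparably.
    features = StandardScaler().fit_transform(
        intervaled_df[['volume', 'average_speed']].fillna(0))
    intervaled_df[f'kmeans_{n_clusters}'] = KMeans(
        n_clusters=n_clusters, random_state=9).fit_predict(features)
    return intervaled_df

Applying this to each entry of df_by_intervaled_group labels every 15-minute window, so high-volume/low-speed windows naturally fall into their own cluster.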

Regarding python - How to group a dataframe by 4 time periods and a key, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63877261/
