python - Generating retention cohorts from a Pandas DataFrame

Reposted by: 太空狗 | Updated: 2023-10-29 22:05:36


I have a Pandas dataframe that looks like this:

+-----------+------------------+---------------+------------+
| AccountID | RegistrationWeek | Weekly_Visits | Visit_Week |
+-----------+------------------+---------------+------------+
| ACC1      | 2015-01-25       |             0 | NaT        |
| ACC2      | 2015-01-11       |             0 | NaT        |
| ACC3      | 2015-01-18       |             0 | NaT        |
| ACC4      | 2014-12-21       |            14 | 2015-02-12 |
| ACC5      | 2014-12-21       |             5 | 2015-02-15 |
| ACC6      | 2014-12-21       |             0 | 2015-02-22 |
+-----------+------------------+---------------+------------+

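It is essentially a visit log, in that it holds all the data needed to build a cohort analysis.

Each registration week is a cohort. To find out how many people belong to each cohort, I can use: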

visit_log.groupby('RegistrationWeek').AccountID.nunique()

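What I want to do is create a pivot table indexed by registration week. The columns should be the visit weeks, and the values should be the count of unique account IDs with more than 0 weekly visits.

Together with the total number of accounts in each cohort, I will then be able to show percentages instead of absolute values.

The final product would look something like this: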

+-------------------+-------------+-------------+-------------+
| Registration Week | Visit_Week1 | Visit_Week2 | Visit_Week3 |
+-------------------+-------------+-------------+-------------+
| week1             | 70%         | 30%         | 20%         |
| week2             | 70%         | 30%         |             |
| week3             | 40%         |             |             |
+-------------------+-------------+-------------+-------------+

I tried pivoting the dataframe like this:

visit_log.pivot_table(index='RegistrationWeek', columns='Visit_Week')

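But I haven't nailed down the values part. I need to somehow count the account IDs and divide that count by the registration-week aggregate above.

I am new to pandas, so if this isn't the best way to build retention cohorts, please enlighten me!

Thanks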

Best answer

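There are several aspects to your question.

What you can build with the data you have

There are several kinds of retention. For simplicity, we will mention only two:

1. Day-N retention: if a user registered on day 0, did she log in on day N? (Logging in on day N+1 does not affect this metric.) To measure it, you need to keep track of all of a user's logs.
2. Rolling retention: if a user registered on day 0, did she log in on day N or on any day after that? (Logging in on day N+1 does affect this metric.) To measure it, you only need the user's most recent log.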
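
To make the difference between the two definitions concrete, here is a minimal sketch on a made-up activity log. The `logs` frame, its `user` and `day` columns, and both helper functions are hypothetical names chosen for illustration; they are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy log: one row per (user, day of activity),
# with days counted from each user's signup (day 0).
logs = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b", "c"],
    "day":  [0,   1,   7,   0,   3,   0],
})

def day_n_retention(logs, n):
    # Share of users with activity on exactly day n (needs the full log).
    n_users = logs["user"].nunique()
    return logs.loc[logs["day"] == n, "user"].nunique() / n_users

def rolling_retention(logs, n):
    # Share of users whose latest activity falls on day n or later
    # (only the last log per user is needed).
    last_day = logs.groupby("user")["day"].max()
    return (last_day >= n).mean()

print(day_n_retention(logs, 3))    # only user "b" was active on day 3
print(rolling_retention(logs, 3))  # users "a" and "b" were last seen on day 3 or later
```

User "a" counts towards rolling retention at day 3 because of the later visit on day 7, but not towards day-3 retention.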

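If I understand your table correctly, you have two relevant variables for building your cohort table: the registration date and the last log (the visit week). The number of weekly visits seems irrelevant.

Therefore, you can only go with option 2, rolling retention.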
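
As a rough sketch of how the question's table maps onto those two variables, the visit log can be collapsed to one row per account. The frame below copies the sample rows from the question; the name `users_from_log` and the choice to take the latest `Visit_Week` regardless of `Weekly_Visits` are assumptions made for illustration.

```python
import pandas as pd

# Sample rows copied from the table in the question.
visit_log = pd.DataFrame({
    "AccountID": ["ACC1", "ACC2", "ACC3", "ACC4", "ACC5", "ACC6"],
    "RegistrationWeek": pd.to_datetime(["2015-01-25", "2015-01-11", "2015-01-18",
                                        "2014-12-21", "2014-12-21", "2014-12-21"]),
    "Weekly_Visits": [0, 0, 0, 14, 5, 0],
    "Visit_Week": pd.to_datetime([None, None, None,
                                  "2015-02-12", "2015-02-15", "2015-02-22"]),
})

# One row per account: registration date and latest visit week
# (NaT when the account was never seen again).
users_from_log = visit_log.groupby("AccountID").agg(
    signup_date=("RegistrationWeek", "min"),
    last_log=("Visit_Week", "max"),
)
print(users_from_log)
```

If rows with `Weekly_Visits` equal to 0 should not count as a visit, they would need to be filtered out before the aggregation.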

How to build the table

First, let's build a dummy dataset so that we have enough data to work with and you can reproduce it:

import datetime as dt
import math

import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta

np.random.seed(0)  # makes the random draws reproducible (the dates still depend on today's date)

def random_date(start, end, p=None):
    # Return a date between start and end; p in [0, 1] sets the position
    # within the interval and is drawn at random when omitted
    if p is None:
        p = np.random.random()
    return start + dt.timedelta(seconds=math.ceil(p * (end - start).days * 24 * 3600))

n_samples = 1000  # How many users do we want?
index = range(1, n_samples + 1)

# A range of signup dates, say, one year.
end = dt.datetime.today()
start = end - relativedelta(years=1)

# Create the dataframe
users = pd.DataFrame(np.random.rand(n_samples),
                     index=index, columns=['signup_date'])
users['signup_date'] = users['signup_date'].apply(lambda x: random_date(start, end, x))
# Last logs randomly distributed within 10 weeks of signing up, so that we can see the retention drop in our table
users['last_log'] = users['signup_date'].apply(lambda x: random_date(x, x + relativedelta(weeks=10)))
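
A similar dummy dataset can also be drawn without row-wise `apply`. The following is only a sketch: the fixed `end` date, the `users_vec` name and the use of `np.random.default_rng` are assumptions that differ from the code above, so the generated dates will not match it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)      # seeded generator, so the draw is reproducible
n_samples = 1000
end = pd.Timestamp("2015-03-01")    # hypothetical fixed end date instead of today()
start = end - pd.DateOffset(years=1)

# Signup dates: uniform over the year.
span_seconds = (end - start).total_seconds()
signup = start + pd.to_timedelta(rng.random(n_samples) * span_seconds, unit="s")
# Last logs: uniform within 10 weeks (70 days) after signup.
last_log = signup + pd.to_timedelta(rng.random(n_samples) * 70 * 24 * 3600, unit="s")

users_vec = pd.DataFrame({"signup_date": signup, "last_log": last_log})
users_vec.index = range(1, n_samples + 1)  # same 1-based index as above
print(users_vec.head())
```

Using a fixed end date makes the whole dataset reproducible, not just the random draws.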

So now we should have something like this:

users.head()

Here is some code to build the cohort table:

### Some useful functions
def add_weeks(sourcedate, weeks):
    return sourcedate + dt.timedelta(days=7 * weeks)

def first_day_of_week(sourcedate):
    # Monday of the week containing sourcedate
    return sourcedate - dt.timedelta(days=sourcedate.weekday())

def last_day_of_week(sourcedate):
    # Sunday of the week containing sourcedate
    return sourcedate + dt.timedelta(days=(6 - sourcedate.weekday()))

def retained_in_interval(users, signup_week, n_weeks, end_date):
    '''
        For a given dataframe of users, returns the number of users
        that signed up in the week of signup_week (the cohort)
        and that are retained after n_weeks.
        end_date is only used to avoid filling the bottom right of the table unnecessarily.
    '''
    # Define the span of the given week: Monday 00:00 (inclusive) to the following
    # Monday 00:00 (exclusive). Dropping the time of day avoids a gap between cohorts.
    cohort_start = pd.Timestamp(first_day_of_week(signup_week)).normalize()
    cohort_end = pd.Timestamp(last_day_of_week(signup_week)).normalize() + pd.Timedelta(days=1)
    in_cohort = (users['signup_date'] >= cohort_start) & (users['signup_date'] < cohort_end)
    if n_weeks == 0:
        # If this is our first week, we just take the number of users that signed up in the given period
        return len(users[in_cohort])
    elif pd.to_datetime(add_weeks(cohort_end, n_weeks)) > pd.to_datetime(end_date):
        # If adding n_weeks brings us later than the end date of the table (the bottom right of the table),
        # we return an easily recognizable value (not 0, as that would cause confusion)
        return float("Inf")
    else:
        # Otherwise, we count the users that signed up in the given period
        # and whose last known log is at least n_weeks after their signup (rolling retention)
        retained = users['last_log'] >= users['signup_date'] + pd.Timedelta(weeks=n_weeks)
        return len(users[in_cohort & retained])

With this in place, we can write the actual function:

def cohort_table(users, cohort_number=6, period_number=6, cohort_span='W', end_date=None):
    '''
        For a given dataframe of users, return a cohort table with the following parameters:
        cohort_number : the number of rows of the table
        period_number : the number of periods after signup (the table has period_number + 1 columns)
        cohort_span : the span of time between cohorts (D, W, M);
                      note that the helper functions above assume weekly cohorts ('W')
        end_date : the date after which we stop counting the users
    '''
    # The last column of the table will end today:
    if end_date is None:
        end_date = dt.datetime.today()
    # The index of the dataframe will be a list of dates, one per cohort
    dates = pd.date_range(add_weeks(end_date, -cohort_number), periods=cohort_number, freq=cohort_span)

    cohort = pd.DataFrame({'Sign up': dates})
    # We will compute the number of retained users, column by column
    #      (There is probably a more Pythonic way of doing it)
    for p in range(period_number + 1):
        # Name of the column
        s_p = 'Week ' + str(p)
        cohort[s_p] = cohort.apply(lambda row: retained_in_interval(users, row['Sign up'], p, end_date), axis=1)

    cohort = cohort.set_index('Sign up')
    # Absolute values to fractions, by dividing each row by its value in week 0
    # (cells beyond end_date stay at inf):
    cohort = cohort.astype('float').div(cohort['Week 0'].astype('float'), axis='index')
    return cohort

Now you can call it and look at the result:

cohort_table(users)
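
A comment in `cohort_table` suggests a more idiomatic approach may exist. One possible vectorized variant is sketched below under stated assumptions: `rolling_cohort_table` and its parameters are hypothetical names, cohorts are labelled by the Monday of the signup week, and cells whose period has not fully elapsed are left as NaN rather than inf, so the output is not identical to that of `cohort_table`.

```python
import pandas as pd

def rolling_cohort_table(users, n_periods=6, end_date=None):
    # users: DataFrame with datetime columns 'signup_date' and 'last_log'.
    if end_date is None:
        end_date = users["last_log"].max()
    end_date = pd.Timestamp(end_date)
    # Monday (at midnight) of each user's signup week defines the cohort.
    signup_day = users["signup_date"].dt.normalize()
    cohort = signup_day - pd.to_timedelta(users["signup_date"].dt.weekday, unit="D")
    # Whole weeks between signup and the last known log.
    weeks_active = (users["last_log"] - users["signup_date"]).dt.days // 7
    sizes = cohort.groupby(cohort).size()
    table = pd.DataFrame(index=sizes.index)
    table.index.name = "cohort"
    for p in range(n_periods + 1):
        # Rolling retention: last log at least p whole weeks after signup.
        share = (weeks_active >= p).groupby(cohort).sum() / sizes
        if p > 0:
            # Blank out cells whose period has not fully elapsed yet.
            not_elapsed = sizes.index + pd.Timedelta(weeks=p + 1) > end_date
            share = share.mask(not_elapsed)
        table[f"Week {p}"] = share
    return table

# Tiny hypothetical dataset: two users in the week of 2015-01-05, one in the week after.
demo_users = pd.DataFrame({
    "signup_date": pd.to_datetime(["2015-01-05", "2015-01-07", "2015-01-13"]),
    "last_log": pd.to_datetime(["2015-01-20", "2015-01-08", "2015-01-28"]),
})
table = rolling_cohort_table(demo_users, n_periods=3, end_date="2015-02-02")
print(table)
```

Grouping once per period avoids filtering the whole dataframe again for every cell, which is what the row-wise `apply` above does.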

Hope this helps

Regarding python - Generating retention cohorts from a Pandas DataFrame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28745650/
