python - How to efficiently compute a rolling unique count in a pandas time series?

Reposted. Author: 太空狗. Updated: 2023-10-30 00:16:11

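I have a time series of people visiting a building. Each person has a unique ID. For every record in the time series, I want to know the number of unique people who visited the building in the last 365 days (i.e. a rolling unique count with a window of 365 days).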
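To pin down what is being asked for, here is a minimal brute-force sketch in plain Python. The function name `rolling_nunique_bruteforce`, the toy `visits` list, and the string IDs are hypothetical and not part of the original post. It assumes the records are sorted by date and that the window for a record covers only that record and earlier ones, with the left edge of the window excluded.

```python
from datetime import date, timedelta


def rolling_nunique_bruteforce(records, window_days):
    """Count unique person IDs per record over a trailing window.

    records: list of (date, person_id) tuples, sorted by date.
    The window for record i covers records 0..i whose date is
    strictly later than (date_i - window_days).
    """
    width = timedelta(days=window_days)
    counts = []
    for i, (current, _) in enumerate(records):
        # Collect the distinct IDs seen within the trailing window.
        seen = {pid for day, pid in records[: i + 1] if current - day < width}
        counts.append(len(seen))
    return counts


# Hypothetical toy data: five visits by three people.
visits = [
    (date(2020, 1, 1), "a"),
    (date(2020, 1, 2), "b"),
    (date(2020, 1, 2), "a"),
    (date(2020, 1, 4), "c"),
    (date(2020, 1, 5), "b"),
]
print(rolling_nunique_bruteforce(visits, window_days=3))  # [1, 2, 2, 3, 2]
```

This rescans the whole history for every record, so it only serves as a definition to check faster methods against.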

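pandas does not seem to have a built-in method for this calculation. The calculation becomes computationally intensive when there are many unique visitors and/or a large window. (The actual data is larger than this example.)

Is there a better way to calculate this than what I've done below? I'm also not sure why the fast method I wrote, windowed_nunique (under "Speed test 3"), is off by 1.

Thanks for any help!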

Initialization

In [1]:

# Import libraries.
import pandas as pd
import numba
import numpy as np

In [2]:

# Create data of people visiting a building.

np.random.seed(seed=0)
dates = pd.date_range(start='2010-01-01', end='2015-01-01', freq='D')
window = 365 # days
num_pids = 100
probs = np.linspace(start=0.001, stop=0.1, num=num_pids)

df = pd\
    .DataFrame(
        data=[(date, pid)
              for (pid, prob) in zip(range(num_pids), probs)
              for date in np.compress(np.random.binomial(n=1, p=prob, size=len(dates)), dates)],
        columns=['Date', 'PersonId'])\
    .sort_values(by='Date')\
    .reset_index(drop=True)

print("Created data of people visiting a building:")
df.head() # 9181 rows × 2 columns

Out[2]:

Created data of people visiting a building:

| | Date | PersonId |
|---|------------|----------|
| 0 | 2010-01-01 | 76 |
| 1 | 2010-01-01 | 63 |
| 2 | 2010-01-01 | 89 |
| 3 | 2010-01-01 | 81 |
| 4 | 2010-01-01 | 7 |

Speed reference

In [3]:

%%timeit
# This counts the number of people visiting the building, not the number of unique people.
# Provided as a speed reference.
df.rolling(window='{:d}D'.format(window), on='Date').count()

3.32 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Speed test 1

In [4]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())

2.42 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]:

# Save results as a reference to check calculation accuracy.
ref = df.rolling(window='{:d}D'.format(window), on='Date').apply(lambda arr: pd.Series(arr).nunique())['PersonId'].values

Speed test 2

In [6]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def nunique(arr):
    return len(set(arr))

In [7]:

%%timeit
df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)

430 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]:

# Check accuracy of results.
test = df.rolling(window='{:d}D'.format(window), on='Date').apply(nunique)['PersonId'].values
assert all(ref == test)

Speed test 3

In [9]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
        window (int): Width of window in units of difference of `dates`.

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if `len(dates) != len(pids)`

    Notes:
        * May be off by 1 compared to `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert dates.shape == pids.shape

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids)
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # If person count went from 0 to 1, increment unique person count.
        date = dates[idx]
        pid = pids[idx]
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        while (date - date_min) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts

In [10]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)

In [11]:

%%timeit
windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

107 µs ± 63.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]:

# Check accuracy of results.
test = windowed_nunique(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
# Note: Method may be off by 1.
assert all(np.isclose(ref, np.asarray(test), atol=1))

In [13]:

# Show where the calculation doesn't match.
print("Where reference ('ref') calculation of number of unique people doesn't match 'test':")
df['ref'] = ref
df['test'] = test
df.loc[df['ref'] != df['test']].head() # 9044 rows × 5 columns

Out[13]:

Where reference ('ref') calculation of number of unique people doesn't match 'test':

| | Date | PersonId | DateEpoch | ref | test |
|----|------------|----------|-----------|------|------|
| 78 | 2010-01-19 | 99 | 14628 | 56.0 | 55 |
| 79 | 2010-01-19 | 96 | 14628 | 56.0 | 55 |
| 80 | 2010-01-19 | 88 | 14628 | 56.0 | 55 |
| 81 | 2010-01-20 | 94 | 14629 | 56.0 | 55 |
| 82 | 2010-01-20 | 48 | 14629 | 57.0 | 56 |

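Best answer

I had 2 errors in the fast method windowed_nunique, now corrected in windowed_nunique_corrected below:

  1. The array pid_cts, which tracks how many times each person ID appears within the window, was too small: it needs one slot for every ID from 0 through max(pids), i.e. max(pids) + 1 slots.
  2. Because the leading and trailing edges of the window both fall on whole days, a window of N days covers N integer dates including the current one, so date_min should be advanced while (date - date_min + 1) > window.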
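Both fixes can be illustrated with a small plain-Python sketch. The helper `days_kept` and the toy `pids` list are hypothetical, introduced only for this illustration. It assumes integer day numbers and pandas' default behaviour for offset windows, where the left edge is excluded and the right edge is included. As far as I understand, Numba's nopython mode does not bounds-check array indexing by default, which would explain why the undersized array produced wrong counts rather than an error; treat that as an assumption rather than a confirmed diagnosis.

```python
def days_kept(current_day, window, inclusive):
    """Integer days that survive eviction for a window ending on current_day.

    inclusive=False mimics the original rule:  evict while (date - date_min) > window
    inclusive=True mimics the corrected rule:  evict while (date - date_min + 1) > window
    """
    kept = []
    for day in range(current_day - 2 * window, current_day + 1):
        span = current_day - day + (1 if inclusive else 0)
        if span <= window:  # a day is evicted only while span > window
            kept.append(day)
    return kept


# Fix 2: a 3-day window ending on day 10 should hold days 8, 9, 10.
original_rule = days_kept(10, 3, inclusive=False)
corrected_rule = days_kept(10, 3, inclusive=True)
print(original_rule)   # [7, 8, 9, 10]
print(corrected_rule)  # [8, 9, 10]

# Fix 1: a counter indexed by person ID needs max(pids) + 1 slots.
pids = [0, 2, 5]
too_small = [0] * max(pids)         # valid indices 0..4, so ID 5 does not fit
big_enough = [0] * (max(pids) + 1)  # valid indices 0..5
print(len(too_small), len(big_enough))  # 5 6
```

A plain Python list raises IndexError for `too_small[5]`, which makes the sizing mistake easy to see outside of compiled code.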

In [14]:

# Define a custom function and implement a just-in-time compiler.
@numba.jit(nopython=True)
def windowed_nunique_corrected(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
            Required: min(pids) >= 0
        window (int): Width of window in units of difference of `dates`.
            Required: window >= 1

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if not...
            * len(dates) == len(pids)
            * min(pids) >= 0
            * window >= 1

    Notes:
        * Matches `pandas.core.window.Rolling`
            with a time series alias offset.

    """

    # Check arguments.
    assert len(dates) == len(pids)
    assert np.min(pids) >= 0
    assert window >= 1

    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids) + 1
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1

    # For each (date, person)...
    while idx < idx_max:

        # Lookup date, person.
        date = dates[idx]
        pid = pids[idx]

        # If person count went from 0 to 1, increment unique person count.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1

        # For past dates outside of window...
        # Note: If window=3, it includes day0,day1,day2.
        while (date - date_min + 1) > window:

            # If person count went from 1 to 0, decrement unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]

        # Record unique person count.
        ucts[idx] = uct
        idx += 1

    return ucts

In [15]:

# Cast dates to integers.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01'))/pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)

In [16]:

%%timeit
windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)

98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]:

# Check accuracy of results.
test = windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
assert all(ref == test)
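The corrected function indexes an array by person ID, so it relies on IDs being small non-negative integers. As a hedged variation that is not from the original answer, the same single-pass idea can be sketched in plain Python with a `collections.Counter`, which also accepts string or sparse IDs. The name `windowed_nunique_py` and the toy inputs are hypothetical, and this version is expected to run much slower than the Numba one.

```python
from collections import Counter


def windowed_nunique_py(days, pids, window):
    """One-pass rolling unique count over integer day numbers.

    days: sorted sequence of integer day numbers.
    pids: sequence of hashable person identifiers, same length as days.
    window: window width in days, at least 1; a window ending on day d
        covers days d - window + 1 through d.
    """
    assert len(days) == len(pids)
    assert window >= 1

    counts = Counter()  # occurrences of each ID currently inside the window
    result = []
    start = 0           # index of the oldest record still inside the window
    for day, pid in zip(days, pids):
        counts[pid] += 1
        # Evict records that fell out of the window (same rule as the fix above).
        while day - days[start] + 1 > window:
            old = pids[start]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            start += 1
        # Remaining keys are exactly the distinct IDs inside the window.
        result.append(len(counts))
    return result


# Hypothetical toy data: day numbers and string IDs.
toy_days = [1, 2, 2, 4, 5]
toy_pids = ["a", "b", "a", "c", "b"]
print(windowed_nunique_py(toy_days, toy_pids, window=3))  # [1, 2, 2, 3, 2]
```

Deleting keys whose count reaches zero keeps `len(counts)` equal to the number of distinct IDs in the window, so no separate running total is needed.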

Regarding "python - How to efficiently compute a rolling unique count in a pandas time series?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46470743/
