
python - Why is filtering a DataFrame by a boolean mask so much faster than apply()?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:14:04

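I want to compare the performance of two different ways of filtering a pandas DataFrame. To do so, I created a test set of n points in the plane and filtered out every point that does not lie inside the square -1 < x < 1, -1 < y < 1. I was surprised that one method is so much faster than the other, and the larger n gets, the bigger the difference. What explains this?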

Here is my script:

import numpy as np
import time
import pandas as pd


# Test set with points
n = 100000
test_x_points = np.random.uniform(-10, 10, size=n)
test_y_points = np.random.uniform(-10, 10, size=n)
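# Materialize the pairs: in Python 3, zip() is a one-shot iterator, and
# test_points is reused later for the list comprehension comparison.
test_points = list(zip(test_x_points, test_y_points))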
df = pd.DataFrame(test_points, columns=['x', 'y'])


# Method a
start_time = time.time()
result_a = df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]
end_time = time.time()
elapsed_time_a = 1000 * abs(end_time - start_time)


# Method b
start_time = time.time()
result_b = df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]
end_time = time.time()
elapsed_time_b = 1000 * abs(end_time - start_time)


# print results
print('For {0} points.'.format(n))
print('Method a took {0} ms and leaves us with {1} elements.'.format(elapsed_time_a, len(result_a)))
print('Method b took {0} ms and leaves us with {1} elements.'.format(elapsed_time_b, len(result_b)))
print('Method a is {0} X faster than method b.'.format(elapsed_time_b / elapsed_time_a))

Results for different values of n:

For 10 points.
Method a took 1.52087211609 ms and leaves us with 0 elements.
Method b took 0.456809997559 ms and leaves us with 0 elements.
Method a is 0.300360558081 X faster than method b.

For 100 points.
Method a took 1.55997276306 ms and leaves us with 1 elements.
Method b took 1.384973526 ms and leaves us with 1 elements.
Method a is 0.887819043252 X faster than method b.

For 1000 points.
Method a took 1.61004066467 ms and leaves us with 5 elements.
Method b took 10.448217392 ms and leaves us with 5 elements.
Method a is 6.48941211313 X faster than method b.

For 10000 points.
Method a took 1.59096717834 ms and leaves us with 115 elements.
Method b took 98.8278388977 ms and leaves us with 115 elements.
Method a is 62.1180878166 X faster than method b.

For 100000 points.
Method a took 2.14099884033 ms and leaves us with 1052 elements.
Method b took 995.483875275 ms and leaves us with 1052 elements.
Method a is 464.962360802 X faster than method b.

For 1000000 points.
Method a took 7.07101821899 ms and leaves us with 10045 elements.
Method b took 9613.26599121 ms and leaves us with 10045 elements.
Method a is 1359.5306494 X faster than method b.

When I compare it with a native Python list comprehension, method a is still much faster:

result_c = [ (x, y) for (x, y) in test_points if -1 < x < 1 and -1 < y < 1 ]

Why is that?

Best answer

If you follow the Pandas source code for apply, you will see that, in general, it ends up executing a Python for __ in __ loop.
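To make the loop visible, here is a minimal sketch of what a row-wise apply amounts to conceptually. It is not the actual pandas implementation; the helper name `in_square` and the hand-written loop over `iterrows()` are illustrative assumptions. Both versions call the Python function once per row, which is where the time goes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(-10, 10, 1000),
                   "y": rng.uniform(-10, 10, 1000)})

def in_square(row):
    # Hypothetical helper: the same test as the lambda in method b.
    return -1 < row["x"] < 1 and -1 < row["y"] < 1

# Hand-written equivalent: one Python-level function call per row.
loop_mask = pd.Series([in_square(row) for _, row in df.iterrows()],
                      index=df.index)

# Row-wise apply: also one Python-level function call per row.
apply_mask = df.apply(in_square, axis=1)

# Both produce the same boolean mask.
masks_agree = bool((loop_mask.to_numpy() == apply_mask.to_numpy()).all())
print(masks_agree)
```

Each call also receives the row as a freshly built Series, so the per-row overhead is considerably larger than the two comparisons themselves.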

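However, Pandas DataFrames are made up of Pandas Series, and Pandas Series are backed by numpy arrays under the hood. Mask filtering uses the fast vectorized operations that numpy arrays allow. For why this is faster than running an ordinary Python loop (as .apply does), see Why are NumPy arrays so fast?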

The top answer there:

Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.

Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.
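The difference can be checked with a small benchmark. The following is a sketch rather than a definitive measurement: the function names `by_mask`, `by_apply`, and `by_numpy` are made up for illustration, n is kept at 20,000 so that the apply version finishes quickly, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 20_000  # assumption: small enough that apply() finishes quickly
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(-10, 10, n),
                   "y": rng.uniform(-10, 10, n)})

def by_mask(frame):
    # Each comparison yields a whole boolean Series in one vectorized step;
    # & combines them element-wise.
    return frame[(frame["x"] > -1) & (frame["x"] < 1)
                 & (frame["y"] > -1) & (frame["y"] < 1)]

def by_apply(frame):
    # Calls the lambda once per row from Python.
    return frame[frame.apply(
        lambda row: -1 < row["x"] < 1 and -1 < row["y"] < 1, axis=1)]

def by_numpy(frame):
    # Same test on the underlying numpy arrays; |x| < 1 is -1 < x < 1.
    x = frame["x"].to_numpy()
    y = frame["y"].to_numpy()
    return frame[(np.abs(x) < 1) & (np.abs(y) < 1)]

result_mask = by_mask(df)
result_apply = by_apply(df)
result_numpy = by_numpy(df)
same_rows = (result_mask.equals(result_apply)
             and result_mask.equals(result_numpy))
print("all methods select the same rows:", same_rows)

for name, func in [("mask", by_mask), ("apply", by_apply), ("numpy", by_numpy)]:
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(lambda: func(df), number=1, repeat=3))
    print(f"{name:>5}: {seconds * 1000:.2f} ms")
```

All three functions select the same rows; only the place where the loop runs differs. In `by_mask` and `by_numpy` the loop over elements happens inside compiled code, while `by_apply` pays Python function-call overhead for every row.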

Regarding "python - Why is filtering a DataFrame by a boolean mask so much faster than apply()?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48318858/
