python - Why is filtering a DataFrame by a boolean mask so much faster than apply()?

Reposted by: 太空宇宙 · Updated: 2023-11-03 14:14:04

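I want to compare the performance of two different approaches to filtering a pandas DataFrame. I created a test set of n points in the plane and filtered out every point that does not lie in the square -1 < x < 1, -1 < y < 1. I am surprised that one approach is so much faster than the other, and the larger n becomes, the bigger the difference. What explains this?

Here is my script: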

import numpy as np
import time
import pandas as pd


# Test set with points
n              = 100000
test_x_points  = np.random.uniform(-10, 10, size=n)
test_y_points  = np.random.uniform(-10, 10, size=n)
test_points    = list(zip(test_x_points, test_y_points))  # list, not a one-shot zip iterator, so it can be reused
df             = pd.DataFrame(test_points, columns=['x', 'y'])


# Method a: vectorized boolean mask
start_time     = time.time()
result_a       = df[(df['x'] < 1) & (df['x'] > -1) & (df['y'] < 1) & (df['y'] > -1)]
end_time       = time.time()
elapsed_time_a = 1000 * abs(end_time - start_time)


# Method b: row-wise apply() with a Python lambda
start_time     = time.time()
result_b       = df[df.apply(lambda row: -1 < row['x'] < 1 and -1 < row['y'] < 1, axis=1)]
end_time       = time.time()
elapsed_time_b = 1000 * abs(end_time - start_time)


# Print results
print('For {0} points.'.format(n))
print('Method a took {0} ms and leaves us with {1} elements.'.format(elapsed_time_a, len(result_a)))
print('Method b took {0} ms and leaves us with {1} elements.'.format(elapsed_time_b, len(result_b)))
print('Method a is {0} X faster than method b.'.format(elapsed_time_b / elapsed_time_a))

Results for different values of n:

For 10 points.
Method a took 1.52087211609 ms and leaves us with 0 elements.
Method b took 0.456809997559 ms and leaves us with 0 elements.
Method a is 0.300360558081 X faster than method b.

For 100 points.
Method a took 1.55997276306 ms and leaves us with 1 elements.
Method b took 1.384973526 ms and leaves us with 1 elements.
Method a is 0.887819043252 X faster than method b.

For 1000 points.
Method a took 1.61004066467 ms and leaves us with 5 elements.
Method b took 10.448217392 ms and leaves us with 5 elements.
Method a is 6.48941211313 X faster than method b.

For 10000 points.
Method a took 1.59096717834 ms and leaves us with 115 elements.
Method b took 98.8278388977 ms and leaves us with 115 elements.
Method a is 62.1180878166 X faster than method b.

For 100000 points.
Method a took 2.14099884033 ms and leaves us with 1052 elements.
Method b took 995.483875275 ms and leaves us with 1052 elements.
Method a is 464.962360802 X faster than method b.

For 1000000 points.
Method a took 7.07101821899 ms and leaves us with 10045 elements.
Method b took 9613.26599121 ms and leaves us with 10045 elements.
Method a is 1359.5306494 X faster than method b.

When I compare it with a native Python list comprehension, method a is still much faster:

result_c = [ (x, y) for (x, y) in test_points if -1 < x < 1 and -1 < y < 1 ]

Why is that?

Best answer

If you follow the Pandas source code for apply, you will see that, in general, it ends up running a Python for __ in __ loop.
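To make that concrete, a row-wise apply behaves roughly like the loop below. This is a simplified sketch, not pandas' actual implementation: `apply_rowwise` is a hypothetical helper introduced only for illustration, and the data is a small random sample rather than the test set from the question.

```python
import numpy as np
import pandas as pd


def apply_rowwise(df, func):
    """Hypothetical stand-in for df.apply(func, axis=1).

    It makes one Python-level function call per row, and each row
    is first materialized as its own pandas Series.
    """
    results = []
    for _, row in df.iterrows():
        results.append(func(row))
    return pd.Series(results, index=df.index)


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.uniform(-10, 10, 1000),
    "y": rng.uniform(-10, 10, 1000),
})


def inside(row):
    # Same predicate as method b in the question
    return -1 < row["x"] < 1 and -1 < row["y"] < 1


loop_mask = apply_rowwise(df, inside)
apply_mask = df.apply(inside, axis=1)

# Both approaches select exactly the same rows
same = bool((loop_mask == apply_mask).all())
print(same)
```

The per-row cost (building a Series, calling the lambda, boxing the result) is paid n times, which is consistent with method b's time growing roughly linearly with n in the measurements above.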

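However, pandas DataFrames are made up of pandas Series, and under the hood a Series is backed by a numpy array. Mask filtering uses the fast, vectorized operations that numpy arrays allow. For why this is faster than a plain Python loop (as in .apply), see Why are NumPy arrays so fast?

The top answer there: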
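As an illustration of what the mask amounts to, the same filter can be written directly against the underlying numpy arrays. This is a sketch under assumptions: the column data is pulled out with `to_numpy()`, and the random data is illustrative rather than the test set from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "x": rng.uniform(-10, 10, n),
    "y": rng.uniform(-10, 10, n),
})

# Pull out the float64 arrays that back the two columns
x = df["x"].to_numpy()
y = df["y"].to_numpy()

# Each comparison is a single pass over a whole array in compiled code;
# & combines the resulting boolean arrays element-wise
mask = (x > -1) & (x < 1) & (y > -1) & (y < 1)

result_numpy = df[mask]
result_pandas = df[(df["x"] < 1) & (df["x"] > -1) & (df["y"] < 1) & (df["y"] > -1)]

print(result_numpy.equals(result_pandas))
```

Building the mask takes a handful of whole-array operations regardless of n, so the Python interpreter is involved only a constant number of times instead of once per row.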

Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.

Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.
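The storage difference described in the quote can be inspected directly. The snippet below is a small sketch; the `sys.getsizeof` figure is specific to CPython and is shown only to indicate that each list element is a separate object.

```python
import sys

import numpy as np

n = 1000
arr = np.arange(n, dtype=np.float64)  # one contiguous buffer of 8-byte floats
lst = arr.tolist()                    # a list of references to n separate float objects

array_bytes = arr.nbytes                  # exactly n * itemsize
element_types = {type(v) for v in lst}    # every element is a full Python float object
per_object_bytes = sys.getsizeof(lst[0])  # CPython-specific size of one float object

print(arr.dtype, arr.itemsize, array_bytes)
print(element_types, per_object_bytes)
```

A loop over `lst` has to follow a reference and check the type of every element, while an operation on `arr` can run over the raw buffer in C.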

Regarding "python - Why is filtering a DataFrame by a boolean mask so much faster than apply()?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48318858/
