c++ - std::vector operations are slow on some systems

Reposted by: 行者123 Updated: 2023-11-28 04:35:08

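As a personal project, I am developing a simple 2D game engine in C++ with real-time collision physics. My collisions are handled by calculating the time to collision between each unique pair of objects. To store these collision times, I built my own contiguous 2D matrix class on top of std::vector<float>.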
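For readers who want something concrete, here is a minimal sketch of what such a contiguous matrix might look like. The class and method names follow the gprof output further down, but the row-major layout, the std::size_t indices, the private member names and setElementValue are assumptions made for illustration. The actual class is not shown in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a row-major 2D matrix stored in one contiguous
// std::vector<float>. The layout and member names are assumptions.
class Matrix2d {
public:
    Matrix2d(std::size_t rows, std::size_t cols, float fill = 0.0f)
        : rows_(rows), cols_(cols), matrix_(rows * cols, fill) {}

    // Element (row, col) lives at index row * cols_ + col.
    float getElementValue(std::size_t row, std::size_t col) const {
        assert(row < rows_ && col < cols_);
        return matrix_[row * cols_ + col];
    }

    // Hypothetical setter, added only so the sketch is usable on its own.
    void setElementValue(std::size_t row, std::size_t col, float value) {
        assert(row < rows_ && col < cols_);
        matrix_[row * cols_ + col] = value;
    }

    // Add the same constant to every stored element.
    void addConstValue(float value) {
        for (float& element : matrix_)
            element += value;
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<float> matrix_;
};
```

Because all elements sit in one buffer, an operation such as addConstValue is a single linear pass over memory.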

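Part of my main physics loop adds a constant value to every element of the collision matrix, via a function called Matrix2d::addConstValue(float). For some reason, on some systems gprof reports this function as using most of the CPU time, and on those systems the program generally runs much slower than on the others. For example, on one system a large number of simultaneous collisions causes the frame rate to drop. On a worse-affected system, the exact same set of collisions can push the frame rate into single digits and significantly slow down the simulation.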

These are the systems I have run the program on:

PC 1:

OS: Windows 7
CPU: AMD Phenom II x4 960T
GPU: AMD Radeon HD6850
RAM: 8GB
Program performance: Good

PC2:

OS: Windows 10
CPU: Intel i5 2500K
GPU: AMD Radeon HD7970
RAM: 8GB
Program Performance: Bad

PC3 (laptop):

OS: Windows 10 + Xubuntu 16.04 (Dual boot)
CPU: Intel i5 5600u
GPU: Intel HD5000
RAM: 12GB
Program Performance: Good in Xubuntu, bad in Windows 10

PC4:

OS: Windows 10
CPU: AMD FX-6300
GPU: nVidia GTX 970
RAM: 8GB
Program Performance: Good

I expected PC2 to outperform PC1, but PC2 reports far higher CPU usage because of calls to the matrix function mentioned above. Below are the gprof results for PC1 and PC2:

PC1:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 14.44      0.66     0.66 81222460     0.00     0.00  Ball::getDistance(Ball&)
 12.47      1.23     0.57 319194829     0.00     0.00  sfVectorMath::dot(sf::Vector2<float>, sf::Vector2<float>)
 12.47      1.80     0.57 55453088     0.00     0.00  Collisions::timeToCollision(Ball&, Ball&)
 11.16      2.31     0.51 81222460     0.00     0.00  Ball::getGPE(Ball&)
  6.78      2.62     0.31 153865899     0.00     0.00  Matrix2d::getElementValue(int, int)

PC2:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 77.83     23.49    23.49     8332     0.00     0.00  Matrix2d::addConstValue(float)
  7.59     25.78     2.29                             _mcount_private
  4.67     27.19     1.41 40603954     0.00     0.00  Collisions::timeToCollision(Ball&, Ball&)
  1.29     27.58     0.39                             pow
  1.19     27.94     0.36    11466     0.00     0.00  Matrix2d::getMatrixMin()
  0.99     28.24     0.30 206105049     0.00     0.00  sfVectorMath::dot(sf::Vector2<float>, sf::Vector2<float>)
  0.93     28.52     0.28                             internal_modf
  0.83     28.77     0.25 122492898     0.00     0.00  Matrix2d::getElementValue(int, int)

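I really have no idea what is going on. Some other details: both the Linux and Windows builds were compiled with GCC 6.1.0 and SFML 2.4.2. Compiling natively on Windows 10 made no difference to performance.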

Edit: also, the implementation of addConstValue:

void Matrix2d::addConstValue(float value)
{
    for(unsigned int i=0; i<matrix.size(); ++i)
        matrix.at(i) += value;
}

Best answer

TL;DR: Don't store NaNs in a vector, and certainly don't try to read them! Also try to avoid performing operations on NaNs, just in case.

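I tested the performance of the matrix class by setting up a 242*242 matrix filled with either zeros or std::numeric_limits<float>::quiet_NaN(), then calling addConstValue(float) on it. The table below shows the average time per call. 50000 calls were made when the matrix was filled with zeros, and 500 calls when it was filled with NaNs: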

W10 2500k, filled with zeros: 34.54µs
W10 2500k, filled with NaNs: 6121.64µs
W7 960T, filled with zeros: 52.73µs
W7 960T, filled with NaNs: 62.4µs
W10 i5 5600u, filled with zeros: 27.50µs
W10 i5 5600u, filled with NaNs: 7062.63µs
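The measurement above could be reproduced with something along these lines. This is only a sketch: the helper names averageAddMicros and nanFilledAverageMicros, the added constant -1.0f and the use of std::chrono::steady_clock are assumptions, and a plain std::vector<float> stands in for the matrix class.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Average time in microseconds for one pass that adds a constant to every
// element of a 242*242 buffer initially filled with `fill`.
double averageAddMicros(float fill, int calls) {
    assert(calls > 0);
    std::vector<float> matrix(static_cast<std::size_t>(242) * 242, fill);

    const auto start = std::chrono::steady_clock::now();
    for (int call = 0; call < calls; ++call)
        for (float& element : matrix)
            element += -1.0f;
    const auto stop = std::chrono::steady_clock::now();

    // Read one element through a volatile to discourage the compiler
    // from discarding the loop as unused work.
    volatile float sink = matrix[0];
    (void)sink;

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / calls;
}

// Same measurement with the buffer filled with quiet NaNs.
double nanFilledAverageMicros(int calls) {
    return averageAddMicros(std::numeric_limits<float>::quiet_NaN(), calls);
}
```

Calling averageAddMicros(0.0f, 50000) and nanFilledAverageMicros(500) would mirror the call counts used for the table, though absolute numbers will depend on the compiler, its flags and the machine.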

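So it is clear that operating on NaNs is roughly 200 times slower on PC 2 and PC 3. Strangely, this bottleneck does not exist on the AMD machine. I then added a quick check inside addConstValue(float), using std::isnan(), to see whether each vector element is NaN. Here are the resulting execution times per call: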

W10 2500k, filled with zeros: 70.05µs
W10 2500k, filled with NaNs: 70.05µs
W10 i5 5600u, filled with zeros: 93.75µs
W10 i5 5600u, filled with NaNs: 62.50µs

This at least doubles the execution time for a matrix filled with zeros, but drastically reduces it for a matrix filled with NaNs.
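The answer does not show the modified function, so the following is only a guess at what the std::isnan() check might look like. The free function name addConstValueSkippingNaN is hypothetical, and it operates on a plain std::vector<float> rather than on the matrix class.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Add `value` to every element except NaNs, which are left untouched so
// that no addition is ever performed on them.
void addConstValueSkippingNaN(std::vector<float>& matrix, float value) {
    assert(!std::isnan(value));  // adding NaN would turn every element into NaN
    for (float& element : matrix) {
        if (!std::isnan(element))
            element += value;
    }
}
```

The extra branch per element is the likely source of the slowdown seen for the zero-filled matrix, which is the trade-off the timings above describe.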

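To reduce the problem further, I set up two loops of 10 million iterations each: one adds a constant float to a bare NaN, and the other adds it to the single NaN element of a std::vector. Here is the program: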

#include <iostream>
#include <limits>
#include <chrono>
#include <vector>

using namespace std;
using namespace std::chrono;

int main()
{
    float nan = std::numeric_limits<float>::quiet_NaN();
    std::vector<float> nanvec = {nan};

    int noPasses = 10000000;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    for(int i=0; i<noPasses; ++i)
        nan += -1.0f;

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "Bare float NaN: " << duration << " microseconds\n" ;


    t1 = high_resolution_clock::now();

    for(int i=0; i<noPasses; ++i)
        nanvec[0] += -1.0f;

    t2 = high_resolution_clock::now();
    duration = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "Vector NaN: " << duration << " microseconds\n" ;

    return 0;
}

My output (W10, i5 2500k):

Bare float NaN: 0 microseconds
Vector NaN: 1122833 microseconds

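So it looks as though the CPU knows to ignore operations on NaN. Could retrieving the NaN from a container be what causes such a long execution time? I also still don't know why this problem only occurs on some systems.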
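One possible reading of the 0 microseconds figure is that the optimizer removed the first loop entirely, since the local variable is never read afterwards. That is an assumption, and the source does not confirm it. A sketch of a fairer comparison would force the bare float to be loaded and stored on every iteration. The function names timedAddMicros and timedNaNAddMicros are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Time `passes` additions on a float the compiler must really load and store.
// `volatile` is used here only to stop the loop from being optimized away.
long long timedAddMicros(float start, int passes) {
    assert(passes >= 0);
    volatile float value = start;

    const auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < passes; ++i)
        value = value + -1.0f;
    const auto t2 = std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
}

// Convenience wrapper: the same measurement starting from a quiet NaN.
long long timedNaNAddMicros(int passes) {
    return timedAddMicros(std::numeric_limits<float>::quiet_NaN(), passes);
}
```

Comparing timedAddMicros(0.0f, 10000000) with timedNaNAddMicros(10000000) on an affected machine would help show whether the slowdown comes from the NaN arithmetic itself rather than from reading out of the std::vector.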

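In any case, I incorporated the quick-fix NaN check into my game engine, and the speed improvement is incredible. There is no longer any bottleneck associated with pulling NaNs out of the vector (verified with gprof). I may try to find a way to avoid the check, to get back the extra 50% performance on each call.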

Regarding c++ - std::vector operations are slow on some systems, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51686596/
