c++ - Multithreading: why are two programs better than one?

Reposted by: IT老高. Last updated: 2023-10-28 23:21:13

My problem in brief:

I have a computer with 2 AMD Opteron 6272 sockets and 64 GB of RAM.

When I run one multithreaded program on all 32 cores, it is 15% slower than when I run 2 programs, each on one 16-core socket.

How can I make the one-program version as fast as the two-program version?

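More details:

I have a large number of tasks and want to fully load all 32 cores of the system, so I pack the tasks into groups of 1000. One such group needs about 120 MB of input data and takes about 10 seconds to complete on one core. To make the test ideal, I copy these groups 32 times and use ITBB's parallel_for loop to distribute the tasks among the 32 cores.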

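I use pthread_setaffinity_np to make sure the system does not move my threads between cores, and to make sure all cores are used one after another.

I use mlockall(MCL_FUTURE) to make sure the system does not move my memory between sockets.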

So the code looks like this:

  // Requires <pthread.h>, <sched.h>, <sys/mman.h> and tbb/blocked_range.h.
  void operator()(const blocked_range<size_t> &range) const
  {
    for (size_t i = range.begin(); i != range.end(); ++i) {

      // Pin the calling thread to the core assigned to index i.
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(threadNumberToCpuMap[i], &cpuset);
      int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset); // 0 on success

      // Intended to keep pages at the physical address where they were allocated.
      mlockall(MCL_FUTURE);

      // Run this thread's tasks one after another.
      TaskManager manager;
      for (int j = 0; j < fNTasksPerThr; j++) {
        manager.SetData(&(InpData->fInput[j]));
        manager.Run();
      }
    }
  }
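The pinning step can also be checked in isolation. The following is only a minimal sketch, assuming Linux with glibc; the helper names first_allowed_cpu, pin_current_thread and is_pinned_to are hypothetical and do not come from my program. It pins the calling thread to one core and then reads the affinity mask back to confirm that the request took effect.

```cpp
#include <cerrno>
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: lowest-numbered CPU the calling thread may run on, or -1.
int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    }
    return -1;
}

// Hypothetical helper: pin the calling thread to `cpu`.
// Returns 0 on success, otherwise an error number.
int pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return EINVAL;
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

// Hypothetical helper: true if the calling thread's affinity mask
// contains exactly the one CPU `cpu`.
bool is_pinned_to(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t now;
    CPU_ZERO(&now);
    if (pthread_getaffinity_np(pthread_self(), sizeof(now), &now) != 0) return false;
    return CPU_COUNT(&now) == 1 && CPU_ISSET(cpu, &now);
}
```

Calling pin_current_thread(first_allowed_cpu()) at the start of a worker and then testing is_pinned_to with the same value is one way to detect a pinning request that was silently rejected, which the unchecked return value s in the code above would hide.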

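Only the computation time matters to me, so I prepare the input data in a separate parallel_for loop and do not include the preparation time in the time measurement.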

  // Requires <pthread.h>, <sched.h>, <sys/mman.h> and tbb/blocked_range.h.
  void operator()(const blocked_range<size_t> &range) const
  {
    for (size_t i = range.begin(); i != range.end(); ++i) {

      // Pin the calling thread to the core assigned to index i.
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(threadNumberToCpuMap[i], &cpuset);
      int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset); // 0 on success

      // Intended to keep pages at the physical address where they were allocated.
      mlockall(MCL_FUTURE);

      // Allocate this thread's input from the pinned thread, then fill it with a copy.
      InpData[i].fInput = new ProgramInputData[fNTasksPerThr];

      for (int j = 0; j < fNTasksPerThr; j++) {
        InpData[i].fInput[j] = InpDataPerThread.fInput[j];
      }
    }
  }
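The measurement scheme described above (prepare first, then time only the computation) can be sketched roughly as follows. This is an illustration under stated assumptions, not my actual program: it uses std::thread instead of ITBB's parallel_for so that it compiles on its own, and ThreadInput, prepare and timed_compute are hypothetical names. The summation stands in for the real per-task work.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for the input data of one thread.
struct ThreadInput {
    std::vector<double> values;
};

// Phase 1 (not timed): each worker allocates and fills its own input.
void prepare(std::vector<ThreadInput>& inputs, std::size_t n_per_thread) {
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        workers.emplace_back([&inputs, i, n_per_thread] {
            inputs[i].values.assign(n_per_thread, 1.0);
        });
    }
    for (auto& w : workers) w.join();
}

// Phase 2 (timed): each worker processes only its own input.
// Returns the elapsed wall-clock time in seconds. Unlike a TBB worker pool,
// this sketch also counts thread creation inside the timed region.
double timed_compute(const std::vector<ThreadInput>& inputs,
                     std::vector<double>& results) {
    results.assign(inputs.size(), 0.0);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        workers.emplace_back([&inputs, &results, i] {
            results[i] = std::accumulate(inputs[i].values.begin(),
                                         inputs[i].values.end(), 0.0);
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling prepare once and then timed_compute keeps allocation and copying out of the reported number, which mirrors the split between the two parallel_for loops. std::chrono::steady_clock is used because it is monotonic, so the interval is not affected by system clock adjustments.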