c++ - False sharing prevention with alignas is broken

Reposted by: 行者123 Updated: 2023-12-03 06:50:22

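I am not used to posting questions on the internet, so please tell me if I am doing something wrong.
In short

How do I correctly prevent false sharing on a 64-bit architecture whose CPU cache line size is 64 bytes?

How does using the C++ 'alignas' keyword, as opposed to a plain byte array (e.g. char[64]), affect multithreading efficiency?

Context
While working on a very efficient implementation of a Single Consumer Single Producer Queue, I ran into seemingly illogical behavior from the GCC compiler when benchmarking my code.
Full story
I hope someone has the knowledge needed to explain what is going on.
I am currently using GCC 10.2.0 and its C++20 implementation on Arch Linux. My laptop is a Lenovo T470S with an i7-7500U processor.
Let me start with the data structure: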

class SPSCQueue
{
public:
    ...

private:
    alignas(64) std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
    std::size_t _headCache { 0 }; // Head cache for the producer
    char _pad0[64 - sizeof(Buffer) - sizeof(std::size_t)]; // 64 bytes alignment padding

    alignas(64) std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
    std::size_t _tailCache { 0 }; // Tail cache for the consumer
    char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)]; // 64 bytes alignment padding
};

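The data structure above gives a fast and stable 20ns per push/pop on my system.
However, merely changing the alignment by using the following members makes the benchmark unstable, with results between 20 and 30ns.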

    alignas(64) std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    struct alignas(64) {
        Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
        std::size_t _headCache { 0 }; // Head cache for the producer
    };

    alignas(64) std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    struct alignas(64) {
        Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
        std::size_t _tailCache { 0 }; // Tail cache for the consumer
    };
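
One way to see what actually differs between the two layouts shown so far is to print member offsets. The sketch below is only an illustration under stated assumptions: `Buffer` is a stand-in with `int` elements, the names `LayoutA`, `LayoutB`, `kLine`, `lineOf` and `printLayouts` are made up here, only the producer half plus `_head` is reproduced, and a 64-bit target with 64-byte cache lines is assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdio>

constexpr std::size_t kLine = 64; // assumed cache line size

// Hypothetical stand-in for the queue's Buffer (pointer + capacity).
struct Buffer {
    int *data { nullptr };
    std::size_t capacity { 0 };
};

// First layout: alignas on the atomics only, producer cache packed right after tail.
struct LayoutA {
    alignas(kLine) std::atomic<std::size_t> tail { 0 };
    Buffer buffer {};
    std::size_t headCache { 0 };
    char pad0[kLine - sizeof(Buffer) - sizeof(std::size_t)];
    alignas(kLine) std::atomic<std::size_t> head { 0 };
};

// Second layout: producer cache wrapped in its own 64-byte aligned struct.
struct LayoutB {
    alignas(kLine) std::atomic<std::size_t> tail { 0 };
    struct alignas(kLine) ProducerCache {
        Buffer buffer {};
        std::size_t headCache { 0 };
    } producer;
    alignas(kLine) std::atomic<std::size_t> head { 0 };
};

// Index of the cache line, relative to the object start, that holds a byte offset.
constexpr std::size_t lineOf(std::size_t offset) { return offset / kLine; }

// Prints which relative cache line each producer-side member lands on.
inline void printLayouts() {
    assert(sizeof(LayoutA) == sizeof(LayoutB)); // same footprint, different placement
    std::printf("A: tail line %zu, buffer line %zu, headCache line %zu, head line %zu\n",
                lineOf(offsetof(LayoutA, tail)), lineOf(offsetof(LayoutA, buffer)),
                lineOf(offsetof(LayoutA, headCache)), lineOf(offsetof(LayoutA, head)));
    std::printf("B: tail line %zu, producer cache line %zu, head line %zu\n",
                lineOf(offsetof(LayoutB, tail)), lineOf(offsetof(LayoutB, producer)),
                lineOf(offsetof(LayoutB, head)));
}
```

On a typical x86-64 build, calling `printLayouts()` from `main` should show `tail`, `buffer` and `headCache` sharing line 0 in the first layout, while the second layout moves the producer cache to line 1. If so, a push touches one line in the first layout and two in the second, which may be part of why the timings differ.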

Finally, I got even more lost when I tried the following configuration, which gave me results between 40 and 55ns.


    std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    char _pad0[64 - sizeof(std::atomic<size_t>)];
    Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
    std::size_t _headCache { 0 }; // Head cache for the producer
    char _pad1[64 - sizeof(Buffer) - sizeof(std::size_t)];

    std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    char _pad2[64 - sizeof(std::atomic<size_t>)];
    Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
    std::size_t _tailCache { 0 }; // Tail cache for the consumer
    char _pad3[64 - sizeof(Buffer) - sizeof(std::size_t)];

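This time the queue push/pop oscillated between 40 and 55ns.
At this point I am quite lost, because I don't know where I should look for answers. Until now the C++ memory layout had seemed very intuitive to me, but I realize I am still missing important knowledge needed to do better at high-frequency multithreading.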
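
One thing worth checking for the char-padded variant is that padding alone fixes the size of each group but not where the object starts. The sketch below is a hedged illustration rather than a diagnosis of the benchmark: `PaddedOnly`, `cacheLineOf` and `bufferStraddlesLine` are hypothetical names, `Buffer` is a stand-in, only the producer half is reproduced, and a 64-bit target with 64-byte lines is assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

constexpr std::size_t kLine = 64; // assumed cache line size

// Hypothetical stand-in for the queue's Buffer (pointer + capacity).
struct Buffer {
    int *data { nullptr };
    std::size_t capacity { 0 };
};

// Producer half of the third layout: char padding only, no alignas anywhere.
struct PaddedOnly {
    std::atomic<std::size_t> tail { 0 };
    char pad0[kLine - sizeof(std::atomic<std::size_t>)];
    Buffer buffer {};
    std::size_t headCache { 0 };
    char pad1[kLine - sizeof(Buffer) - sizeof(std::size_t)];
};

// The padding fixes the size, but the required alignment stays that of the members.
static_assert(sizeof(PaddedOnly) == 2 * kLine, "assumes a 64-bit target");
static_assert(alignof(PaddedOnly) < kLine, "nothing forces a cache line boundary");

// Absolute index of the cache line containing address p.
inline std::uintptr_t cacheLineOf(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) / kLine;
}

// Builds a PaddedOnly `misalign` bytes past a line boundary and reports
// whether its `buffer` member ends up split across two cache lines.
inline bool bufferStraddlesLine(std::size_t misalign) {
    assert(misalign < kLine && misalign % alignof(PaddedOnly) == 0);
    alignas(kLine) unsigned char storage[sizeof(PaddedOnly) + kLine];
    PaddedOnly *queue = new (storage + misalign) PaddedOnly;
    const auto *first = reinterpret_cast<const unsigned char *>(&queue->buffer);
    const bool split = cacheLineOf(first) != cacheLineOf(first + sizeof(Buffer) - 1);
    queue->~PaddedOnly();
    return split;
}
```

With these assumptions, `bufferStraddlesLine(0)` is false and `bufferStraddlesLine(56)` is true: an object that is only 8-byte aligned can begin anywhere inside a line, so the padded groups no longer map one-to-one onto cache lines. Whether that happens in a given run depends on where the stack or allocator places the queue.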
Minimal code sample
If you want to compile the whole code and test it yourself, here are the few files needed:
SPSCQueue.hpp:


#pragma once

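#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <cinttypes>
#include <new>         // placement new
#include <stdexcept>   // std::runtime_error
#include <type_traits> // type traits used by the helpers and push()
#include <utility>     // std::move, std::forward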

#define KF_ALIGN_CACHELINE alignas(kF::Core::Utils::CacheLineSize)

namespace kF::Core
{
    template<typename Type>
    class SPSCQueue;

    namespace Utils
    {
        /** @brief Helper used to perfect forward move / copy constructor */
        template<typename Type, bool ForceCopy = false>
        void ForwardConstruct(Type *dest, Type *source) {
            if constexpr (!ForceCopy && std::is_move_constructible_v<Type>)
                new (dest) Type(std::move(*source));
            else
                new (dest) Type(*source);
        }

        /** @brief Helper used to perfect forward move / copy assignment */
        template<typename Type, bool ForceCopy = false>
        void ForwardAssign(Type *dest, Type *source) {
            if constexpr (!ForceCopy && std::is_move_assignable_v<Type>)
                *dest = std::move(*source);
            else
                *dest = *source;
        }

        /** @brief Theoretical cacheline size */
        constexpr std::size_t CacheLineSize = 64ul;
    }
}

/**
 * @brief The SPSC queue is a lock-free queue that only supports a Single Producer and a Single Consumer
 * The queue is really fast compared to other more flexible implementations because only two threads can simultaneously read / write,
 * which means that less synchronization is needed for each operation.
 * The queue supports ranged push / pop to insert multiple elements without performance impact
 *
 * @tparam Type to be inserted
 */
template<typename Type>
class kF::Core::SPSCQueue
{
public:
    /** @brief Buffer structure containing all cells */
    struct Buffer
    {
        Type *data { nullptr };
        std::size_t capacity { 0 };
    };

    /** @brief Local thread cache */
    struct Cache
    {
        Buffer buffer {};
        std::size_t value { 0 };
    };

    /** @brief Default constructor initialize the queue */
    SPSCQueue(const std::size_t capacity);

    /** @brief Destruct and release all memory (unsafe) */
    ~SPSCQueue(void) { clear(); std::free(_buffer.data); }

    /** @brief Push a single element into the queue
     *  @return true if the element has been inserted */
    template<typename ...Args>
    [[nodiscard]] inline bool push(Args &&...args);

    /** @brief Pop a single element from the queue
     *  @return true if an element has been extracted */
    [[nodiscard]] inline bool pop(Type &value);

    /** @brief Clear all elements of the queue (unsafe) */
    void clear(void);

private:
    KF_ALIGN_CACHELINE std::atomic<size_t> _tail { 0 }; // Tail accessed by both producer and consumer
    struct {
        Buffer _buffer {}; // Buffer cache for the producer, equivalent to _buffer2
        std::size_t _headCache { 0 }; // Head cache for the producer
        char _pad0[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
    };

    KF_ALIGN_CACHELINE std::atomic<size_t> _head { 0 }; // Head accessed by both producer and consumer
    struct{
        Buffer _buffer2 {}; // Buffer cache for the consumer, equivalent to _buffer
        std::size_t _tailCache { 0 }; // Tail cache for the consumer
        char _pad1[Utils::CacheLineSize - sizeof(Buffer) - sizeof(std::size_t)];
    };

    /** @brief Copy and move constructors disabled */
    SPSCQueue(const SPSCQueue &other) = delete;
    SPSCQueue(SPSCQueue &&other) = delete;
};

static_assert(sizeof(kF::Core::SPSCQueue<int>) == 4 * kF::Core::Utils::CacheLineSize);

template<typename Type>
kF::Core::SPSCQueue<Type>::SPSCQueue(const std::size_t capacity)
{
    _buffer.capacity = capacity;
    if (_buffer.data = reinterpret_cast<Type *>(std::malloc(sizeof(Type) * capacity)); !_buffer.data)
        throw std::runtime_error("Core::SPSCQueue: Malloc failed");
    _buffer2 = _buffer;
}

template<typename Type>
template<typename ...Args>
bool kF::Core::SPSCQueue<Type>::push(Args &&...args)
{
    static_assert(std::is_constructible<Type, Args...>::value, "Type must be constructible from Args...");

    const auto tail = _tail.load(std::memory_order_relaxed);
    auto next = tail + 1;

    if (next == _buffer.capacity) [[unlikely]]
        next = 0;
    if (auto head = _headCache; next == head) [[unlikely]] {
        head = _headCache = _head.load(std::memory_order_acquire);
        if (next == head) [[unlikely]]
            return false;
    }
    new (_buffer.data + tail) Type{ std::forward<Args>(args)... };
    _tail.store(next, std::memory_order_release);
    return true;
}

template<typename Type>
bool kF::Core::SPSCQueue<Type>::pop(Type &value)
{
    const auto head = _head.load(std::memory_order_relaxed);

    if (auto tail = _tailCache; head == tail) [[unlikely]] {
        tail = _tailCache = _tail.load(std::memory_order_acquire);
        if (head == tail) [[unlikely]]
            return false;
    }
    auto *elem = reinterpret_cast<Type *>(_buffer2.data + head);
    auto next = head + 1;
    if (next == _buffer2.capacity) [[unlikely]]
        next = 0;
    value = std::move(*elem);
    elem->~Type();
    _head.store(next, std::memory_order_release);
    return true;
}

template<typename Type>
void kF::Core::SPSCQueue<Type>::clear(void)
{
    for (Type type; pop(type););
}

The benchmark, using Google Benchmark.
bench_SPSCQueue.cpp:

#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>

#include <benchmark/benchmark.h>

#include "SPSCQueue.hpp"

using namespace kF;

using Queue = Core::SPSCQueue<std::size_t>;

constexpr std::size_t Capacity = 4096;

static void SPSCQueue_NoisyPush(benchmark::State &state)
{
    Queue queue(Capacity);
    std::atomic<bool> running = true;
    std::thread thd([&queue, &running] { for (std::size_t tmp; running; benchmark::DoNotOptimize(queue.pop(tmp))); });
    for (auto _ : state) {
        decltype(std::chrono::high_resolution_clock::now()) start;
        do {
            start = std::chrono::high_resolution_clock::now();
        } while (!queue.push(42ul));
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        auto iterationTime = elapsed.count();
        state.SetIterationTime(iterationTime);
    }
    running = false;
    if (thd.joinable())
        thd.join();
}
BENCHMARK(SPSCQueue_NoisyPush)->UseManualTime();

static void SPSCQueue_NoisyPop(benchmark::State &state)
{
    Queue queue(Capacity);
    std::atomic<bool> running = true;
    std::thread thd([&queue, &running] { while (running) benchmark::DoNotOptimize(queue.push(42ul)); });
    for (auto _ : state) {
        std::size_t tmp;
        decltype(std::chrono::high_resolution_clock::now()) start;
        do {
            start = std::chrono::high_resolution_clock::now();
        } while (!queue.pop(tmp));
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);
        auto iterationTime = elapsed.count();
        state.SetIterationTime(iterationTime);
    }
    running = false;
    if (thd.joinable())
        thd.join();
}
BENCHMARK(SPSCQueue_NoisyPop)->UseManualTime();

Best answer

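Thanks to your helpful comments (and mostly thanks to Peter Cordes), the issue seems to come from the L2 data prefetcher.
Because of my SPSC queue design, each thread has to access two consecutive cache lines to push/pop the queue.
If the structure itself is not aligned to 128 bytes, its address will not be 128-byte aligned, and the compiler will not be able to optimize access to the two aligned cache lines.
So the simple fix is: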

template<typename Type>
class alignas(128) SPSCQueue { ... };
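
As far as I understand Intel's description of the L2 spatial prefetcher, it tends to pull in the neighbouring line that completes a 128-byte aligned pair, so the aim is to keep each thread's two lines inside one such pair. The following is a minimal sketch of that idea, not the author's exact class: `PairAligned`, `kPairSize`, `pairOf` and `sidesInSeparatePairs` are hypothetical names, the `Buffer` members are omitted, and 64-byte lines are assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine = 64;      // assumed cache line size
constexpr std::size_t kPairSize = 128; // assumed: two adjacent 64-byte lines

// Simplified queue header: each member gets its own line, and the whole
// object starts on a 128-byte boundary so the lines pair up as (0,1) and (2,3).
struct alignas(kPairSize) PairAligned {
    alignas(kLine) std::atomic<std::size_t> tail { 0 }; // written by the producer
    alignas(kLine) std::size_t headCache { 0 };         // producer-local
    alignas(kLine) std::atomic<std::size_t> head { 0 }; // written by the consumer
    alignas(kLine) std::size_t tailCache { 0 };         // consumer-local
};

static_assert(alignof(PairAligned) == kPairSize, "object must start a 128-byte pair");
static_assert(sizeof(PairAligned) == 2 * kPairSize, "two pairs of two lines");

// Index of the 128-byte aligned block that contains address p.
inline std::uintptr_t pairOf(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) / kPairSize;
}

// True when the producer's two lines share one 128-byte block and the
// consumer's two lines share a different one.
inline bool sidesInSeparatePairs(const PairAligned &queue) {
    return pairOf(&queue.tail) == pairOf(&queue.headCache)
        && pairOf(&queue.head) == pairOf(&queue.tailCache)
        && pairOf(&queue.tail) != pairOf(&queue.head);
}
```

With only `alignas(64)` on the members, the object could start on an odd line, and the consumer's first line would then be paired with the producer's second one. Note that for heap-allocated instances the over-alignment is only honored by plain `new` from C++17 onward.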

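Here (section 2.5.5.4 Data Prefetching) is an interesting paper from Intel that explains the optimizations of their architectures and how prefetching is done at the different cache levels.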

Regarding c++ - False sharing prevention with alignas is broken, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63706666/
