c++ - MongoDB insert_many performance - C++


I am currently trying to maximize the write speed of inserts into MongoDB from a C++ application. The behavior I see is that the insert_many() operations slow down, which causes a write queue to build up, so each subsequent insert_many() has even more documents to insert. I wrote a small sample application to demonstrate the issue. It consists of two threads:

  1. The main thread reads a dictionary file (one word per line), counts the frequency of each letter in the word, pushes the per-letter results into a vector, and then signals the worker thread.
  2. The worker thread swaps a thread-safe double buffer, then iterates over the vector, turning each element into a document. After iterating the whole vector, it performs a bulk insert for each letter (collection).

struct CountData {
    CountData(const size_t p_index, const std::string& p_word, const size_t p_count)
        : index(p_index)
        , word(p_word)
        , count(p_count)
    {
    }

    const size_t index = 0;
    const std::string word;
    const int32_t count = 0;
};

struct CollectionData {
    CollectionData(const std::string& collectionName) : name(collectionName) {
        options.ordered(false);
        auto writeConcern = mongocxx::write_concern{};
        writeConcern.acknowledge_level(mongocxx::write_concern::level::k_unacknowledged);
        options.write_concern(writeConcern);
    }

    void push_back(const bsoncxx::document::value& value) { documents.push_back(value); }
    size_t size() const { return documents.size(); }

    void writeAll(mongocxx::pool& pool) {
        auto client = pool.acquire();
        auto collection = (*client)["frequency"][name];
        collection.insert_many(documents, options);
    }
    void clear() { documents.clear(); }

private:
    const std::string name;
    mongocxx::options::insert options;
    std::vector<bsoncxx::document::value> documents;
};

class FrequencyCounter {
public:
    FrequencyCounter(const std::string& mongoUri, const std::string& dictionaryFile)
        : _collectionNames({ "A", "B", "C", "D", "E", "F", "G", "H", "I",
                             "J", "K", "L", "M", "N", "O", "P", "Q", "R",
                             "S", "T", "U", "V", "W", "X", "Y", "Z" })
        , _mongoPool(mongocxx::uri(mongoUri))
        , _dictionary(dictionaryFile)
    {
        for(const auto& name : _collectionNames) {
            _collections.push_back(name);
        }
        _thread = std::thread(&FrequencyCounter::workerThread, this);
    }

    ~FrequencyCounter() {
        _isRunning = false;
        _event.notify_one();
        _thread.join();
    }

    void Run() {
        std::ifstream inFile(_dictionary);
        if(!inFile.is_open()) {
            std::cerr << "Could not open definition file: " << _dictionary << std::endl;
            std::exit(-1);
        }
        std::string line;

        while(std::getline(inFile, line)) {
            std::string word = line;
            std::transform(word.begin(), word.end(), word.begin(), ::toupper);
            size_t index = 0;
            for(const auto& letter : _collectionNames) {
                size_t count = std::count(word.begin(), word.end(), letter[0]);
                if(count > 0)
                    _dataQueue.addPending(CountData(index, word, count));
                ++index;
            }
            _event.notify_one();
        }
    }

private:
    void writeData(const bool flush=false) {
        if(!_dataQueue.trySwap())
            return; // No data to write
        const auto& dataQueue = _dataQueue.active();
        for(const auto& data : dataQueue) {
            const uint64_t begin = DateTime::now();
            auto doc = bsoncxx::builder::basic::document{};
            doc.append(bsoncxx::builder::basic::kvp("word", data.word));
            doc.append(bsoncxx::builder::basic::kvp("count", data.count));
            _collections[data.index].push_back(doc.extract());
            const uint64_t end = DateTime::now();
            _docCreationTimes.emplace_back(end - begin);
        }

        for(auto& collection : _collections) {
            const size_t currentSize = collection.size();
            if(flush || currentSize >= _maxDocQueueSize) {
                const uint64_t begin = DateTime::now();
                collection.writeAll(_mongoPool);
                const uint64_t end = DateTime::now();
                _docInsertionTimes.emplace_back(end - begin);
                collection.clear();
            }
        }
    }

    void workerThread() {
        try {
            while(_isRunning) {
                _event.wait();
                _event.reset();
                writeData();
            }
            const bool flush = true;
            writeData(flush);
        } catch(const std::exception& ex) {
            std::cerr << "Exception in thread: " << ex.what();
        }
        _isRunning = false;
        {
            uint64_t minTime = std::numeric_limits<uint64_t>::max();
            uint64_t maxTime = 0;
            uint64_t sumTime = 0;
            uint64_t count = 0;
            for(const auto& time : _docCreationTimes) {
                if(time < minTime)
                    minTime = time;
                if(time > maxTime)
                    maxTime = time;
                sumTime += time;
                ++count;
            }
            std::cout << "Doc Creation Time (avg): " << lPadd(std::to_string(sumTime / count), '0', 12) << "ns" << std::endl;
            std::cout << "Doc Creation Time (min): " << lPadd(std::to_string(minTime), '0', 12) << "ns" << std::endl;
            std::cout << "Doc Creation Time (max): " << lPadd(std::to_string(maxTime), '0', 12) << "ns" << std::endl;
        }
        {
            uint64_t minTime = std::numeric_limits<uint64_t>::max();
            uint64_t maxTime = 0;
            uint64_t sumTime = 0;
            uint64_t count = 0;
            for(const auto& time : _docInsertionTimes) {
                if(time < minTime)
                    minTime = time;
                if(time > maxTime)
                    maxTime = time;
                sumTime += time;
                ++count;
            }
            std::cout << "Doc Insertion Time (avg): " << lPadd(std::to_string(sumTime / count), '0', 12) << "ns" << std::endl;
            std::cout << "Doc Insertion Time (min): " << lPadd(std::to_string(minTime), '0', 12) << "ns" << std::endl;
            std::cout << "Doc Insertion Time (max): " << lPadd(std::to_string(maxTime), '0', 12) << "ns" << std::endl;
        }
    }

    const size_t _maxDocQueueSize = 10;
    const std::vector<std::string> _collectionNames;
    mongocxx::instance _mongoInstance;
    mongocxx::pool _mongoPool;
    std::string _dictionary;
    std::vector<CollectionData> _collections;
    AtomicVector<CountData> _dataQueue; // thread-safe double buffer
    std::vector<uint64_t> _docCreationTimes;
    std::vector<uint64_t> _docInsertionTimes;

    Event _event;
    volatile bool _isRunning = true;
    std::thread _thread;
};

int main(int argc, char* argv[]) {
    const std::string mongoUri = "mongodb://localhost:27017/?minPoolSize=50&maxPoolSize=50";
    const std::string dictionary = "words_alpha.txt";
    FrequencyCounter counter(mongoUri, dictionary);
    counter.Run();

    return 0;
}
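
The listing above relies on a few helpers that are not shown (AtomicVector, Event, DateTime::now() and lPadd()). For readers who want to reproduce the test, here is a minimal sketch of the double buffer and event; the names and exact semantics are assumptions, not the asker's actual implementations:

#include <condition_variable>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical double buffer: producers append to a "pending" vector,
// the consumer swaps it into the "active" vector when it wants to drain.
template <typename T>
class AtomicVector {
public:
    void addPending(T item) {
        std::lock_guard<std::mutex> lock(_mutex);
        _pending.push_back(std::move(item));
    }
    // Returns false when there is nothing new to consume.
    bool trySwap() {
        std::lock_guard<std::mutex> lock(_mutex);
        if(_pending.empty())
            return false;
        _active.clear();
        std::swap(_active, _pending);
        return true;
    }
    const std::vector<T>& active() const { return _active; } // consumer thread only

private:
    std::mutex _mutex;
    std::vector<T> _pending;
    std::vector<T> _active;
};

// Hypothetical manual-reset event built on a condition variable.
class Event {
public:
    void notify_one() {
        { std::lock_guard<std::mutex> lock(_mutex); _signaled = true; }
        _cv.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait(lock, [this] { return _signaled; });
    }
    void reset() {
        std::lock_guard<std::mutex> lock(_mutex);
        _signaled = false;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _signaled = false;
};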

Results:

Doc Creation Time (avg): 000,000,000,837ns
Doc Creation Time (min): 000,000,000,556ns
Doc Creation Time (max): 000,015,521,675ns
Doc Insertion Time (avg): 000,087,038,560ns
Doc Insertion Time (min): 000,000,023,311ns
Doc Insertion Time (max): 005,407,689,435ns

I have tried the following changes without success:

  • Creating a single Mongo client that stays open for the lifetime of the FrequencyCounter
  • Acquiring a client from the pool and calling insert_one() for each item in the vector, both with and without the pool (see the sketch after this list)
  • Using a separate database for each letter, still with the pool and insert_many()
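
For reference, the insert_one() variant amounted to something like the following sketch of CollectionData::writeAll() (an illustration of the attempt, not the exact code that was tried):

// Sketch of the insert_one() attempt: acquire a client from the pool and
// insert each buffered document individually instead of as one batch.
void writeAll(mongocxx::pool& pool) {
    auto client = pool.acquire();
    auto collection = (*client)["frequency"][name];
    for(const auto& doc : documents) {
        collection.insert_one(doc.view());
    }
}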

Are there any optimizations or changes I can make so that the worker thread can keep up with the high throughput of the main thread?

Best Answer

I realize this is a relatively old question, but you may see a performance improvement by using collection.bulk_write(bulk_write &bulk_write) to insert the records in your worker thread.

Bulk writes are created by appending a series of operations (mongocxx::model::insert_one, mongocxx::model::delete_one, etc.) to an instance of mongocxx::bulk_write (class reference docs), and the prepared batch is then executed with collection.bulk_write(bulk_write).
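
In outline, the pattern looks like the sketch below. The names here are placeholders: `documents` stands for any std::vector of bsoncxx::document::value and `collection` for a mongocxx::collection; newer mongocxx releases create the batch via collection.create_bulk_write() instead of constructing mongocxx::bulk_write directly.

// Minimal bulk-insert sketch (v3.0-era API, matching the benchmark code below).
mongocxx::options::bulk_write bulk_opt;
bulk_opt.ordered(false);                      // unordered batches can be applied more efficiently

mongocxx::bulk_write bulk{bulk_opt};
for(const auto& doc : documents) {            // documents: std::vector<bsoncxx::document::value>
    bulk.append(mongocxx::model::insert_one{doc.view()});
}
collection.bulk_write(bulk);                  // one round trip for the whole batch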

Some nice examples can be found here


Performance comparison:

Test 1:

Inserted 100000 in 27263651us insert_one
Inserted 100000 in 1129957us insert_many
Inserted 100000 in 916561us insert_bulk

Test 2:

Inserted 100000 in 28196463us insert_one
Inserted 100000 in 1089758us insert_many
Inserted 100000 in 967773us insert_bulk

These numbers were obtained with the code snippet below (note: mongocxx driver v3.0.3, MongoDB v3.2):

struct msg {
    long num;
    long size;
    long time;
};

//using insert_one()
void store_msg_one(std::vector<msg> lst)
{
    for(int i = 0; i < lst.size(); i++)
    {
        msg cur_msg = lst[i];
        bsoncxx::builder::stream::document msg_info_builder{};
        msg_info_builder << "msg_num" << cur_msg.num
                         << "msg_size" << cur_msg.size
                         << "msg_time" << cur_msg.time;
        bsoncxx::document::value doc_val = msg_info_builder << bsoncxx::builder::stream::finalize;

        collection.insert_one(doc_val.view());
    }
}

//using insert_many()
void store_msg_many(std::vector<msg> lst)
{
    std::vector<bsoncxx::document::value> lst2;
    for(int i = 0; i < lst.size(); i++)
    {
        msg cur_msg = lst[i];
        bsoncxx::builder::stream::document msg_info_builder{};
        msg_info_builder << "msg_num" << cur_msg.num
                         << "msg_size" << cur_msg.size
                         << "msg_time" << cur_msg.time;
        bsoncxx::document::value doc_val = msg_info_builder << bsoncxx::builder::stream::finalize;
        lst2.push_back(doc_val);
    }
    collection.insert_many(lst2);
}

//using bulk_write()
void store_msg_bulk(std::vector<msg> lst)
{
    mongocxx::options::bulk_write bulk_opt;
    mongocxx::write_concern wc;
    bulk_opt.ordered(false); //see https://docs.mongodb.com/manual/core/bulk-write-operations/
    wc.acknowledge_level(mongocxx::write_concern::level::k_default);
    bulk_opt.write_concern(wc);

    mongocxx::bulk_write bulk = mongocxx::bulk_write{bulk_opt};
    for(int i = 0; i < lst.size(); i++)
    {
        msg cur_msg = lst[i];
        bsoncxx::builder::stream::document msg_info_builder{};
        msg_info_builder << "msg_num" << cur_msg.num
                         << "msg_size" << cur_msg.size
                         << "msg_time" << cur_msg.time;

        bsoncxx::document::value doc_val = msg_info_builder << bsoncxx::builder::stream::finalize;
        mongocxx::model::insert_one msg_info_insert_op{doc_val.view()};
        bulk.append(msg_info_insert_op);
    }
    collection.bulk_write(bulk);
}

int main()
{
    std::vector<msg> lst;
    int num_msg = 100000;
    for(int i = 0; i < num_msg; i++)
    {
        msg info;
        info.time = 20*i;
        info.num = i;
        info.size = sizeof(i);
        lst.push_back(info);
    }

    //Test with insert_one(...)
    long long start_microsecs = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now().time_since_epoch()).count();
    store_msg_one(lst);
    long long end_microsecs = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now().time_since_epoch()).count();
    std::cout << "Inserted " << num_msg << " in " << end_microsecs - start_microsecs << "us" << " insert_one(...)" << std::endl;

    //Test with insert_many(...)
    start_microsecs = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now().time_since_epoch()).count();
    store_msg_many(lst);
    end_microsecs = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now().time_since_epoch()).count();
    std::cout << "Inserted " << num_msg << " in " << end_microsecs - start_microsecs << "us" << " insert_many(...)" << std::endl;

    //Test with bulk_write(...)
    start_microsecs = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now().time_since_epoch()).count();
    store_msg_bulk(lst);
    end_microsecs = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now().time_since_epoch()).count();
    std::cout << "Inserted " << num_msg << " in " << end_microsecs - start_microsecs << "us" << " insert_bulk(...)" << std::endl;
    std::cin.ignore();
    return 0;
}

Note: see the MongoDB docs for more information about bulk_write options.

Edit: formatting

Regarding c++ - MongoDB insert_many performance - C++, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46917489/
