c++ - 使用C++按字典顺序在一个巨大的文件中对文本行进行外部排序-6ren

c++ - 使用C++按字典顺序在一个巨大的文件中对文本行进行外部排序

转载作者：行者123 更新时间：2023-12-01 14:54:30

我们有200Gb文件，其中包含用“\ n”分隔的文本行。我们的服务器具有板载Linux，gcc，8GB RAM和无限的硬盘空间。
从我们的 Angular 来看，要求是在C++中实现最有效的方法来按字典顺序对文件的行进行排序。

我已经按照字典顺序正确地对输入文件(最大2GB)进行了文本行的外部排序。但是，当我使用大于2GB的文件进行测试时，输出不正确。必须测试最大文件大小大于200GB。

以下是我的代码，文件大小小于2GB时效果很好。
程序具有3个参数作为命令行参数:
输入文件名，输出文件名和内存限制(以字节为单位)(我们将测试不同的内存限制，不仅是8GB)
该程序应该能够在简单的边界情况下正常工作，例如与
小文件，空文件，大于200GB的文件。它应该与非ascii符号一起很好地工作，但是
我们可以假设文件中不存在零字节。我们还可以假设内存限制远大于最长行的大小。

这是代码。请帮助我指出文件大小大于2GB不能产生正确结果的原因。如果您能在下面给我修改后的代码的版本，使其在所有情况下均有效，将不胜感激。

#include <stdio.h>
#include <stdlib.h>
#include <fstream>
#include <sstream>
#include <iostream> 
#include <algorithm>
#include <vector>
#include <string>
#include <iterator>

//using namespace std;

#define FILESIZE  200000000000          // size of the file on disk
#define TOTAL_MEM 7000000000            // max items the memory buffer can hold

typedef std::vector<std::string> DataContainer;


// Function to sort lines of text according to their length
/*bool compare(std::string &a, std::string &b) 
{ 
    if (a.size() < b.size())
        return true;
    else if (a.size() > b.size())
        return false;
    else
        return a < b;

    //return strcasecmp( a.c_str(), b.c_str() ) <= 0;
} 

long estimateSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory) 
{

}
*/
std::string ToString(int val) {
    std::stringstream stream;
    stream << val;
    return stream.str();
}

void ExternalSort(std::string infilepath, std::string outfilepath)
{
    std::string line, new_line;
    std::string tfile = "temp-file-";
    int i, j, k;
    DataContainer v;

    std::ifstream infile;

    if(!infile.is_open())
    {   
        std::cout << "Unable to open file\n";           
    }
    infile.open(infilepath, std::ifstream::in);

    // Check if input file is empty
    if (infile.peek() == std::ifstream::traits_type::eof())
    {
        std::cout << "File is empty";
    }

    //std::vector<std::string> x(std::istream_iterator<std::string>(infile), {});

    // Get size of the file
    infile.seekg (0, infile.end);
    int input_file_size = infile.tellg();
    infile.seekg (0, infile.beg);
    std::cout << "The size of the file chosen is (in bytes): " << input_file_size << "\n";

    // Estimate required number of runs to read all data in the input file
    int runs_count;
    if(input_file_size % TOTAL_MEM > 0)
        runs_count = input_file_size/TOTAL_MEM + 1;     // if file size is fixed at 200GB, we just need to define FILESIZE: runs_count = FILESIZE/TOTAL_MEM
    else
        runs_count = input_file_size/TOTAL_MEM;

    // Iterate through the elements in the file
    for(i = 0; i < runs_count; i++)
    {
        // Step 1: Read M-element chunk at a time from the file
        for (j = 0; j < (TOTAL_MEM < input_file_size ? TOTAL_MEM : input_file_size); j++)
        {
            while(std::getline(infile, line))
            {
                // If line is empty, ignore it
                if(line.empty())
                    continue;
                new_line = line + "\n";
                // Line contains string of length > 0 then save it in vector
                if(new_line.size() > 0)
                    v.push_back(new_line);
            }   
        }
        // Step 2: Sort M elements
        sort(v.begin(), v.end());       //sort(v.begin(), v.end(), compare); 

        // Step 3: Create temporary files and write sorted data into those files.
        std::ofstream tf;
        tf.open(tfile + ToString(i) + ".txt", std::ofstream::out | std::ofstream::app); 
        std::ostream_iterator<std::string> output_iterator(tf, "\n");
        std::copy(v.begin(), v.end(), output_iterator);

    /*  for(vector<std::string>::const_iterator it = v.begin(); it != v.end(); ++it)
            tf << *it << "\n";
        */
        tf.close();
    }   
    infile.close();

    // Step 4: Now open each file and merge them, then write back to disk
    DataContainer vn;
    for(i = 0; i < runs_count; i++)     
    {
        std::ifstream sortedFiles[runs_count];                              
        std::ostringstream filename;
        filename << tfile << i << ".txt";
        sortedFiles[i].open(filename.str(), std::ifstream::in);
        //sortedFiles[i].open(tfile + ToString(i) + ".txt", std::ifstream::in); 

        std::string st, new_st;
        while(std::getline(sortedFiles[i], st))
        {
            // If line is empty, ignore it
            if(st.empty())
                continue;
            new_st = st + "\n";
            if(new_st.size() > 0)
                vn.push_back(new_st);
        }
        sortedFiles[i].close();
    }
    // Step 5: Merge total sorted data
    sort(vn.begin(), vn.end());

    std::ofstream outfile;
    outfile.open(outfilepath, std::ofstream::out | std::ofstream::app);     
    std::ostream_iterator<std::string> output_iterator(outfile, "\n");
    std::copy(vn.begin(), vn.end(), output_iterator);

    outfile.close();    // Close final sorted file

}

int main(int argc, char** argv)         //      e.g.     argv[1] = "E:\\data.txt"           argv[2] = "E:\\sorted_data.txt"
{

    std::cout << "You have entered " << argc 
         << " arguments:\n"; 

    for (int i = 0; i < argc; ++i) 
        std::cout << argv[i] << "\n"; 

    ExternalSort(argv[1], argv[2]);

    return 0; 
}

最佳答案

有关外部排序的完整实现，请参见下面的编辑-无保修。

假设您有两个升序排序文件(每个数字是一行):

文件A:1 3 5 7

文件B:10 2 4 6 8

预期结果是文件C:1 10 2 3 4 5 6 7 8

如果您只是像用那样将连接到文件，则将得到:1 3 5 7 10 2 4 6 8;这不是您所期望的。

您应该从两个文件中读取并编写词典最小的行。使用包含相同行数和交替行(例如1 3 5和2 4 6)的文件，可以很容易地理解这一点:
从A和B读取第一行。
将最小的内容写入C.
从文件中读取写入编号所属的下一行。
转到2。

如果的行数不同，则到达一个文件的末尾后，写入第二个文件的其余部分:
File A: 1 3 5 7 File B: 2 4 8 A B C -------- 1 2 -> 1 3 2 -> 2 3 4 -> 3 5 4 -> 4 5 8 -> 5 7 8 -> 7 8 -> 8

一旦有了可以合并两个文件的算法，就可以在一对文件上运行它，直到只有一个文件结束。假设您对文件A，B，C和D进行了排序:
将A和B合并为AB。
删除A和B。
将C＆D合并到CD中。
删除C＆D。
将AB＆CD合并为ABCD。 这就是结果。
删除AB和CD。
A B C D | | | | \ / \ / AB CD | | \ / ABCD

这是匆忙编写的双向合并-不作保证:
#include <iostream> #include <iostream> #include <fstream> #include <string> #include <algorithm> class two_way_merge_t { public: two_way_merge_t(std::istream& a, std::istream& b, std::ostream& c) : a_{ a }, b_{ b }, c_{ c } { // nop } void execute() { using namespace std; // get first line from a; assumes a is not empty a_.get(); // get first line from b; assumes b is not empty b_.get(); // write whichever is less and advance while (write_one()) ; } protected: // helper struct reader_t { std::istream& is_; std::string line_; bool get() { return (bool)std::getline(is_, line_); } void write(std::ostream& os) { os << line_ << std::endl; } void write_rest(std::ostream& os) { do write(os); while (get()); } }; reader_t a_, b_; std::ostream& c_; bool write_one() { using namespace std; if (lexicographical_compare(a_.line_.begin(), a_.line_.end(), b_.line_.begin(), b_.line_.end())) { a_.write(c_); if (!a_.get()) return b_.write_rest(c_), false; } else { b_.write(c_); if (!b_.get()) return a_.write_rest(c_), false; } return true; } }; void validate(std::istream& is) // lines count not validated { using namespace std; string prev; string cur; while (getline(is, cur)) { if (!lexicographical_compare(prev.begin(), prev.end(), cur.begin(), cur.end())) throw "bad two-way merge"; swap(prev, cur); } } int main() { using namespace std; { ifstream a("c:\\temp\\a.txt"); if (!a) return -1; ifstream b("c:\\temp\\b.txt"); if (!b) return -2; ofstream c("c:\\temp\\c.txt"); if (!c) return -3; two_way_merge_t twm(a, b, c); twm.execute(); } ifstream c("c:\\temp\\c.txt"); if (!c) return -4; try { validate(c); } catch (const char* e) { cout << e; return -5; } }

E D I T
#include <iostream> #include <fstream> #include <string> #include <algorithm> #include <vector> #include <filesystem> //------------------------------------------------------------------- class two_way_merge_2_t { public: two_way_merge_2_t(std::vector<std::string>& a, std::istream& b, std::ostream& c) : a_{ a }, b_{ b }, c_{ c } { // nop } void execute() { using namespace std; // get first line from a; assumes a is not empty if ( ! a_.get() ) return b_.write_rest(c_); // get first line from b; assumes a is not empty if (!b_.get()) return a_.write_rest(c_); // we have managed to read from a and b: write whichever is less and advance while (write_one()) ; } protected: // helper struct isreader_t { std::istream& is_; std::string line_; bool get() { return (bool)std::getline(is_, line_); } void write(std::ostream& os) { os << line_ << std::endl; } void write_rest(std::ostream& os) { do write(os); while (get()); } const std::string& line() const { return line_; } }; // helper struct vreader_t { std::vector<std::string>& v_; std::vector<std::string>::size_type i_ = std::vector<std::string>::size_type (-1); bool get() { return ++i_ < v_.size(); } void write(std::ostream& os) { os << v_[i_] << std::endl; } void write_rest(std::ostream& os) { do write(os); while (get()); } const std::string& line() const { return v_[i_]; } }; vreader_t a_; isreader_t b_; std::ostream& c_; bool write_one() { using namespace std; if (lexicographical_compare(a_.line().begin(), a_.line().end(), b_.line().begin(), b_.line().end())) { a_.write(c_); if (!a_.get()) return b_.write_rest(c_), false; } else { b_.write(c_); if (!b_.get()) return a_.write_rest(c_), false; } return true; } }; //------------------------------------------------------------------- class partial_sort_t { public: partial_sort_t(std::istream& a, std::istream& b, std::ostream& c, std::size_t mem) : a_{ a } , b_{ b } , c_{ c } , mem_{ mem } { // nop } void execute() { sort(); two_way_merge_2_t(va_, b_, c_).execute(); } protected: std::istream& a_; std::istream& b_; std::ostream& c_; const std::size_t mem_; std::vector<std::string> va_; void sort() { std::streamsize max_gpos = a_.tellg() + std::streamsize(mem_); std::string line; va_.clear(); while (a_.tellg() < max_gpos && std::getline(a_, line) ) va_.push_back(std::move(line)); std::sort(va_.begin(), va_.end()); } }; //------------------------------------------------------------------- class external_sort_t { public: external_sort_t(const char* fn, const std::size_t mem) : is_{ fn }, mem_{ mem } { // nop } const char* execute() { make_b(); make_c(); while (is_) { partial_sort_t(is_, b_, c_, mem_).execute(); swap_bc(); } return done(); } protected: const std::size_t mem_; std::ifstream is_; std::ifstream b_; std::string bpath_; std::ofstream c_; std::string cpath_; static std::string make_temp_file() { std::string fn(512, 0); tmpnam_s(&fn.front(), fn.size()); std::ofstream os(fn); os.close(); return fn; } void make_b() { b_.close(); bpath_ = make_temp_file(); b_.open(bpath_); } void make_c() { c_.close(); cpath_ = make_temp_file(); c_.open(cpath_); } void swap_bc() { b_.close(); c_.close(); std::swap(bpath_, cpath_); b_.open(bpath_); c_.open(cpath_); } const char* done() { b_.close(); c_.close(); std::experimental::filesystem::remove(cpath_); return bpath_.c_str(); } }; void validate(std::istream& is) // lines count not validated { using namespace std; string prev; string cur; while (getline(is, cur)) { if ( strcmp(prev.c_str(), cur.c_str()) > 0 ) throw "bad two-way merge"; swap(prev, cur); } } void validate(const char* fn) { using namespace std; ifstream is{ fn }; if (!is) throw "cannot open"; validate(is); } void test_external_sort() { using namespace std; external_sort_t es{ "c:\\temp\\a.txt", 10 }; std::string rfn = es.execute(); try { validate(rfn.c_str()); } catch (const char* e) { cout << e << endl; } } int main() { test_external_sort(); }

关于c++ - 使用C++按字典顺序在一个巨大的文件中对文本行进行外部排序，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/58872587/

文章推荐： c++ - 如何删除在另一个函数中初始化的动态分配的数组？

文章推荐： c++ - 错误C2079 'std::pair,Dummy>::first'使用未定义的类 'Dummy'

文章推荐： c++ - VB.NET DLL中的C++ DLL函数复制

文章推荐： c++ - 带有 Cuda 的 LibTorch(C++) 引发异常

android - 无法在共享首选项中存储字符串(巨大)
我通过在共享首选项中使用 GSON 将其转换为 json 来存储我的复杂对象。但是在检索它时，无法获得预期的字符串。代码这里 holderListCustomizationMap 是对象的复杂映射
rust - 如何遍历(巨大)压缩文件的行？
因此，我正在尝试对大于可用RAM的gz压缩文件执行某种面向行的操作，因此排除了将其首先读取为字符串的情况。问题是，如何在rust(缺少gunzip file.gz|./my-rust-program)
javascript - 巨大 float 的国际格式
我试图更好地理解为什么具有潜在大精度的大数字处理不一致，特别是在 JavaScript 及其本地化工具(例如 ECMA-402/Intl)中。我假设这与 float 的使用有关，但我想了解限制在哪里和
mysql - CSV(巨大)到基于网络的数据库
我们有一个 5GB 的 csv 文件，这是我们业务的主列表。有多个类别，每个类别包含数千条记录。我们的目标是将每个类别导出为其自己的 csv 文件。我们如何运行查询并导出数据？运行 OSX。有没
R:巨大(> 20GB)文件的xmlEventParse期间的内存管理
基于上一个问题 ( see here )，我试图通过 xmlEventParse 读取许多大型 xml 文件，同时保存节点变化数据。使用此示例 xml:https://www.nlm.nih.gov/
巨大 CSV 文件的 Java 内存问题
我正在开发一个系统，它加载一个巨大的 CSV 文件(超过 100 万行)并保存到数据库中。每行也有超过一千个字段。 CSV 文件被视为一个批处理，每一行都被视为其子对象。在添加对象的过程中，每个对象都
javascript - Browserify 产生一个*巨大*的输出文件
借助node-google模块我编写了一个简单的 Node 模块来为我的网络应用程序启用“文本网络搜索”功能，并在我的一个 View 中显示结果。由于在来自同一 IP 的少量查询后 Google
python - 巨大 numpy (HDF5) 阵列的统计数据
我有相当大的 4D 阵列 [20x20x40x15000]，我使用 h5py 将其作为 HDF5 文件保存到磁盘.现在的问题是我想计算整个数组的平均值，即使用: numpy.average(HDF5_
巨大 CString 的 C++ 串联
我在遗留代码库中连接巨大的 CString 时遇到问题。 CStrings 可以包含 base64 编码的文件，因此可能很大。在某些时候，这些 CString 会像这样连接起来: result +=
python - 使用 django 从远程提供可下载文件(巨大)
我正在尝试让我的服务器提供来自另一台服务器的巨大文件。但是，为了保护我的凭据免受该远程服务器的攻击，我不能简单地将请求者重定向到文件 url；另一方面，虽然使用 StreamingHttpRespon
mongodb - 为什么我的 mongo local db oplog 巨大
感谢对此的任何见解，我有 2 个问题: 1) 弄清楚为什么我的本地数据库 oplog 庞大且不断增长 2) 安全删除(或重置)我的 local.oplog 以释放 18 GB 的浪费空间场景:我一直
javascript - 在 Node JS 中获取、操作和编写(响应)巨大 JSON 的最佳方式？
我的预期任务:获取大量数据(1 GB 及更多大小)json 字符串，操作(进行一些格式化、解析 json、重组 json 数据)并写入新格式化的 json 字符串作为响应。处理这种情况的更好方法是什么
Angular 2 和 Angular Cli tree shaking 生产构建，vendor.js 巨大
我做了一个小的 Angular 4 应用程序，但我不知道如何应用 tree shaking 和 aot 编译。我运行的命令如下: ng build --prod --aot 但我得到的结果仍然很大，供

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

c++ - 使用C++按字典顺序在一个巨大的文件中对文本行进行外部排序