Coupling C++ mmap "fast" reading with gzip files

Reposted by: 行者123 · Updated: 2023-12-03 12:47:23

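I'm pretty new to C++, so sorry if I'm asking something silly, but I could not find an answer online (only one post, about Python: "Can mmap and gzip collaborate?"). I am trying to see whether it is possible to read a .GZ file through the mmap() function (following "Fast textfile reading in c++") in order to do some operations on the file and write the result to another file. I need to keep only some of the original rows and columns, selected by certain column/field values, so that I can later retrieve them and compare them with similar files from other subjects to extract similarities and differences. The files are very large (up to 10 GB as .GZ), so I was trying to apply the fast method to GZIP files. It is mostly meant as a performance comparison with other methods. This is the code (sorry, it is long and, I think, awful):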

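#include <algorithm>
#include <cstdint>   // uintmax_t
#include <cstdio>    // perror
#include <cstdlib>   // exit
#include <cstring>
#include <iostream>
#include <iterator>  // std::back_inserter
#include <string>
#include <vector>
#include <typeinfo>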

// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

//for writefile
#include <fstream>

template <int N>
void emptyArray( char (&arrayName) [N] ) {
  std::fill( std::begin( arrayName ), std::end( arrayName ), 0 );
}

const char* map_file(const char* fname, size_t& length);

int main() {

// map the file into memory and get its size
size_t length;
auto f = map_file("myfile.vcf", length);
auto l = f + length;


uintmax_t m_numLines = 0;

std::vector<int> v0;
std::vector<int> v1;
std::vector<int> v2;

for (int i=1; i<length; i++) {
  // positions of every '#' that is the first character of a line
  if (f[i] == '#' && f[i-1] == '\n') v0.push_back(i);
  // positions where a new line starts (one past each '\n')
  if (f[i] == '\n') v1.push_back(i+1);
}

std::vector<int> inter;
set_intersection(v0.begin(), v0.end(),
                  v1.begin(), v1.end(),
                  back_inserter(inter));

v1.erase(set_difference(v1.begin(), v1.end(),
                        inter.begin(), inter.end(),
                        v1.begin()), v1.end());

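// v1 keeps its last entry (one past the final newline) as the end marker of the
// last data line, so v1[nl+1] below never reads past the end of the vector


char chromArray[3];
char posArray[10];
char refArray[50];
char altArray[50];
char qualityArray[10];
char gtArray[4];
char gqxArray[5];
char dpArray[5];
char adArray[10];

// LOOP over data lines
// iterate over the vector of line starts (lines that do not begin with '#')
for (size_t nl = 0; nl + 1 < v1.size(); nl++) {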

  // COUNTERS

  int ncol = 0;
  int chri = 0;
  int posi = 0;
  int refi = 0;
  int alti = 0;

  int qi = 0;
  int formatHeaderCount = 0;
  int formatLastCount = 0;
  int numGT = 0;
  int gti = 0;
  int numGQX = 0;
  int gqxi = 0;
  int numDP = 0;
  int dpi = 0;
  int numAD = 0;
  int adi = 0;

  std::string chromValue;
  emptyArray(chromArray);
  std::string posValue;
  emptyArray(posArray);
  std::string refValue;
  emptyArray(refArray);
  std::string altValue;
  emptyArray(altArray);
  std::string quality;
  emptyArray(qualityArray);
  std::string gtValue;
  emptyArray(gtArray);
  std::string gqxValue;
  emptyArray(gqxArray);
  std::string dpValue;
  emptyArray(dpArray);
  std::string adValue;
  emptyArray(adArray);

  for( int start=v1[nl]; start<v1[nl+1]; start++  ) {
    if (f[start] == '\t') ncol++;
    if (ncol == 0) {
      if ( f[start] != '\t' && f[start] != 'c' && f[start] != 'h' && f[start] != 'r' ) {
        chromArray[chri] = f[start];
        chri++;
      }
    }

    if (ncol == 1) {
      if ( f[start] != '\t' ) {
        posArray[posi] = f[start];
        posi++;
      }
    }

    if (ncol == 3) {
      if ( f[start] != '\t' ) {
        refArray[refi] = f[start];
        refi++;
      }
    }

    if (ncol == 4) {
      if ( f[start] != '\t' ) {
        altArray[alti] = f[start];
        alti++;
      }
    }

    if (ncol == 5) {
      if ( f[start] != '\t' ) {
        qualityArray[qi] = f[start];
        qi++;
      }
    }

    if (ncol == 8) {
      if ( f[start] != '\t' ) {
        if (f[start] == ':') formatHeaderCount++;
        if (f[start] == 'G' && f[start+1] == 'T' && f[start+2] == ':' ) {
          numGT = formatHeaderCount;
        }
        if (f[start] == ':' && f[start+1] == 'G' && f[start+2] == 'Q' &&  f[start+3] == 'X' && f[start+4] == ':') {
          numGQX = formatHeaderCount;
        }

        if (f[start] == ':' && f[start+1] == 'D' && f[start+2] == 'P' && ( f[start+3] == ':' || ( f[start+3] == 'I' && f[start+4] == ':') )) {
          numDP = formatHeaderCount;
        }

        if (f[start] == ':' && f[start+1] == 'A' && f[start+2] == 'D' && f[start+3] == ':' ) {
          numAD = formatHeaderCount;
        }

      }
    }


    if (ncol == 9) {
      if ( f[start] != '\t' ) {
        if (f[start] == ':') formatLastCount++;
        if (formatLastCount == numGT) {
          if ( f[start] != ':' ) {
            gtArray[gti] = f[start];
            gti++;
          }
        }

        if (formatLastCount == numGQX) {
          if ( f[start] != ':' ) {
            gqxArray[gqxi] = f[start];
            gqxi++;
          }
        }

        if (formatLastCount == numDP) {
          if ( f[start] != ':' ) {
            dpArray[dpi] = f[start];
            dpi++;
          }
        }

        if (formatLastCount == numAD) {
          if ( f[start] != ':' ) {
            adArray[adi] = f[start];
            adi++;
          }
        }

      }
    }


  }

  chromValue.append(chromArray);
  posValue.append(posArray);
  refValue.append(refArray);
  altValue.append(altArray);
  quality.append(qualityArray);
  gtValue.append(gtArray);
  gqxValue.append(gqxArray);
  dpValue.append(dpArray);
  adValue.append(adArray);

  if (gqxi < 2 || dpi < 2 || qi < 2) continue;
  if (stoi(gqxValue) < 30) continue;

  std::ofstream myfile ("myRes.txt", std::ios_base::app);
  if (myfile.is_open()) {
    myfile <<
            nl << "\t" <<
            chromValue << "-" << posValue << "-" << refValue << "-" << altValue << "\t" <<
            gtValue << "\t" <<
            gqxValue << "\t" <<
            quality << "\t" <<
            dpValue << "\t" <<
            adValue <<
            "\n";
    myfile.close();
  } else {
    std::cout << "Unable to open file" << '\n';
  }
}

}

void handle_error(const char* msg) {
perror(msg);
exit(255);
}

const char* map_file(const char* fname, size_t& length) {

int fd = open(fname, O_RDONLY);

if (fd == -1)
    handle_error("open");

struct stat sb;
if (fstat(fd, &sb) == -1)
    handle_error("fstat");
length = sb.st_size;

const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
if (addr == MAP_FAILED)
    handle_error("mmap");

return addr;
}

Now, I know I can open GZIP files with the following:

#include <fstream>
#include <iostream>
#include <sstream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/gzip.hpp>

// NB: this must be linked against the Boost iostreams and zlib libraries, e.g.: c++ --std=c++11 -L/opt/X11/lib -lboost_iostreams -lz gzread.cpp -o gzread

using namespace std;
using namespace boost::iostreams;

int main()
{
ifstream file("myfile.gz", ios_base::in | ios_base::binary);
filtering_streambuf<input> inbuf; // initialize the filtering_streambuf
inbuf.push(gzip_decompressor()); // add the GZIP decompressor (for a GZIP file)
inbuf.push(file); // add the file as the source

//Convert streambuf to istream
std::istream instream(&inbuf);
//Iterate lines
std::string line;

string chr;

while(std::getline(instream, line)) {
istringstream iss(line); // string stream over the current line
string field;
int i = 0;
while (getline(iss, field, '\t')) { // read the next tab-separated field; getline's third argument is the delimiter (newline if omitted)
if (i == 2) cout << field << "\n"; // print the third column
++i;
}
}
// copy(inbuf, cout); // copy everything to stdout
}

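Here is an example of the lines in the file:

Is there a way to combine the two? Or are there other methods that would perform even better?

Thanks a lot for any advice!

Best answer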

You can use zlib's inflate() function to read a memory-mapped gzip file. (Read the documentation in zlib.h.)

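However, whether you read from a file or from a memory map, you cannot jump around in the uncompressed data. The uncompressed data must be processed sequentially, or saved sequentially for later random-access processing.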
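Because the decompressed bytes only become available in order, the position-indexing approach from the question (collecting line offsets over the whole buffer first) does not carry over directly. One way to process the data sequentially is sketched below, using only the standard library: decompressed chunks are fed in as they arrive, complete lines are handed to a callback, and an unfinished tail is kept until the next chunk. `LineAssembler` and `split_tabs` are hypothetical names introduced for this illustration.

```cpp
// Sketch only: rebuild lines from chunks that arrive in order, then split on tabs.
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Split one line into tab-separated fields without copying the text.
std::vector<std::string_view> split_tabs(std::string_view line) {
    std::vector<std::string_view> fields;
    std::size_t start = 0;
    for (;;) {
        const std::size_t tab = line.find('\t', start);
        if (tab == std::string_view::npos) {
            fields.push_back(line.substr(start));
            return fields;
        }
        fields.push_back(line.substr(start, tab - start));
        start = tab + 1;
    }
}

// Reassembles complete lines from sequential chunks of decompressed data.
class LineAssembler {
public:
    // Feed the next chunk; on_line is called once per complete line (without '\n').
    // The string_view passed to on_line is only valid during the call.
    template <class F>
    void feed(const char* data, std::size_t size, F&& on_line) {
        pending_.append(data, size);
        std::size_t start = 0;
        for (;;) {
            const std::size_t nl = pending_.find('\n', start);
            if (nl == std::string::npos) break;
            on_line(std::string_view(pending_).substr(start, nl - start));
            start = nl + 1;
        }
        pending_.erase(0, start);  // keep only the unfinished tail
    }

    // Call after the last chunk to flush a final line that has no trailing '\n'.
    template <class F>
    void finish(F&& on_line) {
        if (!pending_.empty()) on_line(std::string_view(pending_));
        pending_.clear();
    }

private:
    std::string pending_;
};
```

In the callback, a line starting with '#' can be skipped and the remaining lines passed to split_tabs, after which the columns of interest (for example fields 0, 1, 3, 4 and 5 in the question's code) can be tested and written out. This trades the zero-copy property of scanning a mapped text file for the ability to work on a stream, which is what compressed input requires.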

Regarding "Coupling C++ mmap "fast" reading with gzip files", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55288310/
