c++ - Extracting text from a PDF using zlib

Reposted. Author: 太空宇宙. Updated: 2023-11-04 15:45:39

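I am using the function below to find text in a PDF file and replace it with other text. The problem is that when I inflate a stream, change the text, and deflate it again, some text or graphics are sometimes missing from the final PDF. Is this a bug in my code, does the zlib library not support this kind of compression, or is it something else?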

// Open the PDF source file:
FILE *pdfFile = fopen([sourceFile cStringUsingEncoding:NSUTF8StringEncoding], "rb");

if (pdfFile) {
    // Get the file length:
    int fseekres = fseek(pdfFile, 0, SEEK_END);

    if (fseekres != 0) {
        fclose(pdfFile);
        return nil;
    }

    long filelen = ftell(pdfFile);
    fseekres = fseek(pdfFile, 0, SEEK_SET);

    if (fseekres != 0) {
        fclose(pdfFile);
        return nil;
    }

    char *buffer = new char[filelen];
    size_t actualread = fread(buffer, filelen, 1, pdfFile);

    if (actualread != 1) {
        fclose(pdfFile);
        return nil;
    }

    bool morestreams = true;

    while (morestreams) {
        size_t streamstart = [self findStringInBuffer:buffer search:(char *)"stream" buffersize:filelen];
        size_t streamend = [self findStringInBuffer:buffer search:(char *)"endstream" buffersize:filelen];

        [self saveFile:buffer len:streamstart + 7 fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];

        if (streamstart > 0 && streamend > streamstart) {
            streamstart += 6;

            if (buffer[streamstart] == 0x0d && buffer[streamstart + 1] == 0x0a) {
                streamstart += 2;
            } else if (buffer[streamstart] == 0x0a) {
                streamstart++;
            }

            if (buffer[streamend - 2] == 0x0d && buffer[streamend - 1] == 0x0a) {
                streamend -= 2;
            } else if (buffer[streamend - 1] == 0x0a) {
                streamend--;
            }

            size_t outsize = (streamend - streamstart) * 10;
            char *output = new char[outsize];

            z_stream zstrm;
            zstrm.zalloc = Z_NULL;
            zstrm.zfree = Z_NULL;
            zstrm.opaque = Z_NULL;
            zstrm.avail_in = (uint)(streamend - streamstart + 1);
            zstrm.avail_out = (uint)outsize;
            zstrm.next_in = (Bytef *)(buffer + streamstart);
            zstrm.next_out = (Bytef *)output;

            int rsti = inflateInit(&zstrm);

            if (rsti == Z_OK) {
                int rst2 = inflate(&zstrm, Z_FINISH);
                inflateEnd(&zstrm);

                if (rst2 >= 0) {
                    size_t totout = zstrm.total_out;

                    // search and replace text code here

                    size_t coutsize = (streamend - streamstart + 1) * 10;
                    char *coutput = new char[coutsize];

                    z_stream c_stream;
                    c_stream.zalloc = Z_NULL;
                    c_stream.zfree = Z_NULL;
                    c_stream.opaque = Z_NULL;
                    c_stream.total_out = 0;
                    c_stream.avail_in = (uint)totout;
                    c_stream.avail_out = (uint)coutsize;
                    c_stream.next_in = (Bytef *)output;
                    c_stream.next_out = (Bytef *)coutput;

                    rsti = deflateInit(&c_stream, Z_DEFAULT_COMPRESSION);

                    if (rsti == Z_OK) {
                        rsti = deflate(&c_stream, Z_FINISH);
                        deflateEnd(&c_stream);

                        if (rsti >= 0) {
                            [self saveFile:coutput len:c_stream.total_out fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];
                        }
                    }

                    delete[] coutput; coutput = 0;
                    [self saveFile:(char *)"\nendstr" len:7 fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];
                }
            }

            delete[] output; output = 0;
            buffer += streamend + 7;
            filelen = filelen - (streamend + 7);
        } else {
            morestreams = false;
        }
    }

    [self saveFile:buffer len:filelen fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];
}

fclose(pdfFile);

Best answer

Your assumption that the text can be found verbatim in the content stream is wrong.

Suppose you have a PDF with the content "Hello World". You could have a stream that looks like this:

q
BT
36 806 Td
0 -18 Td
/F1 12 Tf
(Hello World!)Tj
0 0 Td
ET
Q

But it could also look like this:

q
BT
/F1 12 Tf
88.66 367 Td
(ld) Tj
-22 0 Td
(Wor) Tj
-15.33 0 Td
(llo) Tj
-15.33 0 Td
(He) Tj
ET
Q

Your code will detect the word "Hello" in the first stream, but not in the second.

A PDF viewer will render both streams in exactly the same way: you will see "Hello World" at exactly the same position.
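To make the difference concrete, here is a minimal C++ sketch that runs a plain substring search over both streams (abbreviated to their BT ... ET text objects). The helper names `literalStrings` and `joined` and the constants `kStreamA` and `kStreamB` are hypothetical and exist only for this illustration. The parser assumes literal strings without escaped or nested parentheses, which real PDF content does not guarantee.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// The two example streams, reduced to their text objects (illustrative only).
const std::string kStreamA =
    "BT\n36 806 Td\n0 -18 Td\n/F1 12 Tf\n(Hello World!)Tj\n0 0 Td\nET\n";
const std::string kStreamB =
    "BT\n/F1 12 Tf\n88.66 367 Td\n(ld) Tj\n-22 0 Td\n(Wor) Tj\n"
    "-15.33 0 Td\n(llo) Tj\n-15.33 0 Td\n(He) Tj\nET\n";

// Collects the literal strings "(...)" of a content stream in stream order.
// Assumption: no escaped or nested parentheses inside the strings.
std::vector<std::string> literalStrings(const std::string& stream) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while ((pos = stream.find('(', pos)) != std::string::npos) {
        const std::size_t end = stream.find(')', pos + 1);
        if (end == std::string::npos) break;  // unterminated string: stop
        out.push_back(stream.substr(pos + 1, end - pos - 1));
        pos = end + 1;
    }
    return out;
}

// Concatenates the fragments in the order they appear in the stream.
std::string joined(const std::vector<std::string>& parts) {
    std::string s;
    for (const std::string& p : parts) s += p;
    return s;
}
```

With these definitions, `kStreamA.find("Hello")` succeeds while `kStreamB.find("Hello")` returns `std::string::npos`. Even joining the fragments of the second stream in stream order gives "ldWorlloHe", because the reading order is determined by the `Td` offsets and not by the order of the operators.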

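Sometimes strings are broken up into smaller pieces, and you will often find text arrays that are used to introduce kerning, and so on. All of this is standard practice in PDF.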
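As an illustration of such a text array: the `TJ` operator takes an array that mixes strings with numeric position adjustments, for example `[(He) -20 (llo) 15 ( World)] TJ`. The sketch below concatenates only the string elements of one such array. The function name `textOfTJArray` and the sample array are hypothetical, and the code makes simplifying assumptions: it handles literal strings only (no hex strings), does not decode escape sequences such as `\n` or octal codes, and ignores font encodings entirely.

```cpp
#include <cstddef>
#include <string>

// Concatenates the string elements of a single TJ array and skips the
// numeric kerning adjustments between them.
// Assumptions: literal strings only; a backslash keeps the next character
// as-is without decoding it; balanced nested parentheses are kept.
std::string textOfTJArray(const std::string& array) {
    std::string text;
    int depth = 0;  // nesting depth of '(' ... ')' inside a literal string
    for (std::size_t i = 0; i < array.size(); ++i) {
        const char c = array[i];
        if (depth == 0) {
            // Outside a string: numbers, whitespace and brackets are skipped.
            if (c == '(') depth = 1;
            continue;
        }
        if (c == '\\' && i + 1 < array.size()) {
            text += array[++i];  // escaped character, copied undecoded
        } else if (c == '(') {
            ++depth;
            text += c;
        } else if (c == ')') {
            if (--depth > 0) text += c;  // the outermost ')' ends the string
        } else {
            text += c;
        }
    }
    return text;
}
```

For the sample array above, the function returns "Hello World", whereas a plain search for "Hello" in the raw array text finds nothing. Replacing text reliably would also require rewriting the array so that the kerning values still make sense, which this sketch does not attempt.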

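PDF is not a format that lends itself to editing. I'm not saying it is impossible, but if you want to meet the requirement of replacing one string with another in a PDF stream, you will need a few weeks of additional programming.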

Regarding "c++ - Extracting text from a PDF using zlib", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17027211/
