
c# - How do I extract text from a PDF and decode the characters?

Repost. Author: 太空宇宙. Updated: 2023-11-03 11:11:15

I am using iTextSharp to extract text from a PDF document with the following code:

    public static bool does_document_text_have_keyword(string keyword,
        string pdf_src, Report report_object) // TEST
    {
        try
        {
            PdfReader pdfReader = new PdfReader(pdf_src);
            string currentText;
            int count = pdfReader.NumberOfPages;
            for (int page = 1; page <= count; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                currentText = Encoding.UTF8.GetString(
                    ASCIIEncoding.Convert(
                        Encoding.Default,
                        Encoding.UTF8,
                        Encoding.Default.GetBytes(currentText)));

                report_object.log(currentText); // TEST

                if (currentText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) != -1)
                    return true;
            }
            pdfReader.Close();
            return false;
        }
        catch
        {
            return false;
        }
    }

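The problem is that the extracted text has no spaces, as if every space had been replaced with an empty string. The PDF document itself does contain spaces. Does anyone know what is going on here?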

Best answer

I believe your problem is SimpleTextExtractionStrategy. From the API documentation at http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/SimpleTextExtractionStrategy.html:

If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.

Try using LocationTextExtractionStrategy instead. Its documentation says:

A text extraction renderer that keeps track of relative position of text on page. The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
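As a rough illustration of that suggestion, here is a minimal sketch of the question's method with the strategy swapped. It assumes iTextSharp 5.x; the class and method names (PdfKeywordSearch, DocumentContainsKeyword) are hypothetical, and the Report logging from the question is left out. The sketch also drops the Encoding.Default round trip: GetTextFromPage already returns a .NET string, and on .NET Framework, where Encoding.Default is an ANSI code page, that round trip can only lose characters the code page lacks. Whether this restores the spaces for a particular PDF depends on how that file positions its text, so treat it as something to try rather than a guaranteed fix.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfKeywordSearch
{
    // Hypothetical helper: returns true if any page of the PDF contains the keyword.
    public static bool DocumentContainsKeyword(string keyword, string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text from earlier pages
                // is not carried over into this page's result.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                // No re-encoding step: the extracted text is already a Unicode string.
                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                    return true;
            }
            return false;
        }
        finally
        {
            // Runs on every exit path, including the early return above.
            reader.Close();
        }
    }
}
```

Unlike the question's code, this version lets exceptions propagate instead of returning false from an empty catch, so a missing or unreadable file is not mistaken for "keyword not found". It also closes the reader on the early return, which the original skips.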

Regarding "c# - How do I extract text from a PDF and decode the characters?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/13976233/
