.net - Search for text in a PDF with iText7 and get back the whole box of text

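VB2017 with iText7. I am looking for a way to search a PDF for a key text. When I find the key text, I want to return all of the text in the box that contains it.

For example, in this PDF I look for the key text "usable length" and want to get back the text "Rwy 33 PAPI-L, usable length, notes." from the box where it was found.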

Here is what I have so far (based on this). I would appreciate any advice or suggestions on the approach:

    Imports System.Collections.Generic
    Imports System.IO
    Imports System.Linq
    Imports System.Text
    Imports System.Text.RegularExpressions
    Imports iText.Kernel.Pdf
    Imports iText.Kernel.Pdf.Canvas.Parser
    Imports iText.Kernel.Pdf.Canvas.Parser.Listener

    Public Function FindTextInPdfFile(ByVal fileName As String, ByVal searchText As String, ByVal IsCaseSensitive As Boolean) As List(Of String)
        'basic checks
        If String.IsNullOrWhiteSpace(fileName) Then Return Nothing
        If String.IsNullOrWhiteSpace(searchText) Then Return Nothing
        If Not File.Exists(fileName) Then Return Nothing

        'set up the regex to use or not use case sensitivity in the match;
        'Regex.Escape makes the search text match literally
        Dim pattern As String = String.Format("({0})", Regex.Escape(searchText))
        Dim options As RegexOptions = If(IsCaseSensitive, RegexOptions.None, RegexOptions.IgnoreCase)
        Dim searchRegex As New Regex(pattern, options)

        'temp buffer for the text of every page that contains the search text
        Dim buffBasic As New StringBuilder

        'open the PDF and do a basic search for the text in each page. for each page where we detect
        'the search item we add that page's text to the temp buffer.
        Using pdfReader As PdfReader = New PdfReader(fileName)
            Using pdfDocument As PdfDocument = New PdfDocument(pdfReader)
                For pageNum As Integer = 1 To pdfDocument.GetNumberOfPages
                    'use a fresh strategy per page; a reused strategy keeps the text of earlier pages
                    Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy
                    Dim page As PdfPage = pdfDocument.GetPage(pageNum)
                    Dim currentPageText As String = PdfTextExtractor.GetTextFromPage(page, strategy)

                    If searchRegex.IsMatch(currentPageText) Then
                        'the page text has lines of text separated by an LF
                        buffBasic.Append(currentPageText & ControlChars.Lf)
                    End If
                Next pageNum
            End Using
        End Using

        'the buffer should have lines of text separated by an LF
        Dim linesBasic As List(Of String) = buffBasic.ToString.Split(ControlChars.Lf).ToList
        Dim linesMatch As List(Of String) = linesBasic.FindAll(Function(x) searchRegex.IsMatch(x))
        Debug.Print("match count={0}", linesMatch.Count)
        For Each line In linesMatch
            Debug.Print("line={0}", line)
        Next line

        Return linesMatch
    End Function

Testing this on the sample PDF gives:

    FindTextInPdfFile(pdf, "usable length", True)
    match count=1
    line=Rwy 33 PAPI-L, usable length, notes.

Best answer

Where page is a PdfPage:

    /// <summary>
    /// determines if this document contains the provided text
    /// </summary>
    /// <param name="find">string</param>
    /// <param name="caseSensitive">bool</param>
    /// <returns>bool</returns>
    public bool Contains(string find, bool caseSensitive = true)
    {
        string content = PdfTextExtractor.GetTextFromPage(page);
        if (string.IsNullOrEmpty(content))
        {
            return false;
        }
        return content.IndexOf(find, caseSensitive ? StringComparison.InvariantCulture : StringComparison.InvariantCultureIgnoreCase) > -1;
    }
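
The method above only reports whether the page contains the text, while the question asks for the surrounding text back. A minimal sketch of one way to bridge that gap, assuming the "box" comes out of the extractor as a single LF-separated line (as in the question's test output; this depends on the PDF): split the extracted page text into lines and keep those that contain the search string. `PageTextSearch` and `FindLines` are hypothetical names, not part of iText, and the sample string in `Main` is a made-up stand-in for the result of `PdfTextExtractor.GetTextFromPage(page)`.

```csharp
using System;
using System.Collections.Generic;

public static class PageTextSearch
{
    // Hypothetical helper (not an iText API): returns every line of the
    // extracted page text that contains the search string.
    public static List<string> FindLines(string pageText, string find, bool caseSensitive = true)
    {
        var matches = new List<string>();
        if (string.IsNullOrEmpty(pageText) || string.IsNullOrEmpty(find))
        {
            return matches;
        }

        // Same comparison choice as the Contains method above.
        StringComparison comparison = caseSensitive
            ? StringComparison.InvariantCulture
            : StringComparison.InvariantCultureIgnoreCase;

        // Assumption: lines are separated by LF; drop a trailing CR if present.
        foreach (string rawLine in pageText.Split('\n'))
        {
            string line = rawLine.TrimEnd('\r');
            if (line.IndexOf(find, comparison) > -1)
            {
                matches.Add(line);
            }
        }
        return matches;
    }

    public static void Main()
    {
        // Made-up stand-in for PdfTextExtractor.GetTextFromPage(page).
        string pageText = "first line\nRwy 33 PAPI-L, usable length, notes.\nlast line";
        foreach (string line in FindLines(pageText, "usable length", caseSensitive: false))
        {
            Console.WriteLine(line);
        }
    }
}
```

If a box spans several extracted lines, matching on lines will return only part of it; in that case the text positions would have to be taken into account, which this sketch does not attempt.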

Regarding .net - Search for text in a PDF with iText7 and get back the whole box of text, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56942874/
