
.net - Extracting complete hyphenated words from a .pdf using acrobat.tlb in .NET VB or C#

Reposted. Author: 行者123. Updated: 2023-12-04 12:45:04

I am parsing a .pdf with the acrobat.tlb library.

Hyphenated words are split onto separate lines and the hyphens are removed.

For example:
ABC-123-XXX-987

is parsed as:
ABC
123
XXX
987

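If I parse the text with iTextSharp, it returns the whole string exactly as it appears in the file, which is the behavior I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp does not put the highlight in the right place, hence acrobat.tlb.

I am using this code, taken from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf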

' filey = "*your full file name including directory here*"
AcroExchApp = CreateObject("AcroExch.App")
AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
' Open the [strfiley] pdf file
AcroExchAVDoc.Open(filey, "")

' Get the PDDoc associated with the open AVDoc
AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
sustext = "accessorizes"
suktext = "accessorises"
' get JavaScript Object
' note jso is related to PDDoc of a PDF,
jso = AcroExchPDDoc.GetJSObject
' count
nCount = 0
nCount1 = 0
gbStop = False
bUSCnt = False
bUKCnt = False
' search for the text
If Not jso Is Nothing Then
    ' total number of pages
    nPages = jso.numpages

    ' Go through pages
    For i = 0 To nPages - 1
        ' check each word in a page
        nWords = jso.getPageNumWords(i)
        For j = 0 To nWords - 1
            ' get a word
            word = Trim(CStr(jso.getPageNthWord(i, j)))
            'If VarType(word) = VariantType.String Then
            If word <> "" Then
                ' compare the word with what the user wants
                If Trim(sustext) <> "" Then
                    result = StrComp(word, sustext, vbTextCompare)
                    ' if same
                    If result = 0 Then
                        nCount = nCount + 1
                        If bUSCnt = False Then
                            iUSCnt = iUSCnt + 1
                            bUSCnt = True
                        End If
                    End If
                End If
                If suktext <> "" Then
                    result1 = StrComp(word, suktext, vbTextCompare)
                    ' if same
                    If result1 = 0 Then
                        nCount1 = nCount1 + 1
                        If bUKCnt = False Then
                            iUKCnt = iUKCnt + 1
                            bUKCnt = True
                        End If
                    End If
                End If
            End If
        Next j
    Next i
    jso = Nothing
End If

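The code does the job of highlighting text, but the FOR loop that fills the 'word' variable splits hyphenated strings into their component parts.
For i = 0 To nPages - 1
    ' check each word in a page
    nWords = jso.getPageNumWords(i)
    For j = 0 To nWords - 1
        ' get a word
        word = Trim(CStr(jso.getPageNthWord(i, j)))

Does anyone know how to keep the whole string intact using acrobat.tlb? My fairly extensive searching has come up empty.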

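Best answer

I can understand that highlighting text with iTextSharp is troublesome, because you have to draw a rectangle and it gets complicated, but the acrobat.tlb solution has its drawbacks too: it is not free, and few people will have it. For the rest of us, a better solution is Spire.Pdf, which is free and easy to use. You can get it as a NuGet package. The code does the following: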

  • Opens the .pdf
  • Reads the text of each page
  • Finds matches using a regular expression
  • Saves them to a list of strings, eliminating duplicates
  • For each string in this list, searches the page and highlights it


Code:
Imports System.Collections.Generic
Imports System.Drawing
Imports System.Linq
Imports System.Text.RegularExpressions
Imports Spire.Pdf
Imports Spire.Pdf.General.Find

Dim pdf As New PdfDocument("Path")
' [A-Z0-9] rather than [A-Z,0-9]: a comma inside a character class is a literal comma.
Dim pattern As String = "[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}-[A-Z0-9]{3}"

For Each page As PdfPageBase In pdf.Pages
    ' Get the text of the current page only (no accumulation across pages).
    Dim content As String = page.ExtractText()

    ' Find matches, collect them as strings, and eliminate duplicates.
    Dim matchList As List(Of String) = Regex.Matches(content, pattern).
        Cast(Of Match)().
        Select(Function(m) m.Value).
        Distinct().
        ToList()

    ' For each string in the list
    For Each serial As String In matchList
        ' Find all occurrences of the string on this page and highlight them.
        Dim result As PdfTextFind() = page.FindText(serial).Finds

        For Each find As PdfTextFind In result
            find.ApplyHighLight(Color.BlueViolet) ' you can set your color preference
        Next
    Next 'matchList
Next 'page

pdf.SaveToFile("New Path")

pdf.Close()
pdf.Dispose()
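
A note on the serial-number pattern: inside a character class a comma is a literal character, so a class written as [A-Z,0-9] also accepts commas, while [A-Z0-9] accepts only uppercase letters and digits. The sketch below is a minimal illustration of that difference using only System.Text.RegularExpressions; the module name, the FindSerials helper and the sample text are made up for the example, and the four-groups-of-three format is assumed from the ABC-123-XXX-987 example in the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module SerialPatternSketch
    ' Four groups of three uppercase letters or digits, separated by hyphens.
    ' \b keeps a match from starting or ending inside a longer alphanumeric run.
    Private Const SerialPattern As String = "\b[A-Z0-9]{3}(?:-[A-Z0-9]{3}){3}\b"

    ' Hypothetical helper: returns the distinct serial numbers found in pageText,
    ' in order of first appearance.
    Function FindSerials(pageText As String) As List(Of String)
        Return Regex.Matches(pageText, SerialPattern).
            Cast(Of Match)().
            Select(Function(m) m.Value).
            Distinct().
            ToList()
    End Function

    Sub Main()
        ' Made-up sample text: one serial repeated twice, plus a token containing a comma.
        Dim sample As String =
            "Unit ABC-123-XXX-987 shipped; ABC-123-XXX-987 returned; A,C-123-XXX-987 rejected."
        For Each serial As String In FindSerials(sample)
            Console.WriteLine(serial)
        Next
    End Sub
End Module
```

With this sample, only ABC-123-XXX-987 is printed, once; the token A,C-123-XXX-987 is skipped, whereas a class that includes the comma would have accepted it.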

I am not very good with regular expressions, so you may want to write your own. In any case, this is my approach.
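
If staying with acrobat.tlb is a requirement, one possible direction is to rebuild the string from the individual word tokens. The Acrobat JavaScript API documents an optional third argument to getPageNthWord, bStrip, which controls whether punctuation and whitespace are stripped from the returned word. The sketch below assumes that calling jso.getPageNthWord(i, j, False) leaves the trailing hyphen on each part ("ABC-", "123-", ...); this has not been verified against a real document here, and the JoinHyphenated helper and the token list are hypothetical stand-ins so the logic can run without Acrobat.

```vbnet
Imports System
Imports System.Collections.Generic

Module HyphenJoinSketch
    ' Hypothetical helper: rebuilds hyphenated strings from per-word tokens.
    ' Assumption: each token still carries its trailing punctuation,
    ' e.g. "ABC-", "123-", "XXX-", "987 ".
    Function JoinHyphenated(rawTokens As IEnumerable(Of String)) As List(Of String)
        Dim joined As New List(Of String)
        Dim pending As String = ""
        For Each raw As String In rawTokens
            Dim token As String = raw.Trim()
            If token = "" Then Continue For
            pending &= token
            ' A trailing hyphen means the next token belongs to the same string.
            If Not pending.EndsWith("-", StringComparison.Ordinal) Then
                joined.Add(pending)
                pending = ""
            End If
        Next
        ' Flush a dangling fragment (the input ended right after a hyphen).
        If pending <> "" Then joined.Add(pending)
        Return joined
    End Function

    Sub Main()
        ' Made-up tokens standing in for jso.getPageNthWord(i, j, False) results.
        Dim tokens As String() = {"Serial ", "ABC-", "123-", "XXX-", "987 ", "received"}
        Console.WriteLine(String.Join("|", JoinHyphenated(tokens)))
    End Sub
End Module
```

For the made-up tokens above this prints Serial|ABC-123-XXX-987|received. To highlight a rebuilt string, the word indices j that were merged would still need to be tracked so that each part can be highlighted individually.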

Regarding ".net - Extracting complete hyphenated words from a .pdf using acrobat.tlb in .NET VB or C#", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52291322/
