
ocr - Tesseract OCR text order for documents with tables or rows

Reposted. Author: 行者123. Updated: 2023-12-04 14:26:53

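I am using Tesseract OCR to convert scanned PDFs into plain text. Overall it works very well, but I have a problem with the order in which the text is read. Documents with tabular data seem to be scanned column by column, top to bottom, when reading row by row would be more natural. A very small example: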

This is column A, row 1   This is column B, row 1   This is column C, row 1
This is column A, row 2   This is column B, row 2   This is column C, row 2

produces the following text:
This is column A, row 1
This is column A, row 2
This is column B, row 1
This is column B, row 2
This is column C, row 1
This is column C, row 2
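
To make the two reading orders concrete, here is a minimal Python sketch. It only illustrates the difference between row order and column order on the example table; it says nothing about how Tesseract segments a page internally, and the function names are hypothetical.

```python
# Cells of the example table, stored row by row.
table = [
    ["This is column A, row 1", "This is column B, row 1", "This is column C, row 1"],
    ["This is column A, row 2", "This is column B, row 2", "This is column C, row 2"],
]


def row_order(rows):
    """Read the cells left to right, one row at a time (the desired order)."""
    return [cell for row in rows for cell in row]


def column_order(rows):
    """Read the cells top to bottom, one column at a time (the observed order)."""
    return [cell for column in zip(*rows) for cell in column]
```

Calling `column_order(table)` gives the sequence shown in the output above, while `row_order(table)` gives the sequence the question is asking for.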

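I have started reading the documentation and am taking a guess-and-test, brute-force approach with the parameters documented here, but if anyone has already tackled a similar problem I would appreciate any insight into fixing it. It might also have something to do with training data, but I am not sure how that works.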

Best answer

Try running tesseract in one of the single-column Page Segmentation Modes:
tesseract input.tif output-filename --psm 6
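
If the conversion is driven from a script, the same call can be wrapped in a few lines of Python. This is a minimal sketch, assuming the `tesseract` binary is on the PATH and accepts `--psm` as in the command above; the helper names `build_tesseract_command` and `ocr_to_text` are hypothetical and not part of Tesseract.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, psm=6):
    """Return the argument list for a tesseract call with the given --psm value."""
    return ["tesseract", str(image_path), str(output_base), "--psm", str(psm)]


def ocr_to_text(image_path, output_base, psm=6):
    """Run tesseract and return the recognized text.

    tesseract writes its plain-text result to <output_base>.txt.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, psm), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Example call (needs tesseract installed and an existing input.tif):
# text = ocr_to_text("input.tif", "output-filename", psm=6)
```

Keeping the command builder separate from the runner makes it easy to try other `--psm` values on the same page and compare the resulting text order.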

By default Tesseract expects a page of text when it segments an image. If you're just seeking to OCR a small region try a different segmentation mode, using the -psm argument [ed: --psm in Tesseract 4 and later, as in the command above]. Note that adding a white border to text which is too tightly cropped may also help, see issue 398.

To see a complete list of supported page segmentation modes, use tesseract -h. Here's the [ed: excerpt only] list as of 3.21:

  3. Fully automatic page segmentation, but no OSD. (Default)
  4. Assume a single column of text of variable sizes.
  5. Assume a single uniform block of vertically aligned text.
  6. Assume a single uniform block of text.


See the examples here: #using-different-page-segmentation-modes

This question, "ocr - Tesseract OCR text order for documents with tables or rows", originally appeared on Stack Overflow: https://stackoverflow.com/questions/29087739/
