ocr - Problem training tesseract-OCR 4 - Empty shapetable

Reposted. Author: 行者123. Updated: 2023-12-04 03:46:09
I am trying to train Tesseract 4 on my own images (to read a multimeter with a 7-segment display).

Note that I am aware of the trained data from Arthur Augusto at https://github.com/arturaugusto/display_ocr, but I need to train Tesseract on my own data.

To train Tesseract, I followed several tutorials, such as https://robipritrznik.medium.com/recognizing-vehicle-license-plates-on-images-using-tesseract-4-ocr-with-custom-trained-models-4ba9861595e7 and https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/

But whenever I run the shapeclustering command on my own data, I run into a problem

(with the sample data from https://github.com/tesseract-ocr/tesseract/issues/1174#issuecomment-338448972 everything works fine).

Specifically, when I run the shapeclustering command it produces this output (screenshot), and afterwards my shapetable is empty, so the training is not effective...

With the sample data it works fine and the shapetable is populated properly.

I suspect the problem is in how I generate the box files. This is my process: I generate the box file with

tesseract imageFileName.tif imageFileName  batch.nochop makebox

and then edit it with jTessBoxEditor.

So I can't see what is wrong with my .box/.tif data pairs.
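Since the question is whether the .box files are well formed, a quick sanity check may help rule that out. The sketch below is only an illustration and not part of the original post: it assumes the usual Tesseract box layout of one glyph per line, `<symbol> <left> <bottom> <right> <top> <page>`, and the function name `check_box_lines` is hypothetical.

```python
from typing import List, Tuple


def check_box_lines(lines: List[str]) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for each line that does not look like
    '<symbol> <left> <bottom> <right> <top> <page>'. Assumed layout."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        # Split from the right so the five numeric fields are isolated
        # and whatever remains on the left is treated as the symbol.
        fields = line.rsplit(None, 5)
        if len(fields) != 6:
            problems.append((number, "expected 6 fields"))
            continue
        try:
            left, bottom, right, top, page = (int(f) for f in fields[1:])
        except ValueError:
            problems.append((number, "non-integer coordinate"))
            continue
        # Box coordinates use a bottom-left origin, so bottom < top.
        if left >= right or bottom >= top:
            problems.append((number, "empty or inverted box"))
    return problems


if __name__ == "__main__":
    sample = ["8 10 20 30 40 0", "7 50 20 40 40 0", "bad line"]
    for number, reason in check_box_lines(sample):
        print(f"line {number}: {reason}")
```

For a real file, the lines could be read with `open("sev7.exp0.box", encoding="utf-8").readlines()`. A clean result only says the file is syntactically plausible; it does not show that the boxes match what Tesseract segments on the image.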

Have a nice day, and thank you for your help.
Adrian

Here is the complete batch script I use for training after generating and editing the box file.

set name=sev7.exp0
set shortName=sev7

echo Run Tesseract for Training..
tesseract.exe %name%.tif %name% nobatch box.train

echo Compute the Character Set..
unicharset_extractor.exe %name%.box

shapeclustering -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
mftraining -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
echo Clustering..
cntraining.exe %name%.tr
echo Rename Files..
rename normproto %shortName%.normproto
rename inttemp %shortName%.inttemp
rename pffmtable %shortName%.pffmtable
rename shapetable %shortName%.shapetable
echo Create Tessdata..
combine_tessdata.exe %shortName%.
echo. & pause
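The batch script above runs every step regardless of whether the previous one produced anything useful. If the .tr file from the box.train step is missing or empty, the later clustering steps have nothing to work with. As a hedged sketch (not from the original post), a small Python helper could report missing or empty intermediate files; the name `missing_or_empty` is hypothetical, and the file names are the ones the script above is expected to produce.

```python
import os
from typing import Iterable, List


def missing_or_empty(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist as files or have zero size."""
    return [p for p in paths
            if not os.path.isfile(p) or os.path.getsize(p) == 0]


if __name__ == "__main__":
    # Intermediate files named in the batch script (before the renames).
    expected = ["sev7.exp0.tr", "unicharset", "shapetable",
                "inttemp", "pffmtable", "normproto"]
    for path in missing_or_empty(expected):
        print(f"missing or empty: {path}")
```

Run from the training directory right before the rename step. A non-empty file is not proof of good training data, so this only catches the most obvious failures.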

Best answer

OK, I finally managed to train Tesseract.

The solution was to add a --psm parameter to the command

tesseract.exe %name%.tif %name% nobatch box.train

changing it to

tesseract.exe %name%.%typeFile% %name%  --psm %psm% nobatch box.train

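Here %typeFile% and %psm% are batch variables holding the image file extension and the page segmentation mode. For reference, the possible psm values are: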

REM pagesegmode values are:

REM 0 = Orientation and script detection (OSD) only.
REM 1 = Automatic page segmentation with OSD.
REM 2 = Automatic page segmentation, but no OSD, or OCR
REM 3 = Fully automatic page segmentation, but no OSD. (Default)
REM 4 = Assume a single column of text of variable sizes.
REM 5 = Assume a single uniform block of vertically aligned text.
REM 6 = Assume a single uniform block of text.
REM 7 = Treat the image as a single text line.
REM 8 = Treat the image as a single word.
REM 9 = Treat the image as a single word in a circle.
REM 10 = Treat the image as a single character.
REM 11 = Sparse text. Find as much text as possible in no particular order.
REM 12 = Sparse text with OSD.
REM 13 = Raw line. Treat the image as a single text line bypassing hacks that are Tesseract-specific.

Found at https://github.com/tesseract-ocr/tesseract/issues/434
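To make the fix concrete, here is a minimal Python sketch that assembles the same box.train command with an explicit --psm value. It is an illustration only: the helper names `box_train_command` and `run_box_train` are hypothetical, and the value 7 in the example is an arbitrary choice, since the answer does not say which mode was actually used.

```python
import subprocess
from typing import List


def box_train_command(image: str, base: str, psm: int,
                      exe: str = "tesseract") -> List[str]:
    """Build the argument list for the box.train step with --psm set."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return [exe, image, base, "--psm", str(psm), "nobatch", "box.train"]


def run_box_train(image: str, base: str, psm: int) -> int:
    """Run the command and return Tesseract's exit code.
    Requires tesseract to be installed and on PATH."""
    return subprocess.run(box_train_command(image, base, psm),
                          check=False).returncode


if __name__ == "__main__":
    # psm=7 is only an example value, not the one from the answer.
    print(" ".join(box_train_command("sev7.exp0.tif", "sev7.exp0", 7)))
```

Trying a few candidate modes and checking the size of the resulting .tr file each time is one way to find a value that suits a particular kind of image.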

A similar question on this topic (ocr - Problem training tesseract-OCR 4 - Empty shapetable) can be found on Stack Overflow: https://stackoverflow.com/questions/65116781/
