.net - How to give Tesseract a word list (.NET wrapper)


Reposted by: 行者123 Updated: 2023-12-03 07:27:59

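TL;DR version:

Does anyone have a working "bazaar" configuration for Tesseract, using the .NET wrapper, that I could look at?

I'm fairly sure this is what I want (recognizing only a few words from a list), but it doesn't seem to do anything.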


I have a very short list of possible strings (1-4 words each) that I'm trying to find. Tesseract's documentation states:

If you want to replace the whole dictionary, you will need to unpack the .traineddata file, create a new word-dawg file, and then pack the files back into a .traineddata file. See TrainingTesseract for more details.

This sounds like exactly what I want! So I looked at TrainingTesseract and found:

The traineddata file is simply a concatenation of the input files, with a table of contents that contains the offsets of the known file types. See ccutil/tessdatamanager.h in the source code for a list of the currently accepted filenames.

Great. So how do I unpack this simple concatenation of input files, modify the contents and the header, and repack it? :)

This post seems to address the same problem. It involves simply turning off the default dictionaries and using user words instead:

let’s suppose you want to OCR in English, but suppress the normal dictionary and load an alternative word list and an alternative list of patterns; these two files are the most commonly used extra data files.

If your language pack is in /path/to/eng.traineddata and the hocr config is in /path/to/configs/hocr then create three new files:

/path/to/eng.user-words: -snip

/path/to/eng.user-patterns: -snip

/path/to/configs/bazaar: -snip

Now, if you pass the word bazaar as a trailing command line parameter to Tesseract, Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng.user-words and eng.user-patterns files you provided. The former is a simple word list, one per line. The format of the latter is documented in dict/trie.h on read_pattern_list().

But after doing this, it makes no difference at all!

I am creating the engine like this:

using (engine = new TesseractEngine(@"C:\src\x\tessdata", "eng", EngineMode.Default, @"C:\src\x\tessdata\engine.config"))

and made the file engine.config (UTF-8, Unix line endings):

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns

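I also created the eng.user-patterns and eng.user-words files (UTF-8, Unix line endings) alongside eng.traineddata.

Best answer

Did you figure it out?

It looks like there is a way to increase its preference for dictionary words: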

https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-increase-the-trust-instrength-of-the-dictionary

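How to increase the trust in/strength of the dictionary?

For tesseract-ocr < 3.01, try increasing NON_WERD and GARBAGE_STRING in dict/permute.cpp to 3 or even 5.

For tesseract-ocr >= 3.01, try increasing the variables language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word in a config file. By default they are 0.1 and 0.15, respectively.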
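
As a rough illustration of the FAQ suggestion, the two penalties could be raised in the same kind of config file that is passed to the engine (for example the engine.config from the question). The values below are arbitrary assumptions, chosen only because they are larger than the quoted defaults. They are not recommended settings and would need tuning against real images.

```
# Illustrative values only (assumption); the FAQ quotes defaults of 0.1 and 0.15.
language_model_penalty_non_freq_dict_word 0.5
language_model_penalty_non_dict_word      0.75
```

Larger penalties should make the recognizer less willing to accept words that are not in its dictionaries, so this probably only helps if the dictionary or user word list is actually being loaded.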

Regarding ".net - How to give Tesseract a word list (.NET wrapper)", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40127994/
