Bug#699609: tesseract-ocr: please provide source for language files

Jeff Breidenbach Fri, 31 May 2013 13:52:04 -0700

Sorry for the slow reply. I talked with upstream, and I think the
language files are DFSG compliant both in letter and spirit.


The language files are the product of machine learning, against
a few hundred fonts and a bunch of symbolic text. The machine
learning program already ships with Tesseract. The fonts
already ship in Debian. The list of fonts is a little too long for
this email, but I have it in hand and will check that it is
documented appropriately, along with the overall procedure.
The symbolic text can be extracted from the language
packages using tools that already ship with Tesseract [1].

I think it is technically inappropriate to run machine learning as
part of the package building process, for two reasons. First, the
machine learning process is very computationally expensive.
Second, there are many labor-intensive manual steps involved.
Upstream continues to work on tools to reduce the amount of
labor.

Please let me know if this resolves your concerns.

Cheers,
Jeff

===

[1] The program combine_tessdata can extract individual components
from the combined binary traineddata file. The program dawg2wordlist
unpacks the binary dictionary (dawg) files back to their original
input wordlist text files.
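
As a rough illustration of footnote [1], here is a minimal Python sketch
that builds the two command lines. It is not part of the packaging. It
assumes the usual invocations "combine_tessdata -u <traineddata> <prefix>"
and "dawg2wordlist <unicharset> <dawg> <wordlist>". The helper name
extraction_commands, the output file name wordlist.txt, and the unpacked
component names (<lang>.unicharset, <lang>.word-dawg) are assumptions for
the example; the components actually present vary by language file and
Tesseract version.

```python
from pathlib import Path


def extraction_commands(traineddata, workdir):
    """Build argv lists for unpacking a traineddata file and dumping its wordlist.

    Hypothetical helper: it only constructs the commands and runs nothing.
    """
    traineddata = Path(traineddata)
    # "eng.traineddata" -> "eng"
    lang = traineddata.name.split(".")[0]
    # combine_tessdata -u writes each component as <prefix><component>,
    # so the prefix ends with a dot, e.g. "out/eng."
    prefix = str(Path(workdir) / lang) + "."

    unpack = ["combine_tessdata", "-u", str(traineddata), prefix]

    # Assumed component names; check what the unpack step actually produced.
    dump_words = [
        "dawg2wordlist",
        prefix + "unicharset",
        prefix + "word-dawg",
        prefix + "wordlist.txt",
    ]
    return [unpack, dump_words]
```

To run the steps, pass each list to subprocess.run(argv, check=True) on a
machine where the Tesseract training tools are installed.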
