Sorry for the slow reply. I talked with upstream, and I think the language files are DFSG-compliant in both letter and spirit.

The language files are the product of machine learning, against a few hundred fonts and a bunch of symbolic text. The machine learning program already ships with Tesseract. The fonts already ship in Debian. The list of fonts is a little too long for this email, but I have it in hand and will check that it is documented appropriately along with the overall procedure. The symbolic text can be extracted from the language packages using tools that already ship with Tesseract [1].

I think it is technically inappropriate to run machine learning as part of the package building process, for two reasons. First, the machine learning process is very computationally expensive. Second, there are many labor-intensive manual steps involved. Upstream continues to work on tools to reduce the amount of labor.

Please let me know if this resolves your concerns.

Cheers,
Jeff

===
[1] The program combine_tessdata can extract individual components from the combined binary traineddata file. The program dawg2wordlist unpacks the binary dictionary (dawg) files back to their original input wordlist text files.
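
As a rough illustration of [1], here is a minimal Python sketch of how the two tools might be chained. The helper names (extraction_commands, run_extraction) are hypothetical, and the component file names (the "unicharset" and "word-dawg" suffixes) are an assumption about what a typical language pack unpacks to; other packs may contain different components.

```python
import shutil
import subprocess
from pathlib import Path


def extraction_commands(traineddata, out_dir):
    """Build the two command lines for one language pack; nothing is run here."""
    traineddata = Path(traineddata)
    lang = traineddata.name.split(".")[0]       # e.g. "eng" from eng.traineddata
    prefix = str(Path(out_dir) / lang) + "."    # components land at <out_dir>/<lang>.*
    # "combine_tessdata -u" unpacks every component using the given prefix.
    unpack = ["combine_tessdata", "-u", str(traineddata), prefix]
    # dawg2wordlist takes: unicharset file, dawg file, output wordlist file.
    # Assumption: the pack contains "unicharset" and "word-dawg" components.
    to_wordlist = [
        "dawg2wordlist",
        prefix + "unicharset",
        prefix + "word-dawg",
        prefix + "wordlist.txt",
    ]
    return unpack, to_wordlist


def run_extraction(traineddata, out_dir):
    """Run both steps, failing early if a tool is not installed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in extraction_commands(traineddata, out_dir):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
```

Calling run_extraction("eng.traineddata", "out") would, under those assumptions, leave the recovered word list in out/eng.wordlist.txt.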