On 2/17/09 12:26 PM, "Grant Ingersoll" <gsing...@apache.org> wrote:

> If purchasing, several companies offer solutions, but I don't know
> that their quality is any better than what you can get through open
> source, as generally speaking, the problem is solved with a high
> degree of accuracy through n-gram analysis.
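
For anyone who has not seen the n-gram approach, here is a minimal sketch of it: build a character-trigram count profile per language, then pick the language whose profile is closest to the document's. The function names, the two one-sentence training samples, and the choice of cosine similarity are all my own assumptions for illustration; a real classifier needs a proper corpus per language.

```python
import math
from collections import Counter

def trigram_counts(text):
    """Count character trigrams after lowercasing and collapsing whitespace."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training text (hypothetical); far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog runs after the fox through the field",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "dann läuft der hund dem fuchs durch das feld nach",
}
PROFILES = {lang: trigram_counts(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    """Return the language code whose profile is most similar to text."""
    doc = trigram_counts(text)
    return max(PROFILES, key=lambda lang: cosine(doc, PROFILES[lang]))

print(guess_language("the dog and the fox"))
print(guess_language("der hund und der fuchs"))
```

The code is the easy part; the profiles are only as good as the text they were built from.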

The expensive part of the problem is getting a good corpus in each
language, tuning the classifier, and QA. The commercial ones usually
recognize both encoding and language, which is a harder problem than
language alone. Sorting out the ISO-2022 escape codes is a real mess,
for example.
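
To give a flavor of the ISO-2022 part, here is a rough sketch that looks for a few well-known designator escape sequences in a byte stream. The function name and the returned labels are my own choices, the table is deliberately incomplete, and it ignores everything that makes the real problem messy (extensions, shift states, broken producers).

```python
# A handful of ISO-2022 designator escape sequences and the encoding
# family each one suggests. Deliberately incomplete.
ESCAPES = [
    (b"\x1b$)C", "ISO-2022-KR"),  # KS X 1001 designated to G1
    (b"\x1b$)A", "ISO-2022-CN"),  # GB 2312 designated to G1
    (b"\x1b$)G", "ISO-2022-CN"),  # CNS 11643 plane 1 designated to G1
    (b"\x1b$B", "ISO-2022-JP"),   # JIS X 0208-1983
    (b"\x1b$@", "ISO-2022-JP"),   # JIS X 0208-1978
    (b"\x1b(J", "ISO-2022-JP"),   # JIS X 0201 Roman
]

def sniff_iso2022(data):
    """Return a guessed ISO-2022 variant for data, or None if no known
    designator is present. Chooses the sequence that appears earliest."""
    best = None
    for seq, label in ESCAPES:
        pos = data.find(seq)
        if pos != -1 and (best is None or pos < best[0]):
            best = (pos, label)
    return best[1] if best else None

print(sniff_iso2022("日本語".encode("iso2022_jp")))
print(sniff_iso2022(b"plain ASCII text"))
```

A hit only narrows down the encoding; the language still has to be identified separately.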

Pre-Unicode PDF files are also a horror. To do it right, you need
to recognize which fonts are Central European, and so on.

wunder
