On 2/17/09 12:26 PM, "Grant Ingersoll" <gsing...@apache.org> wrote:
> If purchasing, several companies offer solutions, but I don't know
> that their quality is any better than what you can get through open
> source, as generally speaking, the problem is solved with a high
> degree of accuracy through n-gram analysis.

The expensive part of the problem is getting a good corpus in each language, tuning the classifier, and QA.

The commercial ones usually recognize encoding and language, which is more complicated. Sorting out the ISO-2022 codes is a real mess, for example. Pre-Unicode PDF files are also a horror. To do it right, you need to recognize which fonts are Central European, and so on.

wunder
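For anyone who wants to see what the n-gram analysis mentioned in the quote can look like, here is a minimal sketch. It assumes character trigrams and a rank-order ("out of place") profile comparison in the style of Cavnar and Trenkle, which is one common approach and not necessarily what any particular product or library does. The function names, the `SAMPLES` text, and the `top=300` cutoff are all made up for illustration; the tiny samples stand in for the large per-language corpus that is the expensive part.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def build_profile(text, n=3, top=300):
    """Map each of the `top` most frequent n-grams to its rank (0 = most frequent)."""
    ranked = [gram for gram, _ in char_ngrams(text, n).most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum rank differences; n-grams unseen in the language profile cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, profiles, n=3, top=300):
    """Return the language whose profile is closest to the text's profile."""
    doc = build_profile(text, n, top)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang], top))


# Toy training text, purely illustrative (hypothetical data, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog "
          "chases the fox through the woods",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "jagt der hund den fuchs durch den wald",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et puis "
          "le chien poursuit le renard dans la forêt",
}
PROFILES = {lang: build_profile(text) for lang, text in SAMPLES.items()}

print(guess_language("the dog and the fox", PROFILES))
```

Note that this works on already-decoded text. It does nothing about the harder part described above, which is working out the encoding before any language guess can be made.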