I was looking for such a tool and haven't found one yet.
With StandardAnalyzer one can obtain a token stream that is usable for language-agnostic analysis. Something that looks up those tokens in per-language dictionaries and picks the language with the most matches could then do a decent job of choosing the analyzer.

Does such a tool exist?
It doesn't seem too hard for Lucene.
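
To make the idea concrete, here is a minimal sketch of the majority vote in Python. It is only an illustration under stated assumptions: the regex tokenizer stands in for the token stream StandardAnalyzer would produce, the tiny word lists are hypothetical placeholders for real dictionaries, and the names (DICTIONARIES, tokenize, guess_language) are made up for this example and are not part of Lucene or Solr.

```python
import re
from collections import Counter

# Hypothetical toy word lists; a real setup would load full
# dictionaries or stopword lists for each language.
DICTIONARIES = {
    "en": {"the", "and", "is", "of", "to", "in", "it"},
    "fr": {"le", "la", "et", "est", "de", "les", "un", "une"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    This is a simple stand-in for a language-agnostic token stream.
    """
    return re.findall(r"\w+", text.lower())


def guess_language(text, default=None):
    """Return the language whose dictionary matches the most tokens.

    Returns `default` when no token matches any dictionary.
    """
    votes = Counter()
    for token in tokenize(text):
        for lang, words in DICTIONARIES.items():
            if token in words:
                votes[lang] += 1
    if not votes:
        return default
    return votes.most_common(1)[0][0]
```

The guessed language code could then be used to pick the analyzer (or the field type) for the document. Short texts and closely related languages will likely need larger dictionaries than this to vote reliably.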

paul


On 17 Feb 2009, at 04:44, Otis Gospodnetic wrote:

The best option would be to identify the language after parsing the PDF and then index it using an appropriate analyzer defined in schema.xml.
