I was looking for such a tool and haven't found one yet. Using StandardAnalyzer, one can obtain a token stream that can be used for "agnostic analysis". Something that matches those tokens against per-language dictionaries and picks the language with the most matches could then do a decent job of choosing the analyzer.
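To make the dictionary-majority idea concrete, here is a rough sketch in Python rather than against the Lucene API. The word lists, the `tokenize` helper, and the `guess_language` function are all hypothetical stand-ins for illustration: a real setup would take tokens from StandardAnalyzer and use much larger dictionaries.

```python
import re
from collections import Counter

# Hypothetical, tiny word lists; real dictionaries would be far larger.
DICTIONARIES = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "fr": {"le", "la", "et", "est", "de", "les", "un", "une", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}

def tokenize(text):
    # Rough stand-in for a language-agnostic token stream:
    # lowercase runs of letters, ignoring digits and punctuation.
    return re.findall(r"[^\W\d_]+", text.lower())

def guess_language(text, default="en"):
    # Each token votes for every language whose dictionary contains it.
    votes = Counter()
    for token in tokenize(text):
        for lang, words in DICTIONARIES.items():
            if token in words:
                votes[lang] += 1
    if not votes:
        # No dictionary word seen: fall back to an assumed default.
        return default
    # The language with the most dictionary hits wins.
    return votes.most_common(1)[0][0]
```

The returned code could then be used to select a per-language field or analyzer at indexing time. Short texts and closely related languages will likely need more than simple word counts.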
Does such a tool exist? It doesn't seem too hard to do with Lucene.

paul

On 17 Feb 2009, at 04:44, Otis Gospodnetic wrote:
The best option would be to identify the language after parsing the PDF and then index it using an appropriate analyzer defined in schema.xml.