Hi, I've got a problem with language detection. I have about 120 documents in different languages to import, mostly Chinese, English and German, plus a few others. English and German are classified quite well, but Chinese, Japanese and others end up in a field 'fieldname_lt', i.e. the field for Lithuanian.
Since many people on this list seem to have good experience with language detection, my first question is: is something missing from my setup? apache-solr-langid-3.6.1.jar, and hence the 'langdetect-profiles', is not deployed to my GlassFish server; the deployed apache-solr-3.6.1.war doesn't contain it or the other libraries from the 'dist' directory.

My configuration:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <bool name="langid">true</bool>
      <str name="langid.fl">attr_content, attr_dw_title</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.langsField">language_all</str>
      <str name="langid.map">true</str>
      <str name="langid.fallback">eu</str>
      <str name="langid.threshold">0.2</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Experimenting with the threshold doesn't change the results much. The fallback 'eu' ends up containing only numbers.

The strange distribution of indexed fields (as seen in Luke) is:

content_de   2,52%  (seems correct)
content_en  11,13%  (seems correct)
content_eu   0,5%   (fallback)
content_lt  35,5%   (not in any configuration file)

Looking at content_lt shows mostly Chinese, Japanese and other non-Latin content.

Is this a known issue, or ignorance on my part?

Thank you in advance!

Sincerely,
tom

--
View this message in context: http://lucene.472066.n3.nabble.com/poor-language-detection-tp4008624.html
Sent from the Solr - User mailing list archive at Nabble.com.