Hi,
I've got a problem with language detection. I have about 120 documents in
different languages to import, mostly Chinese, English, German and others.
English and German are classified quite well, but Chinese, Japanese and
others end up in a field 'fieldname_lt', i.e. Lithuanian.
Since many people on this list have good experience with language detection, my
first question is:
is something missing from my deployment? apache-solr-langid-3.6.1.jar, and hence
the 'langdetect-profiles', are not deployed to my GlassFish server; the
deployed apache-solr-3.6.1.war doesn't contain this or the other libraries from
the 'dist' directory.
My configuration:
true
attr_content, attr_dw_title
language_s
language_all
true
eu
0.2
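For readability, here is a sketch of how these values would typically appear in solrconfig.xml. The listing above carries only the values, so the parameter names are assumptions based on the standard langid.* parameters. The processor class, the chain name "langid", and which of the two 'true' values belongs to langid versus langid.map are assumptions as well:

```xml
<!-- Sketch only: parameter names, processor class and chain name are assumed;
     only the values come from the configuration listed above. -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <!-- enable language identification -->
    <bool name="langid">true</bool>
    <!-- input fields used for detection -->
    <str name="langid.fl">attr_content,attr_dw_title</str>
    <!-- field receiving the detected language code -->
    <str name="langid.langField">language_s</str>
    <!-- multi-valued field receiving all detected language codes -->
    <str name="langid.langsField">language_all</str>
    <!-- map input fields to language-specific fields such as fieldname_de -->
    <bool name="langid.map">true</bool>
    <!-- language code used when detection fails or is below the threshold -->
    <str name="langid.fallback">eu</str>
    <!-- minimum detection certainty -->
    <float name="langid.threshold">0.2</float>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```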
Experimenting with the threshold doesn't change the results much. The
fallback 'eu' only contains numbers.
The strange indexing distribution (seen in Luke) is:
content_de   2,52% (seems correct)
content_en  11,13% (seems correct)
content_eu   0,5%  (fallback)
content_lt  35,5%  (not in any configuration file)
Looking at content_lt shows mostly Chinese, Japanese and other "non-Latin"
content.
Is this a known issue, or ignorance on my part?
Thank you in advance!
sincerely, tom
--
View this message in context:
http://lucene.472066.n3.nabble.com/poor-language-detection-tp4008624.html
Sent from the Solr - User mailing list archive at Nabble.com.