poor language detection

tomtom Tue, 18 Sep 2012 08:23:48 -0700

Hi,

I've got a problem with language detection. There are about 120 documents in
different languages to import, mostly Chinese, English, German and others.
English and German are classified quite well, but Chinese, Japanese and
others stray into a field 'fieldname_lt', i.e. the one for Lithuanian.


Since many people on this list seem to have good experience with language
detection, my first question is:

Is there something missing? The apache-solr-langid-3.6.1.jar, and hence the
'langdetect-profiles', are not deployed on my GlassFish server; the
deployed apache-solr-3.6.1.war doesn't contain this or the other libraries
from the 'dist' directory.
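
For reference, the example solrconfig.xml shipped with Solr 3.6 pulls in the
langid contrib with <lib> directives along these lines. This is only a
sketch: the relative paths assume the stock example directory layout and are
not my actual GlassFish setup, so they would have to point to wherever the
jars really are.

    <!-- Assumed paths, taken from the stock example layout -->
    <!-- the langid contrib jar itself -->
    <lib dir="../../dist/" regex="apache-solr-langid-\d.*\.jar" />
    <!-- its dependencies from contrib/langid/lib -->
    <lib dir="../../contrib/langid/lib/" regex=".*\.jar" />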


My configuration:

    <updateRequestProcessorChain name="langid">
      <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
        <lst name="defaults">
          <bool name="langid">true</bool>

          <str name="langid.fl">attr_content, attr_dw_title</str>
          <str name="langid.langField">language_s</str>
          <str name="langid.langsField">language_all</str>
          <str name="langid.map">true</str>

          <str name="langid.fallback">eu</str>
          <str name="langid.threshold">0.2</str>
        </lst>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
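
A chain like this only runs for update requests that select it. A minimal
sketch of that wiring, assuming the documents are posted to the extracting
handler (the handler name and its use here are assumptions for illustration,
not copied from my solrconfig.xml):

    <!-- Hypothetical handler; only the update.chain default matters here -->
    <requestHandler name="/update/extract"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <!-- send documents through the "langid" chain defined above -->
        <str name="update.chain">langid</str>
      </lst>
    </requestHandler>

The same effect can be had per request by passing update.chain=langid as a
request parameter.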


Experimenting with the threshold doesn't change the results much. The
documents that end up under the fallback 'eu' contain only numbers.
The strange indexing distribution (seen in Luke) is:
  content_de    2.52%  (seems correct)
  content_en   11.13%  (seems correct)
  content_eu    0.5%   (fallback)
  content_lt   35.5%   (not in any configuration file)

Looking at content_lt shows mostly Chinese, Japanese and other "non-Latin"
content.


Is this a known issue, or ignorance on my part?


Thank you in advance!

sincerely, tom



--
View this message in context: 
http://lucene.472066.n3.nabble.com/poor-language-detection-tp4008624.html
Sent from the Solr - User mailing list archive at Nabble.com.
