Hi,

You should avoid Tika's language detector: it supports only about 15 
languages. Use the LangDetect library instead; it detects more languages by 
default and has higher accuracy. For both detectors you can create custom 
(better) profiles.
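
As a rough sketch, switching detectors is mostly a matter of changing the
processor class in the chain you posted. This assumes the langid contrib jar
and its dependencies are visible to Solr; the <lib> paths below follow the
stock example layout and are only an assumption, so adjust them for your
GlassFish deployment. All parameter values are carried over from your
configuration unchanged.

```xml
<!-- Assumed paths: point these at wherever contrib/ and dist/ live on your server -->
<lib dir="../../contrib/langid/lib" />
<lib dir="../../dist/" regex="apache-solr-langid-\d.*\.jar" />

<updateRequestProcessorChain name="langid">
  <!-- LangDetect-based factory in place of the Tika-based one -->
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <bool name="langid">true</bool>
      <str name="langid.fl">attr_content,attr_dw_title</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.langsField">language_all</str>
      <bool name="langid.map">true</bool>
      <!-- Fallback and threshold kept as in the original post -->
      <str name="langid.fallback">eu</str>
      <str name="langid.threshold">0.2</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```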

Cheers

 
 
-----Original message-----
> From:tomtom <tkaefferb...@conet.de>
> Sent: Tue 18-Sep-2012 17:27
> To: solr-user@lucene.apache.org
> Subject: poor language detection
> 
> Hi,
> 
> I've got a problem with language detection. There are about 120 documents in
> different languages to import, mostly Chinese, English, German and others.
> English and German are classified quite well, but Chinese, Japanese and
> others stray into a field 'fieldname_lt', i.e. Lithuanian.
> 
> Since many people on this list have good experience with language detection,
> my first question is:
> 
> Is something missing? apache-solr-langid-3.6.1.jar, and hence the
> 'langdetect-profiles', is not deployed on my GlassFish server; the
> deployed apache-solr-3.6.1.war doesn't contain this or the other libraries
> from the 'dist' directory.
> 
> 
> My configuration:
> 
>     <updateRequestProcessorChain name="langid">
>       <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
>         <lst name="defaults">
>           <bool name="langid">true</bool>
>           <str name="langid.fl">attr_content, attr_dw_title</str>
>           <str name="langid.langField">language_s</str>
>           <str name="langid.langsField">language_all</str>
>           <str name="langid.map">true</str>
>           <str name="langid.fallback">eu</str>
>           <str name="langid.threshold">0.2</str>
>         </lst>
>       </processor>
>       <processor class="solr.LogUpdateProcessorFactory" />
>       <processor class="solr.RunUpdateProcessorFactory" />
>     </updateRequestProcessorChain>
> 
> 
> Experimenting with the threshold doesn't change the results much. The
> fallback 'eu' field contains only numbers.
> The strange indexing distribution (as seen in Luke) is:
>   content_de    2,52%  (seems correct)
>   content_en   11,13%  (seems correct)
>   content_eu    0,5%   (fallback)
>   content_lt   35,5%   (not in any configuration file)
> 
> Looking at content_lt shows mostly Chinese, Japanese and other non-Latin
> content.
> 
> 
> Is this a known issue, or ignorance on my part?
> 
> 
> Thank you in advance!
> 
> sincerely, tom
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/poor-language-detection-tp4008624.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
