Hi, Tanguy, 


>For the other implementation
>( http://code.google.com/p/language-detection/ ), it seems to be
>performing a first pass on the input, trying to separate Latin
>characters from the others. If there are more non-Latin characters
>than Latin ones, then it will process only the non-Latin characters
>for language detection.
>Oddly, the other way around, non-Latin characters are not stripped
>from the input if there are more Latin characters than non-Latin
>ones...

The example case is simplified, but it simulates the normal conditions I
need to handle: typically the task is to detect non-Latin languages, and
mostly to separate Western and Eastern languages.
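
For reference, a rough sketch (in Java) of the heuristic you describe;
this only illustrates the behaviour you observed, it is not LangDetect's
actual code:

class ScriptFilter {
    // Count Latin vs. non-Latin letters; when the non-Latin side is in
    // the majority, strip the Latin letters before language detection.
    // When the Latin side wins, the input is left untouched, matching
    // the asymmetry described above.
    static String preFilter(String text) {
        int latin = 0, nonLatin = 0;
        for (int i = 0; i < text.length(); i += Character.charCount(text.codePointAt(i))) {
            int cp = text.codePointAt(i);
            if (!Character.isLetter(cp)) continue;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN) latin++;
            else nonLatin++;
        }
        if (latin >= nonLatin) return text; // Latin majority: nothing stripped
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < text.length(); i += Character.charCount(text.codePointAt(i))) {
            int cp = text.codePointAt(i);
            boolean latinLetter = Character.isLetter(cp)
                    && Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN;
            if (!latinLetter) kept.appendCodePoint(cp); // drop Latin letters only
        }
        return kept.toString();
    }
}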

>Anyway, LangDetect's implementation ends up with a list of
>probabilities, and only the most probable one is kept by Solr's
>langdetect processor, if that probability satisfies a certain
>threshold.

Yes, I agree with you about the "list of probabilities", and I think
that if all of those probabilities were returned, my problem would be
partially solved.
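
For what it's worth, the underlying library already exposes the full
list via Detector.getProbabilities(). A minimal sketch, assuming the
language profiles sit in a local "profiles" directory (that path is
just an assumption for the example):

import java.util.ArrayList;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

public class AllProbabilities {
    public static void main(String[] args) throws LangDetectException {
        DetectorFactory.loadProfile("profiles"); // assumed profile directory
        Detector detector = DetectorFactory.create();
        detector.append("Hello world 你好世界");
        // Every candidate language with its probability, not only the
        // single best guess that the Solr processor keeps.
        ArrayList<Language> candidates = detector.getProbabilities();
        for (Language candidate : candidates) {
            System.out.println(candidate.lang + " -> " + candidate.prob);
        }
    }
}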

>In this very particular case, something simple, based on Unicode
>ranges, could be used to provide hints on how to chunk the input,
>because we need to split Western and Eastern languages, both written
>in well-isolated Unicode character ranges.
>Using this, the language identifier could be fed with chunks that are
>mostly made of one language only (presumably), and we could have
>different language identifications for each distinct chunk.

Intelligent chunk partitioning might be a separate and more
comprehensive task. Would it be possible to process the text line by
line (or in groups of a few lines)? If the detected language changes
between two consecutive lines (or groups of lines), that would indicate
a boundary between language ranges.
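
Something along those lines, for illustration only (same library and
assumed "profiles" directory as above). One caveat: very short lines
may not carry enough features for reliable detection, so this sketch
treats detection failures as "unknown":

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class LineByLine {
    public static void main(String[] args) throws LangDetectException {
        DetectorFactory.loadProfile("profiles"); // assumed profile directory
        String text = "This line is English.\n这一行是中文。\nBack to English again.";
        String previous = null;
        for (String line : text.split("\n")) {
            Detector detector = DetectorFactory.create(); // fresh detector per line
            detector.append(line);
            String lang;
            try {
                lang = detector.detect();
            } catch (LangDetectException e) {
                lang = "unknown"; // too little text on this line
            }
            if (previous != null && !lang.equals(previous)) {
                System.out.println("-- language changes here --");
            }
            System.out.println(lang + " | " + line);
            previous = lang;
        }
    }
}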


Thank you for the thoughtful comments.  

Best Regards, 
Bing 

