Hi, Tanguy,
> For the other implementation
> ( http://code.google.com/p/language-detection/ ), it seems to be
> performing a first pass on the input, and tries to separate Latin
> characters from the others. If there are more non-Latin characters
> than Latin ones, then it will process the non-Latin characters only
> for language detection.
> Oddly, the other way around, non-Latin characters are not stripped
> from the input if there are more Latin characters than non-Latin
> ones...

The example case is simplified, but it simulates the normal conditions
I need to handle, i.e. normally the task is to detect non-Latin
languages, and mostly to separate western and eastern languages.

> Anyway, LangDetect's implementation ends up with a list of
> probabilities, and only the most accurate one is kept by solr's
> langdetect processor, if the probability satisfies a certain
> threshold.

Yes, I agree with you on "a list of probabilities", and I think that if
all of those probabilities were returned, then my problem would be
partially solved. (I have appended a small sketch of reading the full
probability list at the end of this message.)

> In this very particular case, something simple, based on unicode
> ranges, could be used to provide hints on how to chunk the input,
> because we need to split western and eastern languages, both written
> in well-isolated unicode character ranges.
> Using this, the language identifier could be fed with chunks that are
> mostly made of one language only (presumably), and we could have
> different language identifications for each distinct chunk.

Intelligent chunk partitioning might be a different and more
comprehensive task (a sketch of the unicode-range idea is appended
below as well). Would it be possible to process the text line by line
(or several lines at a time)? If the detected language changes between
two consecutive lines (or groups of lines), that indicates a boundary
between language ranges; see the last sketch below.

Thank you for the thoughtful comments.

Best Regards,
Bing
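
Here is a minimal sketch of reading the full probability list, assuming
the com.cybozu.labs.langdetect API (DetectorFactory / Detector /
Language) from the library linked above; the "profiles" directory path
and the sample string are only placeholders:

    import java.util.ArrayList;
    import com.cybozu.labs.langdetect.Detector;
    import com.cybozu.labs.langdetect.DetectorFactory;
    import com.cybozu.labs.langdetect.LangDetectException;
    import com.cybozu.labs.langdetect.Language;

    public class AllProbabilities {
        public static void main(String[] args) throws LangDetectException {
            // Load the language profiles once per JVM; the directory
            // location here is an assumption.
            DetectorFactory.loadProfile("profiles");

            Detector detector = DetectorFactory.create();
            detector.append("Hello world. 你好，世界。");

            // getProbabilities() returns every candidate language with
            // its probability, not only the single best guess that
            // solr's langdetect processor keeps.
            ArrayList<Language> candidates = detector.getProbabilities();
            for (Language candidate : candidates) {
                System.out.println(candidate.lang + " : " + candidate.prob);
            }
        }
    }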
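
The unicode-range chunking hint could look roughly like the following;
this is plain Java with no dependencies, and the 0x024F cut-off (Basic
Latin through Latin Extended-B) is only my assumption of what counts as
"western":

    import java.util.ArrayList;
    import java.util.List;

    public class ScriptChunker {

        // Assumed definition of a "western" code point.
        private static boolean isLatin(int codePoint) {
            return codePoint <= 0x024F;
        }

        // Split the text into runs of Latin-range and non-Latin-range
        // characters, so that each chunk can be fed to the language
        // identifier separately.
        public static List<String> chunkByScript(String text) {
            List<String> chunks = new ArrayList<String>();
            StringBuilder current = new StringBuilder();
            Boolean currentIsLatin = null;
            for (int i = 0; i < text.length(); ) {
                int cp = text.codePointAt(i);
                i += Character.charCount(cp);
                if (Character.isWhitespace(cp)) {
                    // Whitespace is neutral; keep it with the open chunk.
                    current.appendCodePoint(cp);
                    continue;
                }
                boolean latin = isLatin(cp);
                if (currentIsLatin != null && latin != currentIsLatin) {
                    chunks.add(current.toString()); // script changed
                    current.setLength(0);
                }
                currentIsLatin = latin;
                current.appendCodePoint(cp);
            }
            if (current.length() > 0) {
                chunks.add(current.toString());
            }
            return chunks;
        }
    }

Each chunk returned by chunkByScript() could then be run through its
own Detector instance, giving one identification per chunk.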
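
And the line-by-line idea, again assuming the com.cybozu.labs.langdetect
API; note that very short lines may not carry enough features for a
reliable detection (the library can throw an exception on such input):

    import com.cybozu.labs.langdetect.Detector;
    import com.cybozu.labs.langdetect.DetectorFactory;
    import com.cybozu.labs.langdetect.LangDetectException;

    public class LineByLineDetect {
        public static void main(String[] args) throws LangDetectException {
            DetectorFactory.loadProfile("profiles"); // path is an assumption

            // Sample lines standing in for the real document.
            String[] lines = {
                "This paragraph is written in English.",
                "这一段完全是用中文写的。"
            };

            String previousLang = null;
            for (int i = 0; i < lines.length; i++) {
                // Fresh detector per line, since append() accumulates text.
                Detector detector = DetectorFactory.create();
                detector.append(lines[i]);
                String lang = detector.detect();
                if (previousLang != null && !lang.equals(previousLang)) {
                    System.out.println("language boundary before line " + i
                            + " (" + previousLang + " -> " + lang + ")");
                }
                previousLang = lang;
            }
        }
    }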