Hi all,
I think that depending on the language detector implementation, things may
vary...
For Tika, it performs better with longer inputs than with shorter ones (it
seems to rely on the probabilistic distribution of n-grams -- of
different sizes -- to compute distances against precomputed
language models).
From what I've understood, shortening the input could therefore confuse
the detector.
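For illustration, here is a minimal sketch of querying Tika's n-gram
identifier directly (assuming the org.apache.tika.language.LanguageIdentifier
API that ships with Tika; behaviour on short input may differ between
versions):

import org.apache.tika.language.LanguageIdentifier;

public class TikaLangIdExample {
    public static void main(String[] args) {
        // A reasonably long, single-language input gives the n-gram model
        // enough evidence to converge on one language.
        String text = "Curry is a generic term primarily employed in Western"
                + " culture to denote a wide variety of dishes.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        System.out.println("language = " + identifier.getLanguage());
        // isReasonablyCertain() tends to be false on very short or mixed input.
        System.out.println("certain  = " + identifier.isReasonablyCertain());
    }
}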
Nevertheless, feeding the language identifier with text known to be
written in many languages will certainly decrease the confidence of its
predictions. If the text is half English and half Chinese, the language
detector may not even be able to give a prediction above the certainty
threshold.
For the other implementation (
http://code.google.com/p/language-detection/ ), it seems to perform a
first pass on the input and tries to separate Latin characters from the
others. If there are more non-Latin characters than Latin ones, it will
process only the non-Latin characters for language detection.
Oddly, the reverse does not hold: non-Latin characters are not stripped
from the input when there are more Latin characters than non-Latin ones...
Anyway, LangDetect's implementation ends up with a list of
probabilities, and only the most probable one is kept by Solr's
langdetect processor, provided the probability satisfies a certain threshold.
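To make that concrete, here is a minimal sketch against the
language-detection API (the "profiles" directory path and the 0.5 threshold
are only illustrative values of mine; Solr's langdetect processor wires this
up for you):

import java.util.List;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

public class LangDetectExample {
    public static void main(String[] args) throws LangDetectException {
        // "profiles" is a hypothetical path to the library's language profiles.
        DetectorFactory.loadProfile("profiles");

        Detector detector = DetectorFactory.create();
        detector.append("Curry is a generic term primarily employed in Western culture.");

        // The library returns a ranked list of (language, probability) pairs.
        List<Language> probabilities = detector.getProbabilities();
        Language best = probabilities.get(0);

        // A threshold check similar in spirit to what the Solr processor does.
        if (best.prob >= 0.5) {
            System.out.println("detected: " + best.lang + " (" + best.prob + ")");
        } else {
            System.out.println("no prediction above the threshold");
        }
    }
}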
The tricky part here is chunking the input into an arbitrary number of
chunks: this can become expensive and complicated, so we need to find a
good candidate partition of the input.
In this particular case, something simple based on Unicode ranges could
be used to provide hints on how to chunk the input, because we need to
split Western and Eastern languages, which are written in well-isolated
Unicode character ranges.
Using this, the language identifier could be fed chunks that are
(presumably) mostly made of one language only, and we could get a
different language identification for each distinct chunk.
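As a rough sketch of that idea (my own illustration, not an existing Solr or
LangDetect feature), one could split the input on Unicode script boundaries
with Character.UnicodeScript and feed each chunk to the identifier
separately:

import java.util.ArrayList;
import java.util.List;

public class UnicodeScriptChunker {

    // Splits text into runs that stay within one Unicode script; COMMON and
    // INHERITED characters (punctuation, digits, whitespace) stick to the
    // current run.
    public static List<String> chunkByScript(String text) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Character.UnicodeScript currentScript = null;

        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
            boolean neutral = script == Character.UnicodeScript.COMMON
                           || script == Character.UnicodeScript.INHERITED;

            // Start a new chunk when the script actually changes.
            if (!neutral && currentScript != null && script != currentScript
                    && current.length() > 0) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (!neutral) {
                currentScript = script;
            }
            current.appendCodePoint(codePoint);
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Mixed Chinese/English input, similar to the example in the question.
        String mixed = "咖哩起源於印度。Curry is a generic term primarily employed in Western culture.";
        for (String chunk : chunkByScript(mixed)) {
            System.out.println("chunk: " + chunk);
        }
    }
}

Each resulting chunk could then be passed to the identifier on its own, so
the Chinese run and the English run get separate predictions.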
The hard part remains for languages sharing a large number of characters,
I guess. It's hard to say "here are the French parts and there are the
Italian parts" based on Unicode character ranges only.
That's even more complicated when the input text is badly accented, a
phenomenon that occurs quite frequently, but that's another thread :)
I don't know if that helps; I was just reading the thread mentioned
yesterday when this message about language detection arrived on the
list...
Kind regards,
--
Tanguy
On 13/03/2012 09:55, Jan Høydahl wrote:
Hi,
Language detection cannot do that as of now. It would be a great improvement
though. Language detectors are pluggable, so if you know of a Java
language detector which can do this, perhaps we could plug it in? Or we could
extend the current identifier with the capability of first splitting the text
into chunks and then running langid on each chunk. If you'd like to open a JIRA
for this, it will not be forgotten...
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
On 13 March 2012, at 04:25, bing wrote:
Hi, all,
I am using solr-langid (Solr 3.5.0) to do language detection, and I hope
that multiple languages in one text can be detected.
The example text is:
咖哩起源於印度。印度民間傳說咖哩是佛祖釋迦牟尼所創,由於咖哩的辛辣與香味可以幫助遮掩羊肉的腥騷,此舉即為用以幫助不吃豬肉與牛肉的印度人。在泰米爾語中,「kari」是「醬」的意思。在馬來西亞,kari也稱dal(當在mamak檔)。早期印度被蒙古人所建立的莫臥兒帝國(Mughal
Empire)所統治過,其間從波斯(現今的伊朗)帶來的飲食習慣,從而影響印度人的烹調風格直到現今。
Curry (plural, Curries) is a generic term primarily employed in Western
culture to denote a wide variety of dishes originating in Indian, Pakistani,
Bangladeshi, Sri Lankan, Thai or other Southeast Asian cuisines. Their
common feature is the incorporation of more or less complex combinations of
spices and herbs, usually (but not invariably) including fresh or dried hot
capsicum peppers, commonly called "chili" or "cayenne" peppers.
I want the text to be separated into two parts, so that the Chinese part
goes to "text_zh-tw" and the other to "text_en". Can I do something like
that?
Thank you.
Best Regards,
Bing