Hi all,
I think that depending on the language detector implementation, things may
vary...
For Tika, it performs better with longer inputs than with shorter ones (it
seems to rely on the probabilistic distribution of n-grams -- of
different sizes -- to compute distances against precomputed
language models).
From what I've understood, shortening the input could therefore confuse
the detector.
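For illustration, here is a minimal sketch of querying Tika's n-gram
identifier directly (assuming the org.apache.tika.language.LanguageIdentifier
API that ships with Tika; behaviour on short input may differ between
versions):

import org.apache.tika.language.LanguageIdentifier;

public class TikaLangIdExample {
    public static void main(String[] args) {
        // A reasonably long, single-language input gives the n-gram model
        // enough evidence to converge on one language.
        String text = "Curry is a generic term primarily employed in Western"
                + " culture to denote a wide variety of dishes.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        System.out.println("language = " + identifier.getLanguage());
        // isReasonablyCertain() tends to be false on very short or mixed input.
        System.out.println("certain  = " + identifier.isReasonablyCertain());
    }
}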
Nevertheless, feeding the language identifier with text known to be
written in many languages will certainly decrease the confidence of its
predictions. If the text is half English and half Chinese, the language
detector may not even be able to give a prediction above the certainty
threshold.
For the other implementation (
http://code.google.com/p/language-detection/ ), it seems to perform a
first pass on the input and tries to separate Latin characters from the
others. If there are more non-Latin characters than Latin ones, it will
process only the non-Latin characters for language detection.
Oddly, the reverse does not hold: non-Latin characters are not stripped
from the input when there are more Latin characters than non-Latin ones...
Anyway, LangDetect's implementation ends up with a list of
probabilities, and only the most probable one is kept by Solr's
langdetect processor, provided the probability satisfies a certain threshold.
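To make that concrete, here is a minimal sketch against the
language-detection API (the "profiles" directory path and the 0.5 threshold
are only illustrative values of mine; Solr's langdetect processor wires this
up for you):

import java.util.List;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

public class LangDetectExample {
    public static void main(String[] args) throws LangDetectException {
        // "profiles" is a hypothetical path to the library's language profiles.
        DetectorFactory.loadProfile("profiles");

        Detector detector = DetectorFactory.create();
        detector.append("Curry is a generic term primarily employed in Western culture.");

        // The library returns a ranked list of (language, probability) pairs.
        List<Language> probabilities = detector.getProbabilities();
        Language best = probabilities.get(0);

        // A threshold check similar in spirit to what the Solr processor does.
        if (best.prob >= 0.5) {
            System.out.println("detected: " + best.lang + " (" + best.prob + ")");
        } else {
            System.out.println("no prediction above the threshold");
        }
    }
}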
The tricky part here is chunking the input into an arbitrary number of
chunks: this can become expensive and complicated, so we need to find a
good candidate partition of the input.
In this particular case, something simple based on Unicode ranges could
be used to provide hints on how to chunk the input, because we need to
split Western and Eastern languages, which are written in well-isolated
Unicode character ranges.
Using this, the language identifier could be fed chunks that are
(presumably) mostly made of one language only, and we could get a
different language identification for each distinct chunk.
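As a rough sketch of that idea (my own illustration, not an existing Solr or
LangDetect feature), one could split the input on Unicode script boundaries
with Character.UnicodeScript and feed each chunk to the identifier
separately:

import java.util.ArrayList;
import java.util.List;

public class UnicodeScriptChunker {

    // Splits text into runs that stay within one Unicode script; COMMON and
    // INHERITED characters (punctuation, digits, whitespace) stick to the
    // current run.
    public static List<String> chunkByScript(String text) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Character.UnicodeScript currentScript = null;

        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
            boolean neutral = script == Character.UnicodeScript.COMMON
                           || script == Character.UnicodeScript.INHERITED;

            // Start a new chunk when the script actually changes.
            if (!neutral && currentScript != null && script != currentScript
                    && current.length() > 0) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (!neutral) {
                currentScript = script;
            }
            current.appendCodePoint(codePoint);
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Mixed Chinese/English input, similar to the example in the question.
        String mixed = "咖哩起源於印度。Curry is a generic term primarily employed in Western culture.";
        for (String chunk : chunkByScript(mixed)) {
            System.out.println("chunk: " + chunk);
        }
    }
}

Each resulting chunk could then be passed to the identifier on its own, so
the Chinese run and the English run get separate predictions.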
The hard part remains for languages sharing a large number of characters,
I guess. It's hard to say "here are the French parts and there are the
Italian parts" based on Unicode character ranges only.
That's even more complicated when the input text is badly accented, a
phenomenon that occurs quite frequently, but that's another thread :)
I don't know if that helps; I was just reading the thread mentioned
yesterday when this message about language detection arrived on the
list...
Kind regards,
--
Tanguy
On 13/03/2012 09:55, Jan Høydahl wrote:
Hi,
Language detection cannot do that as of now. It would be a great improvement
though. Language detectors are pluggable, so if you know of a Java
language detector which can do this, perhaps we could plug it in? Or we could
extend the current identifier with the capability of first splitting the text
into chunks and then running langid on each chunk. If you'd like to open a JIRA
for this, it will not be forgotten...
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
On 13 March 2012, at 04:25, bing wrote:
Hi, all,
I am using solr-langid (Solr 3.5.0) to do language detection, and I hope
that multiple languages in one text can be detected.
The example text is:
咖哩起源於印度。印度民間傳說咖哩是佛祖釋迦牟尼所創,由於咖哩的辛辣與香味可以幫助遮掩羊肉的腥騷,此舉即為用以幫助不吃豬肉與牛肉的印度人。在泰米爾語中,「kari」是「醬」的意思。在馬來西亞,kari也稱dal(當在mamak檔)。早期印度被蒙古人所建立的莫臥兒帝國(Mughal
Empire)所統治過,其間從波斯(現今的伊朗)帶來的飲食習慣,從而影響印度人的烹調風格直到現今。
Curry (plural, Curries) is a generic term primarily employed in Western
culture to denote a wide variety of dishes originating in Indian, Pakistani,
Bangladeshi, Sri Lankan, Thai or other Southeast Asian cuisines. Their
common feature is the incorporation of more or less complex combinations of
spices and herbs, usually (but not invariably) including fresh or dried hot
capsicum peppers, commonly called "chili" or "cayenne" peppers.
I want the text to be separated into two parts, so that the Chinese part
goes to "text_zh-tw" and the other to "text_en". Can I do something like
that?
Thank you.
Best Regards,
Bing