FYI, you can use the block property, but I think it's even better to use the Unicode script property: http://unicode.org/reports/tr24/ . This is easier because some characters are shared across different scripts, and some scripts span multiple Unicode blocks.
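A minimal sketch (plain JDK 7+, no ICU) of the block-vs-script distinction being described: two Arabic-script characters fall in different Unicode blocks, but both report the same script. The specific code points are illustrative choices, not from the thread.

```java
public class BlockVsScript {
    public static void main(String[] args) {
        char arabicLetter = '\u0628';      // ARABIC LETTER BEH, in the ARABIC block
        char presentationForm = '\uFE91';  // Its initial form, in ARABIC_PRESENTATION_FORMS_B

        // Block property: the two characters land in different blocks.
        System.out.println(Character.UnicodeBlock.of(arabicLetter));
        System.out.println(Character.UnicodeBlock.of(presentationForm));

        // Script property (UTR #24): both are script=Arabic.
        System.out.println(Character.UnicodeScript.of(arabicLetter));
        System.out.println(Character.UnicodeScript.of(presentationForm));

        // Characters common across scripts get script=COMMON rather than
        // a misleading per-block answer, e.g. an ASCII digit:
        System.out.println(Character.UnicodeScript.of('7'));
    }
}
```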
This is the direction I was heading in LUCENE-1488: based upon the script, tokenize text in different ways, etc. I think the last patch I uploaded puts the script in the token flags as well.

On Thu, Aug 6, 2009 at 6:44 PM, Cheolgoo Kang <app...@gmail.com> wrote:
> Are those 'blocks of text' (Unicode) Java strings? I don't think
> this is the case, but if so, you can use Character.UnicodeBlock to
> identify the language of the text.
>
> Or are they just text files with an unknown character encoding? Then
> ICU has a 'charset detector' that you can use. This feature 'suggests'
> a charset (with some probability values) from a byte stream. I don't
> know about its performance in terms of accuracy or speed. See
> http://userguide.icu-project.org/conversion/detection.
>
> Hope it helps.
>
> - Cheolgoo Kang
>
> On Fri, Aug 7, 2009 at 4:46 AM, Bradford Stephens <bradfordsteph...@gmail.com> wrote:
>> Hey there,
>>
>> We're trying to add foreign-language support to our new search
>> engine -- languages like Arabic, Farsi, and Urdu (which don't work
>> with standard analyzers). But our data source doesn't tell us which
>> languages we're actually collecting -- we just get blocks of text.
>> Has anyone here worked on language detection so we can figure out
>> which analyzers to use? Are there commercial solutions?
>>
>> Much appreciated!
>>
>> --
>> http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> Media, and Computer Science

--
Robert Muir
rcm...@gmail.com
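For the original question of picking an analyzer from raw text, a simple script-tallying heuristic along the lines Robert describes can be sketched in plain JDK 7+ (no ICU). `ScriptSniffer` and `dominantScript` are hypothetical names for illustration, not Lucene or ICU APIs:

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: guess the dominant Unicode script of a string by
// counting the script of each code point (Character.UnicodeScript follows
// UTR #24). The caller could then map script -> analyzer.
public class ScriptSniffer {
    public static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            // COMMON (punctuation, digits, spaces) and INHERITED (combining
            // marks) are shared across scripts, so they don't vote.
            if (s != Character.UnicodeScript.COMMON
                    && s != Character.UnicodeScript.INHERITED) {
                counts.merge(s, 1, Integer::sum);
            }
            i += Character.charCount(cp);
        }
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(Character.UnicodeScript.UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(dominantScript("hello world"));  // LATIN
        System.out.println(dominantScript("سلام دنیا"));     // ARABIC
    }
}
```

Note the caveat implicit in the thread: script is not language. Farsi and Urdu both report ARABIC here, so script detection narrows the analyzer choice but cannot by itself distinguish languages that share a script; for that you need statistical language detection or ICU's charset/language facilities.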