fyi, you can use the block property, but I think even better is to use
the unicode script property: http://unicode.org/reports/tr24/ . This
is more reliable because some characters are shared across different
scripts, and some scripts span multiple unicode blocks.

This is the direction I was heading in LUCENE-1488: based upon the
script, tokenize text in different ways, etc.  I think the last patch
I uploaded puts the script in the token flags as well.
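As a rough sketch of the script-property idea (this uses java.lang.Character.UnicodeScript from newer JDKs; ICU4J's UScript gives an equivalent lookup, and the helper name here is just mine):

```java
// Sketch: count characters per Unicode script to guess which analyzer
// a block of text needs. Characters with script=COMMON (punctuation,
// digits, spaces) or INHERITED are skipped, which is exactly what the
// block property can't do for you.
import java.util.EnumMap;
import java.util.Map;

public class ScriptSniffer {
    /** Returns the dominant script of the text, ignoring Common/Inherited. */
    public static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
            new EnumMap<>(Character.UnicodeScript.class);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED) {
                counts.merge(script, 1, Integer::sum);
            }
            i += Character.charCount(cp);
        }
        return counts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(Character.UnicodeScript.UNKNOWN);
    }

    public static void main(String[] args) {
        // Farsi text resolves to the ARABIC script even though the
        // language is not Arabic -- script detection picks the analyzer
        // family, not the exact language.
        System.out.println(dominantScript("سلام دنیا"));   // ARABIC
        System.out.println(dominantScript("hello world")); // LATIN
    }
}
```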

On Thu, Aug 6, 2009 at 6:44 PM, Cheolgoo Kang<app...@gmail.com> wrote:
> Are those 'blocks of text' (Unicode) Java strings? I don't think
> that's the case, but if so, you can use Character.UnicodeBlock to
> identify the script (and hence likely language) of the text.
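A minimal sketch of that Character.UnicodeBlock approach (standard JDK; the block constants are real, the helper name is mine):

```java
import java.lang.Character.UnicodeBlock;

public class BlockCheck {
    /** True if any character of s falls in the given Unicode block. */
    static boolean containsBlock(String s, UnicodeBlock block) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (UnicodeBlock.of(cp) == block) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // Farsi and Urdu text both live (mostly) in the ARABIC block,
        // and a single script can also span several blocks -- one reason
        // the script property is often more convenient than block checks.
        System.out.println(containsBlock("سلام", UnicodeBlock.ARABIC)); // true
        System.out.println(containsBlock("hello", UnicodeBlock.ARABIC)); // false
    }
}
```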
>
> And if it's just text files with an unknown character encoding, then
> ICU has a 'charset detector' that you can use. It 'suggests' a
> charset (with confidence values) from a byte stream. I don't know
> how it performs in terms of accuracy or speed. See
> http://userguide.icu-project.org/conversion/detection.
>
> Hope it helps.
>
> - Cheolgoo Kang
>
>
>
> On Fri, Aug 7, 2009 at 4:46 AM, Bradford
> Stephens<bradfordsteph...@gmail.com> wrote:
>> Hey there,
>>
>> We're trying to add foreign language support into our new search
>> engine -- languages like Arabic, Farsi, and Urdu (that don't work with
>> standard analyzers). But our data source doesn't tell us which
>> languages we're actually collecting -- we just get blocks of text. Has
>> anyone here worked on language detection so we can figure out what
>> analyzers to use? Are there commercial solutions?
>>
>> Much appreciated!
>>
>> --
>> http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> Media, and Computer Science
>>
>



-- 
Robert Muir
rcm...@gmail.com
