For what it's worth, I worked on indexing and searching a *massive* pile of data, a good portion of which was in Chinese and Japanese, plus some Korean. The n-gram approach was used for all three languages, and the quality of the search results, including highlighting, was evaluated and okayed by native speakers of those languages.
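To make the bigram approach concrete, here is a minimal sketch of what Lucene's contrib CJKAnalyzer produces for a short Chinese string. It assumes the Lucene 2.x-era Token API; the class name CJKBigramDemo and the sample text are illustrative only:

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;

/**
 * Prints the overlapping character bigrams that the contrib
 * CJKAnalyzer emits for a short Chinese string.
 */
public class CJKBigramDemo {
    public static void main(String[] args) throws Exception {
        CJKAnalyzer analyzer = new CJKAnalyzer();
        TokenStream tokens =
                analyzer.tokenStream("text", new StringReader("中文分词"));
        // Lucene 2.x-era iteration: next() returns null at end of stream.
        for (Token t = tokens.next(); t != null; t = tokens.next()) {
            System.out.println(t.termText());
        }
        // Prints: 中文, 文分, 分词 -- every adjacent pair of characters,
        // so a query bigram matches wherever the same pair occurs.
    }
}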
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Walter Underwood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 2:41:38 PM
Subject: Re: CJK Analyzers for Solr

Dictionaries are surprisingly expensive to build and maintain, and bi-grams are surprisingly effective for Chinese. See this paper:

http://citeseer.ist.psu.edu/kwok97comparing.html

I expect that n-gram indexing would be less effective for Japanese because it is an inflected language. Korean is even harder. It might work to break Korean into its phonetic subparts (jamo) and use n-grams on those.

You should not do term highlighting with any of the n-gram methods. The relevance can be very good, but the highlighting just looks dumb.

wunder

On 11/27/07 8:54 AM, "Eswar K" <[EMAIL PROTECTED]> wrote:

> Is there any specific reason why the CJK analyzers in Solr were chosen to be
> n-gram based rather than morphological analyzers, of the kind Google is
> thought to use, which are considered more effective than the n-gram ones?
>
> Regards,
> Eswar
>
> On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>
>> Thanks, James.
>>
>> How much time does it take to index 18M docs?
>>
>> - Eswar
>>
>> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
>>
>>> I don't use the HYLANDA analyzer.
>>>
>>> I use je-analyzer and index at least 18M docs.
>>>
>>> I'm sorry, I have only used a Chinese analyzer.
>>>
>>> On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>>
>>>> What is the performance of these CJK analyzers (the one in Lucene and
>>>> hylanda)? We would potentially be indexing millions of documents.
>>>>
>>>> James,
>>>>
>>>> We will have a look at hylanda too. What about Japanese and Korean
>>>> analyzers -- any recommendations?
>>>>
>>>> - Eswar
>>>>
>>>> On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I don't think n-gram is a good method for Chinese.
>>>>>
>>>>> Lucene's CJKAnalyzer is 2-gram.
>>>>>
>>>>> Eswar K:
>>>>> If you need a Chinese analyzer, I recommend hylanda (www.hylanda.com);
>>>>> it is the best Chinese analyzer, but it is not free.
>>>>> If you want a free Chinese analyzer, maybe you can try je-analyzer,
>>>>> though it has some problems in use.
>>>>>
>>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> Eswar,
>>>>>>
>>>>>> We've used the n-gram code that exists in Lucene's contrib/analyzers
>>>>>> instead of CJK. Doesn't that allow you to do everything that the
>>>>>> Chinese and CJK analyzers do? It's been a few months since I've
>>>>>> looked at the Chinese and CJK analyzers, so I could be off.
>>>>>>
>>>>>> Otis
>>>>>>
>>>>>> --
>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>>
>>>>>> ----- Original Message ----
>>>>>> From: Eswar K <[EMAIL PROTECTED]>
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
>>>>>> Subject: CJK Analyzers for Solr
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Does Solr come with language analyzers for CJK? If not, can you
>>>>>> please direct me to some good CJK analyzers?
>>>>>>
>>>>>> Regards,
>>>>>> Eswar
>>>>>
>>>>> --
>>>>> regards
>>>>> jl
>>>
>>> --
>>> regards
>>> jl
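Otis's suggestion earlier in the thread, using the generic n-gram code from Lucene's contrib/analyzers instead of the CJK-specific analyzer, can be tried with NGramTokenizer. A minimal sketch, assuming the Lucene 2.x contrib API; the Korean sample string and the class name NGramDemo are illustrative only:

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

/**
 * The script-agnostic alternative: NGramTokenizer from
 * contrib/analyzers with minGram = maxGram = 2 produces the same
 * overlapping bigrams for any language, here a Korean sample.
 */
public class NGramDemo {
    public static void main(String[] args) throws Exception {
        NGramTokenizer tokenizer =
                new NGramTokenizer(new StringReader("검색엔진"), 2, 2);
        for (Token t = tokenizer.next(); t != null; t = tokenizer.next()) {
            System.out.println(t.termText());
        }
        // Prints: 검색, 색엔, 엔진
    }
}

Unlike CJKAnalyzer, this tokenizer has no notion of scripts or word boundaries, which is what makes it usable across all three languages at the cost of also bigramming spaces and Latin text.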