Re: CJK Analyzers for Solr

Walter Underwood Wed, 28 Nov 2007 08:44:09 -0800

With Ultraseek, we switched to a dictionary-based segmenter for Chinese
because the N-gram highlighting wasn't acceptable to our Chinese customers.


I guess it is something to check for each application.
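
To make the highlighting difference concrete, here is a toy sketch in Python. It is not Ultraseek's segmenter or Lucene's code. It assumes overlapping character bigrams (the 2-gram scheme the thread attributes to CJKAnalyzer) and a greedy longest-match lookup against a made-up three-word dictionary. All function names and the sample text are illustrative only.

```python
def bigram_tokens(text):
    """Return (start, end, token) for each overlapping character bigram."""
    if len(text) < 2:
        return [(0, len(text), text)] if text else []
    return [(i, i + 2, text[i:i + 2]) for i in range(len(text) - 1)]


def dictionary_tokens(text, dictionary):
    """Greedy forward longest match; unknown characters become single tokens."""
    longest = max((len(word) for word in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        size = min(longest, len(text) - i)
        # Shrink the candidate until it is a dictionary word or one character.
        while size > 1 and text[i:i + size] not in dictionary:
            size -= 1
        tokens.append((i, i + size, text[i:i + size]))
        i += size
    return tokens


def highlight(text, tokens, query):
    """Wrap every character covered by a token equal to the query in [ ]."""
    marked = [False] * len(text)
    for start, end, token in tokens:
        if token == query:
            for pos in range(start, end):
                marked[pos] = True
    out, inside = [], False
    for pos, ch in enumerate(text):
        if marked[pos] and not inside:
            out.append("[")
            inside = True
        elif not marked[pos] and inside:
            out.append("]")
            inside = False
        out.append(ch)
    if inside:
        out.append("]")
    return "".join(out)


text = "北京大学"  # "Peking University"
toy_dictionary = {"北京大学", "北京", "大学"}  # hypothetical word list

print(highlight(text, bigram_tokens(text), "京大"))
print(highlight(text, dictionary_tokens(text, toy_dictionary), "京大"))
```

With bigrams, the query 京大 matches in the middle of 北京大学 and the highlight cuts across the word. With the toy dictionary, the text is a single token and nothing is highlighted. Real segmenters are far more involved than this greedy match.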

wunder

On 11/27/07 10:46 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> For what it's worth I worked on indexing and searching a *massive* pile of
> data, a good portion of which was in CJ and some K.  The n-gram approach was
> used for all 3 languages and the quality of search results, including
> highlighting, was evaluated and okayed by native speakers of these languages.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> ----- Original Message ----
> From: Walter Underwood <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, November 27, 2007 2:41:38 PM
> Subject: Re: CJK Analyzers for Solr
> 
> Dictionaries are surprisingly expensive to build and maintain and
> bi-gram is surprisingly effective for Chinese. See this paper:
> 
>    http://citeseer.ist.psu.edu/kwok97comparing.html
> 
> I expect that n-gram indexing would be less effective for Japanese
> because it is an inflected language. Korean is even harder. It might
> work to break Korean into the phonetic subparts and use n-gram on
> those.
> 
> You should not do term highlighting with any of the n-gram methods.
> The relevance can be very good, but the highlighting just looks dumb.
> 
> wunder
> 
> On 11/27/07 8:54 AM, "Eswar K" <[EMAIL PROTECTED]> wrote:
> 
>> Is there any specific reason why the CJK analyzers in Solr were chosen
>> to be n-gram based instead of a morphological analyzer, which is kind
>> of what is implemented in Google, as it is considered to be more
>> effective than the n-gram ones?
>> 
>> Regards,
>> Eswar
>> 
>> 
>> 
>> On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>> 
>>> thanks james...
>>> 
>>> How much time does it take to index 18m docs?
>>> 
>>> - Eswar
>>> 
>>> 
>>> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED] > wrote:
>>> 
>>>> I do not use the HYLANDA analyzer.
>>>> 
>>>> I use je-analyzer and have indexed at least 18m docs.
>>>> 
>>>> I'm sorry, I only use a Chinese analyzer.
>>>> 
>>>> 
>>>> On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>>> 
>>>>> What is the performance of these CJK analyzers (the one in Lucene and
>>>>> hylanda)?
>>>>> We would potentially be indexing millions of documents.
>>>>> 
>>>>> James,
>>>>> 
>>>>> We will have a look at hylanda too. What about Japanese and Korean
>>>>> analyzers, any recommendations?
>>>>> 
>>>>> - Eswar
>>>>> 
>>>>> On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
>>>>> 
>>>>>> I don't think NGram is a good method for Chinese.
>>>>>> 
>>>>>> CJKAnalyzer of Lucene is 2-Gram.
>>>>>> 
>>>>>> Eswar K:
>>>>>>  If you need a Chinese analyzer, I recommend hylanda (www.hylanda.com).
>>>>>>  It is the best Chinese analyzer, but it is not free.
>>>>>>  If you want a free Chinese analyzer, maybe you can try je-analyzer.
>>>>>>  It has some problems when you use it.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>>>>> 
>>>>>>> Eswar,
>>>>>>> 
>>>>>>> We've used the NGram stuff that exists in Lucene's contrib/analyzers
>>>>>>> instead of CJK.  Doesn't that allow you to do everything that the
>>>>>>> Chinese and CJK analyzers do?  It's been a few months since I've
>>>>>>> looked at the Chinese and CJK Analyzers, so I could be off.
>>>>>>> 
>>>>>>> Otis
>>>>>>> 
>>>>>>> --
>>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>>> 
>>>>>>> ----- Original Message ----
>>>>>>> From: Eswar K <[EMAIL PROTECTED]>
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
>>>>>>> Subject: CJK Analyzers for Solr
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Does Solr come with language analyzers for CJK? If not, can you
>>>>>>> please direct me to some good CJK analyzers?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Eswar
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> regards
>>>>>> jl
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> regards
>>>> jl
>>>> 
>>> 
>>> 
> 
> 
> 
>
