Re: CJK Analyzers for Solr

James liu Mon, 03 Dec 2007 18:11:58 -0800

It seems good.

On Dec 3, 2007 1:01 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:


> >Wunder - are you aware of any free dictionaries
> >for either C or J or K?  When I dealt with this
> >in the past, I looked for something free, but
> >found only commercial dictionaries.
>
> I would use data files from:
>
> http://ftp.monash.edu.au/pub/nihongo/00INDEX.html
>
> -- Ken
>
>
> >Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >----- Original Message ----
> >From: Walter Underwood <[EMAIL PROTECTED]>
> >To: solr-user@lucene.apache.org
> >Sent: Wednesday, November 28, 2007 5:43:32 PM
> >Subject: Re: CJK Analyzers for Solr
> >
> >With Ultraseek, we switched to a dictionary-based segmenter for
> >Chinese because the N-gram highlighting wasn't acceptable to our
> >Chinese customers. I guess it is something to check for each
> >application.
> >
> >wunder
> >
> >On 11/27/07 10:46 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> >
> >> For what it's worth, I worked on indexing and searching a *massive*
> >> pile of data, a good portion of which was in CJ and some K.  The
> >> n-gram approach was used for all 3 languages, and the quality of
> >> search results, including highlighting, was evaluated and okay-ed
> >> by native speakers of these languages.
> >>
> >> Otis
> >> --
> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>
> >> ----- Original Message ----
> >> From: Walter Underwood <[EMAIL PROTECTED]>
> >> To: solr-user@lucene.apache.org
> >> Sent: Tuesday, November 27, 2007 2:41:38 PM
> >> Subject: Re: CJK Analyzers for Solr
> >>
> >> Dictionaries are surprisingly expensive to build and maintain, and
> >> bi-gram is surprisingly effective for Chinese. See this paper:
> >>
> >> http://citeseer.ist.psu.edu/kwok97comparing.html
> >>
> >> I expect that n-gram indexing would be less effective for Japanese
> >> because it is an inflected language. Korean is even harder. It
> >> might work to break Korean into the phonetic subparts and use
> >> n-gram on those.
> >>
> >> You should not do term highlighting with any of the n-gram methods.
> >> The relevance can be very good, but the highlighting just looks dumb.
> >>
> >> wunder
> >>
> >> On 11/27/07 8:54 AM, "Eswar K" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Is there any specific reason why the CJK analyzers in Solr were
> >>> chosen to be n-gram based instead of morphological analyzers,
> >>> which is kind of what Google implements, as that approach is
> >>> considered to be more effective than the n-gram ones?
> >>>
> >>> Regards,
> >>> Eswar
> >>>
> >>> On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> Thanks, James...
> >>>>
> >>>> How much time does it take to index 18m docs?
> >>>>
> >>>> - Eswar
> >>>>
> >>>> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> I do not use the HYLANDA analyzer.
> >>>>>
> >>>>> I use je-analyzer and am indexing at least 18m docs.
> >>>>>
> >>>>> I'm sorry, I only use a Chinese analyzer.
> >>>>>
> >>>>> On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> What is the performance of these CJK analyzers (the one in
> >>>>>> Lucene and hylanda)?
> >>>>>> We would potentially be indexing millions of documents.
> >>>>>>
> >>>>>> James,
> >>>>>>
> >>>>>> We will have a look at hylanda too. What about Japanese and
> >>>>>> Korean analyzers, any recommendations?
> >>>>>>
> >>>>>> - Eswar
> >>>>>>
> >>>>>> On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
> >>>>>>
> >>>>>>> I don't think NGram is a good method for Chinese.
> >>>>>>>
> >>>>>>> CJKAnalyzer of Lucene is 2-Gram.
> >>>>>>>
> >>>>>>> Eswar K:
> >>>>>>>  If it is a Chinese analyzer, I recommend hylanda
> >>>>>>>  (www.hylanda.com); it is the best Chinese analyzer, and it
> >>>>>>>  is not free.
> >>>>>>>  If you want a free Chinese analyzer, maybe you can try
> >>>>>>>  je-analyzer. It has some problems when using it.
> >>>>>>>
> >>>>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Eswar,
> >>>>>>>>
> >>>>>>>> We've used the NGram stuff that exists in Lucene's
> >>>>>>>> contrib/analyzers instead of CJK.  Doesn't that allow you to
> >>>>>>>> do everything that the Chinese and CJK analyzers do?  It's
> >>>>>>>> been a few months since I've looked at Chinese and CJK
> >>>>>>>> Analyzers, so I could be off.
> >>>>>>>>
> >>>>>>>> Otis
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>>>>>>>
> >>>>>>>> ----- Original Message ----
> >>>>>>>> From: Eswar K <[EMAIL PROTECTED]>
> >>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
> >>>>>>>> Subject: CJK Analyzers for Solr
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Does Solr come with language analyzers for CJK? If not, can
> >>>>>>>> you please direct me to some good CJK analyzers?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Eswar
> >>>>>>>
> >>>>>>> --
> >>>>>>> regards
> >>>>>>> jl
> >>>>>
> >>>>> --
> >>>>> regards
> >>>>> jl
>
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"




-- 
regards
jl
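
For readers following the n-gram discussion quoted above, here is a
minimal sketch of overlapping bigram ("2-gram") tokenization and of why
highlighting on bigrams can look odd. It is not Lucene's CJKAnalyzer
code; the function names (cjk_bigrams, matched_bigrams) and the sample
strings are assumptions chosen only for illustration.

```python
def cjk_bigrams(run):
    """Split a run of CJK characters into overlapping two-character tokens."""
    if len(run) < 2:
        # A single character (or an empty string) cannot form a pair.
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def matched_bigrams(query, text):
    """Return the query bigrams that also occur among the text's bigrams."""
    indexed = set(cjk_bigrams(text))
    return [gram for gram in cjk_bigrams(query) if gram in indexed]


# Hypothetical sample: the text reads roughly "research the origin of life",
# while the query is the unrelated word "graduate student".
text = "研究生命起源"
query = "研究生"

print(cjk_bigrams(text))
print(matched_bigrams(query, text))
```

In this sample every bigram of the query also occurs in the text, even
though the query word itself does not appear there as a word (the text
segments as 研究 / 生命 / 起源). A highlighter working on these tokens
would mark characters that straddle a word boundary, which is the kind
of result a dictionary-based segmenter is meant to avoid.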
