it seems good.

On Dec 3, 2007 1:01 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:
> >Wunder - are you aware of any free dictionaries
> >for either C or J or K? When I dealt with this
> >in the past, I looked for something free, but
> >found only commercial dictionaries.
>
> I would use data files from:
>
> http://ftp.monash.edu.au/pub/nihongo/00INDEX.html
>
> -- Ken
>
> >Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >----- Original Message ----
> >From: Walter Underwood <[EMAIL PROTECTED]>
> >To: solr-user@lucene.apache.org
> >Sent: Wednesday, November 28, 2007 5:43:32 PM
> >Subject: Re: CJK Analyzers for Solr
> >
> >With Ultraseek, we switched to a dictionary-based segmenter for
> >Chinese because the N-gram highlighting wasn't acceptable to our
> >Chinese customers. I guess it is something to check for each
> >application.
> >
> >wunder
> >
> >On 11/27/07 10:46 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:
> >
> >> For what it's worth I worked on indexing and searching a *massive*
> >> pile of data, a good portion of which was in CJ and some K. The
> >> n-gram approach was used for all 3 languages and the quality of
> >> search results, including highlighting was evaluated and okay-ed by
> >> native speakers of these languages.
> >>
> >> Otis
> >> --
> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>
> >> ----- Original Message ----
> >> From: Walter Underwood <[EMAIL PROTECTED]>
> >> To: solr-user@lucene.apache.org
> >> Sent: Tuesday, November 27, 2007 2:41:38 PM
> >> Subject: Re: CJK Analyzers for Solr
> >>
> >> Dictionaries are surprisingly expensive to build and maintain and
> >> bi-gram is surprisingly effective for Chinese. See this paper:
> >>
> >> http://citeseer.ist.psu.edu/kwok97comparing.html
> >>
> >> I expect that n-gram indexing would be less effective for Japanese
> >> because it is an inflected language. Korean is even harder. It might
> >> work to break Korean into the phonetic subparts and use n-gram on
> >> those.
> >>
> >> You should not do term highlighting with any of the n-gram methods.
> >> The relevance can be very good, but the highlighting just looks dumb.
> >>
> >> wunder
> >>
> >> On 11/27/07 8:54 AM, "Eswar K" <[EMAIL PROTECTED]> wrote:
> >>
> >>> Is there any specific reason why the CJK analyzers in Solr were
> >>> chosen to be n-gram based instead of it being a morphological
> >>> analyzer which is kind of implemented in Google as it is considered
> >>> to be more effective than the n-gram ones?
> >>>
> >>> Regards,
> >>> Eswar
> >>>
> >>> On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> thanks james...
> >>>>
> >>>> How much time does it take to index 18m docs?
> >>>>
> >>>> - Eswar
> >>>>
> >>>> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> i not use HYLANDA analyzer.
> >>>>>
> >>>>> i use je-analyzer and indexing at least 18m docs.
> >>>>>
> >>>>> i m sorry i only use chinese analyzer.
> >>>>>
> >>>>> On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> What is the performance of these CJK analyzers (one in lucene
> >>>>>> and hylanda)?
> >>>>>> We would potentially be indexing millions of documents.
> >>>>>>
> >>>>>> James,
> >>>>>>
> >>>>>> We would have a look at hylanda too. What abt japanese and korean
> >>>>>> analyzers, any recommendations?
> >>>>>>
> >>>>>> - Eswar
> >>>>>>
> >>>>>> On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
> >>>>>>
> >>>>>>> I don't think NGram is good method for Chinese.
> >>>>>>>
> >>>>>>> CJKAnalyzer of Lucene is 2-Gram.
> >>>>>>>
> >>>>>>> Eswar K:
> >>>>>>> if it is chinese analyzer,,i recommend hylanda
> >>>>>>> (www.hylanda.com),,,it is the best chinese analyzer and it
> >>>>>>> not free.
> >>>>>>> if u wanna free chinese analyzer, maybe u can try je-analyzer.
> >>>>>>> it have some problem when using it.
> >>>>>>>
> >>>>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Eswar,
> >>>>>>>>
> >>>>>>>> We've used the NGram stuff that exists in Lucene's
> >>>>>>>> contrib/analyzers instead of CJK. Doesn't that allow you to
> >>>>>>>> do everything that the Chinese and CJK analyzers do? It's
> >>>>>>>> been a few months since I've looked at Chinese and CJK
> >>>>>>>> Analyzers, so I could be off.
> >>>>>>>>
> >>>>>>>> Otis
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >>>>>>>>
> >>>>>>>> ----- Original Message ----
> >>>>>>>> From: Eswar K <[EMAIL PROTECTED]>
> >>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
> >>>>>>>> Subject: CJK Analyzers for Solr
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Does Solr come with Language analyzers for CJK? If not, can
> >>>>>>>> you please direct me to some good CJK analyzers?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Eswar
> >>>>>>>
> >>>>>>> --
> >>>>>>> regards
> >>>>>>> jl
> >>>>>
> >>>>> --
> >>>>> regards
> >>>>> jl
>
> --
> Ken Krugler
> Krugle, Inc.
> +1 530-210-6378
> "If you can't find it, you can't fix it"

--
regards
jl
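
For anyone following the n-gram discussion in this thread, here is a minimal Python sketch of overlapping character bigrams, plus the Korean idea of breaking syllables into phonetic subparts before taking n-grams. It is an illustration only. The function names char_bigrams and jamo_bigrams are hypothetical, Unicode NFD normalization is assumed as the way to split Hangul syllables into jamo, and none of this is meant to mirror how Lucene's CJKAnalyzer is actually implemented.

```python
import unicodedata


def char_bigrams(text):
    """Return overlapping two-character tokens (a simplified 2-gram tokenizer)."""
    # Assumption for this sketch: whitespace is dropped, every other
    # character is treated as a token unit.
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character (or nothing) cannot form a bigram.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def jamo_bigrams(text):
    """Decompose Hangul syllables into jamo via NFD, then take bigrams."""
    return char_bigrams(unicodedata.normalize("NFD", text))


if __name__ == "__main__":
    print(char_bigrams("中文分词"))      # ['中文', '文分', '分词']
    print(len(jamo_bigrams("한글")))    # 5 (six jamo give five bigrams)
```

In the first example the token 文分 straddles the boundary between the words 中文 and 分词, which hints at why highlighting on n-gram tokens can look odd even when relevance is good.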