With Ultraseek, we switched to a dictionary-based segmenter for Chinese because the N-gram highlighting wasn't acceptable to our Chinese customers.
I guess it is something to check for each application.

wunder

On 11/27/07 10:46 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> For what it's worth I worked on indexing and searching a *massive* pile of
> data, a good portion of which was in CJ and some K. The n-gram approach was
> used for all 3 languages and the quality of search results, including
> highlighting, was evaluated and okay-ed by native speakers of these languages.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Walter Underwood <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, November 27, 2007 2:41:38 PM
> Subject: Re: CJK Analyzers for Solr
>
> Dictionaries are surprisingly expensive to build and maintain and
> bi-gram is surprisingly effective for Chinese. See this paper:
>
> http://citeseer.ist.psu.edu/kwok97comparing.html
>
> I expect that n-gram indexing would be less effective for Japanese
> because it is an inflected language. Korean is even harder. It might
> work to break Korean into the phonetic subparts and use n-gram on
> those.
>
> You should not do term highlighting with any of the n-gram methods.
> The relevance can be very good, but the highlighting just looks dumb.
>
> wunder
>
> On 11/27/07 8:54 AM, "Eswar K" <[EMAIL PROTECTED]> wrote:
>
>> Is there any specific reason why the CJK analyzers in Solr were chosen to be
>> n-gram based instead of being morphological analyzers, the kind that is
>> implemented in Google, as that is considered to be more effective than the
>> n-gram ones?
>>
>> Regards,
>> Eswar
>>
>> On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>
>>> thanks james...
>>>
>>> How much time does it take to index 18m docs?
>>>
>>> - Eswar
>>>
>>> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
>>>
>>>> i do not use the HYLANDA analyzer.
>>>>
>>>> i use je-analyzer and am indexing at least 18m docs.
>>>>
>>>> i m sorry, i only use a chinese analyzer.
>>>>
>>>> On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> What is the performance of these CJK analyzers (the one in lucene and
>>>>> hylanda)?
>>>>> We would potentially be indexing millions of documents.
>>>>>
>>>>> James,
>>>>>
>>>>> We will have a look at hylanda too. What abt japanese and korean
>>>>> analyzers, any recommendations?
>>>>>
>>>>> - Eswar
>>>>>
>>>>> On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> I don't think NGram is a good method for Chinese.
>>>>>>
>>>>>> CJKAnalyzer of Lucene is 2-Gram.
>>>>>>
>>>>>> Eswar K:
>>>>>> if it is a chinese analyzer, i recommend hylanda (www.hylanda.com). it is
>>>>>> the best chinese analyzer and it is not free.
>>>>>> if u wanna a free chinese analyzer, maybe u can try je-analyzer. it has
>>>>>> some problems when using it.
>>>>>>
>>>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]>
>>>>>> wrote:
>>>>>>
>>>>>>> Eswar,
>>>>>>>
>>>>>>> We've used the NGram stuff that exists in Lucene's contrib/analyzers
>>>>>>> instead of CJK. Doesn't that allow you to do everything that the Chinese
>>>>>>> and CJK analyzers do? It's been a few months since I've looked at the
>>>>>>> Chinese and CJK Analyzers, so I could be off.
>>>>>>>
>>>>>>> Otis
>>>>>>>
>>>>>>> --
>>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>>>
>>>>>>> ----- Original Message ----
>>>>>>> From: Eswar K <[EMAIL PROTECTED]>
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
>>>>>>> Subject: CJK Analyzers for Solr
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Does Solr come with Language analyzers for CJK? If not, can you please
>>>>>>> direct me to some good CJK analyzers?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Eswar
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> regards
>>>>>> jl
>>>>>>
>>>>>
>>>>
>>>> --
>>>> regards
>>>> jl
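
For anyone reading this thread later, here is a rough sketch of one way bigram highlighting can end up marking text that straddles word boundaries, while a dictionary segmenter leaves it alone. This is only an illustration in plain Python: it is not Lucene's CJKAnalyzer or the Ultraseek segmenter, the function names are made up, the dictionary is a four-word toy, and forward maximum matching is just one simple segmentation strategy picked for the example.

```python
# Toy comparison of bigram vs. dictionary tokenization for highlighting.
# All names and the tiny dictionary below are hypothetical, for illustration only.

def bigrams(text):
    """Overlapping character bigrams as (token, start, end) tuples."""
    return [(text[i:i + 2], i, i + 2) for i in range(len(text) - 1)]

def dictionary_segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append((candidate, i, i + length))
                i += length
                break
    return tokens

def highlight(text, tokens, query_terms):
    """Wrap every token that matches a query term in [ ], merging
    overlapping or adjacent spans."""
    spans = sorted((s, e) for tok, s, e in tokens if tok in query_terms)
    merged = []
    for s, e in spans:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    out, pos = [], 0
    for s, e in merged:
        out.append(text[pos:s])
        out.append("[" + text[s:e] + "]")
        pos = e
    out.append(text[pos:])
    return "".join(out)

DICT = {"中国", "人民", "银行", "国人"}   # toy dictionary (assumption)
text = "中国人民银行"                      # "People's Bank of China"
query = "国人"                             # a real word, but not a word in this text

bigram_terms = {tok for tok, _, _ in bigrams(query)}
dict_terms = {tok for tok, _, _ in dictionary_segment(query, DICT)}

print(highlight(text, bigrams(text), bigram_terms))
# 中[国人]民银行   (match straddles 中国 / 人民)
print(highlight(text, dictionary_segment(text, DICT), dict_terms))
# 中国人民银行     (segmented as 中国 / 人民 / 银行, so no match)
```

Whether the bigram match counts as a relevance win or a highlighting embarrassment probably depends on the application, which is the point made at the top of this message.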