For what it's worth, I worked on indexing and searching a *massive* pile of data, a good portion of which was in Chinese and Japanese, plus some Korean. The n-gram approach was used for all three languages, and the quality of the search results, including highlighting, was evaluated and okayed by native speakers of those languages.
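To make the bigram approach concrete, here is a minimal sketch of what Lucene's contrib CJKAnalyzer produces for a short Chinese string. It assumes the Lucene 2.x-era Token API; the class name CJKBigramDemo and the sample text are illustrative only:

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;

/**
 * Prints the overlapping character bigrams that the contrib
 * CJKAnalyzer emits for a short Chinese string.
 */
public class CJKBigramDemo {
    public static void main(String[] args) throws Exception {
        CJKAnalyzer analyzer = new CJKAnalyzer();
        TokenStream tokens =
                analyzer.tokenStream("text", new StringReader("中文分词"));
        // Lucene 2.x-era iteration: next() returns null at end of stream.
        for (Token t = tokens.next(); t != null; t = tokens.next()) {
            System.out.println(t.termText());
        }
        // Prints: 中文, 文分, 分词 -- every adjacent pair of characters,
        // so a query bigram matches wherever the same pair occurs.
    }
}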
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Walter Underwood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 2:41:38 PM
Subject: Re: CJK Analyzers for Solr

Dictionaries are surprisingly expensive to build and maintain, and bi-grams are surprisingly effective for Chinese. See this paper:

http://citeseer.ist.psu.edu/kwok97comparing.html

I expect that n-gram indexing would be less effective for Japanese because it is an inflected language. Korean is even harder. It might work to break Korean into its phonetic subparts (jamo) and use n-grams on those.

You should not do term highlighting with any of the n-gram methods. The relevance can be very good, but the highlighting just looks dumb.

wunder

On 11/27/07 8:54 AM, "Eswar K" <[EMAIL PROTECTED]> wrote:

> Is there any specific reason why the CJK analyzers in Solr were chosen to be
> n-gram based rather than morphological analyzers, of the kind Google is
> thought to use, which are considered more effective than the n-gram ones?
>
> Regards,
> Eswar
>
> On Nov 27, 2007 7:57 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>
>> Thanks, James.
>>
>> How much time does it take to index 18M docs?
>>
>> - Eswar
>>
>> On Nov 27, 2007 7:43 AM, James liu <[EMAIL PROTECTED]> wrote:
>>
>>> I don't use the HYLANDA analyzer.
>>>
>>> I use je-analyzer and index at least 18M docs.
>>>
>>> I'm sorry, I have only used a Chinese analyzer.
>>>
>>> On Nov 27, 2007 10:01 AM, Eswar K <[EMAIL PROTECTED]> wrote:
>>>
>>>> What is the performance of these CJK analyzers (the one in Lucene and
>>>> hylanda)? We would potentially be indexing millions of documents.
>>>>
>>>> James,
>>>>
>>>> We will have a look at hylanda too. What about Japanese and Korean
>>>> analyzers -- any recommendations?
>>>>
>>>> - Eswar
>>>>
>>>> On Nov 27, 2007 7:21 AM, James liu <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> I don't think n-gram is a good method for Chinese.
>>>>>
>>>>> Lucene's CJKAnalyzer is 2-gram.
>>>>>
>>>>> Eswar K:
>>>>> If you need a Chinese analyzer, I recommend hylanda (www.hylanda.com);
>>>>> it is the best Chinese analyzer, but it is not free.
>>>>> If you want a free Chinese analyzer, maybe you can try je-analyzer,
>>>>> though it has some problems in use.
>>>>>
>>>>> On Nov 27, 2007 5:56 AM, Otis Gospodnetic <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> Eswar,
>>>>>>
>>>>>> We've used the n-gram code that exists in Lucene's contrib/analyzers
>>>>>> instead of CJK. Doesn't that allow you to do everything that the
>>>>>> Chinese and CJK analyzers do? It's been a few months since I've
>>>>>> looked at the Chinese and CJK analyzers, so I could be off.
>>>>>>
>>>>>> Otis
>>>>>>
>>>>>> --
>>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>>
>>>>>> ----- Original Message ----
>>>>>> From: Eswar K <[EMAIL PROTECTED]>
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Sent: Monday, November 26, 2007 8:30:52 AM
>>>>>> Subject: CJK Analyzers for Solr
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Does Solr come with language analyzers for CJK? If not, can you
>>>>>> please direct me to some good CJK analyzers?
>>>>>>
>>>>>> Regards,
>>>>>> Eswar
>>>>>
>>>>> --
>>>>> regards
>>>>> jl
>>>
>>> --
>>> regards
>>> jl
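Otis's suggestion earlier in the thread, using the generic n-gram code from Lucene's contrib/analyzers instead of the CJK-specific analyzer, can be tried with NGramTokenizer. A minimal sketch, assuming the Lucene 2.x contrib API; the Korean sample string and the class name NGramDemo are illustrative only:

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

/**
 * The script-agnostic alternative: NGramTokenizer from
 * contrib/analyzers with minGram = maxGram = 2 produces the same
 * overlapping bigrams for any language, here a Korean sample.
 */
public class NGramDemo {
    public static void main(String[] args) throws Exception {
        NGramTokenizer tokenizer =
                new NGramTokenizer(new StringReader("검색엔진"), 2, 2);
        for (Token t = tokenizer.next(); t != null; t = tokenizer.next()) {
            System.out.println(t.termText());
        }
        // Prints: 검색, 색엔, 엔진
    }
}

Unlike CJKAnalyzer, this tokenizer has no notion of scripts or word boundaries, which is what makes it usable across all three languages at the cost of also bigramming spaces and Latin text.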