Peter,

For CJK and n-grams, I think you don't want the *Edge* n-grams, but plain n-grams. Before you take the n-gram route, you may want to look at the smart Chinese analyzer in Lucene contrib (I believe it works only for Simplified Chinese) and Sen (on java.net). I also spotted a Korean analyzer in the wild a few months back.
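As a rough sketch of what a plain n-gram field could look like in schema.xml (the field type name and the 1-2 gram sizes are just assumptions -- a common starting point for CJK is unigrams plus bigrams; tune for your data):

```xml
<!-- Hypothetical fieldType for CJK text using plain (not edge) n-grams.
     minGramSize/maxGramSize of 1 and 2 indexes unigrams and bigrams. -->
<fieldType name="text_cjk_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

You'd then point your CJK fields at that type and check the results in the analysis admin page.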
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message ----
> From: Peter Wolanin <peter.wola...@acquia.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, November 10, 2009 4:06:52 PM
> Subject: any docs on solr.EdgeNGramFilterFactory?
>
> This fairly recent blog post:
>
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>
> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer
> for the index. I don't see any mention of that tokenizer on the Solr
> wiki - is it just waiting to be added, or is there any other
> documentation in addition to the blog post? In particular, there was
> a thread last year about using an N-gram tokenizer to enable
> reasonable (if not ideal) searching of CJK text, so I'd be curious to
> know how people are configuring their schema (with this tokenizer?)
> for that use case.
>
> Thanks,
>
> Peter
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist, Acquia. Inc.
> peter.wola...@acquia.com