Peter,

For CJK, I think you want plain n-grams, not *Edge* n-grams.
Before you take the n-gram route, you may want to look at the smart Chinese 
analyzer in Lucene contrib (I think it works only for Simplified Chinese) and 
Sen (on java.net).  I also spotted a Korean analyzer in the wild a few months 
back.
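For reference, a minimal sketch of what a plain n-gram field type for CJK might look like in schema.xml -- the field type name and gram sizes here are illustrative, not a recommendation:

```xml
<!-- Hypothetical example: tokenize CJK text into 1- and 2-character grams.
     NGramTokenizerFactory emits every substring gram, which is what you
     want for CJK substring matching. -->
<fieldType name="text_cjk_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory"
               minGramSize="1" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The distinction matters because EdgeNGramFilterFactory only emits grams anchored at the start of each token, which suits prefix-style auto-suggest but not arbitrary substring matching of unsegmented CJK text.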

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Peter Wolanin <peter.wola...@acquia.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, November 10, 2009 4:06:52 PM
> Subject: any docs on solr.EdgeNGramFilterFactory?
> 
> This fairly recent blog post:
> 
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
> 
> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer
> for the index.  I don't see any mention of that tokenizer on the Solr
> wiki - is it just waiting to be added, or is there any other
> documentation in addition to the blog post?  In particular, there was
> a thread last year about using an N-gram tokenizer to enable
> reasonable (if not ideal) searching of CJK text, so I'd be curious to
> know how people are configuring their schema (with this tokenizer?)
> for that use case.
> 
> Thanks,
> 
> Peter
> 
> -- 
> Peter M. Wolanin, Ph.D.
> Momentum Specialist,  Acquia, Inc.
> peter.wola...@acquia.com
