It looks like the CJK one actually does 2-grams plus a little
separate processing on Latin text.
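
In other words, if I read it right, a run of ideographs comes out as
overlapping pairs.  A quick sketch of the idea (my own, illustrative
only, not the actual CJKTokenizer code):

  public class BigramDemo {
      public static void main(String[] args) {
          // Stand-in for a run of CJK characters.
          String run = "ABCD";
          // Emit overlapping character bigrams: AB, BC, CD.
          for (int i = 0; i + 2 <= run.length(); i++) {
              System.out.println(run.substring(i, i + 2));
          }
      }
  }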

That's kind of interesting - in general, can I build a custom tokenizer
from existing tokenizers that treats different parts of the input
differently based on the Unicode range of the characters?  E.g. use a
Porter stemmer for stretches of Latin text and n-grams (or something
else) for CJK?
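
Roughly, I'm picturing a first pass that splits the input into
same-script runs, and then each run gets handed to a different analysis
chain.  An untested sketch of just the run-splitting step, in plain
Java (the choice of blocks is my guess, and the Lucene wiring is left
out entirely):

  import java.util.ArrayList;
  import java.util.List;

  public class ScriptRunSplitter {

      // Treat CJK ideographs, kana, and hangul as one class;
      // everything else is handled as "Latin".
      static boolean isCjk(char c) {
          Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
          return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
              || b == Character.UnicodeBlock.HIRAGANA
              || b == Character.UnicodeBlock.KATAKANA
              || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
      }

      // Split text into maximal runs of the same class, so each run
      // could be fed to a different sub-tokenizer (stemming vs. n-grams).
      public static List<String> runs(String text) {
          List<String> out = new ArrayList<String>();
          int start = 0;
          for (int i = 1; i <= text.length(); i++) {
              if (i == text.length()
                      || isCjk(text.charAt(i)) != isCjk(text.charAt(start))) {
                  out.add(text.substring(start, i));
                  start = i;
              }
          }
          return out;
      }
  }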

-Peter

On Tue, Nov 10, 2009 at 9:21 PM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Yes, that's the n-gram one.  I believe the existing CJK one in Lucene is 
> really just an n-gram tokenizer, so no different than the normal n-gram 
> tokenizer.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Peter Wolanin <peter.wola...@acquia.com>
>> To: solr-user@lucene.apache.org
>> Sent: Tue, November 10, 2009 7:34:37 PM
>> Subject: Re: any docs on solr.EdgeNGramFilterFactory?
>>
>> So, this is the normal N-gram one?  NGramTokenizerFactory
>>
>> Digging deeper - there are actually CJK and Chinese tokenizers in the
>> Solr codebase:
>>
>> http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html
>> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html
>>
>> The CJK one uses the lucene CJKTokenizer
>> http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html
>>
>> and there even seems to be another one that no one has wrapped into Solr:
>> http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html
>>
>> So it seems like the existing options are a little better than I thought,
>> though it would be nice to have some docs on properly configuring
>> these.
>>
>> -Peter
>>
>> On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic
>> wrote:
>> > Peter,
>> >
>> > For CJK and n-grams, I think you don't want the *Edge* n-grams, but just
>> > n-grams.
>> > Before you take the n-gram route, you may want to look at the smart Chinese
>> > analyzer in Lucene contrib (I think it works only for Simplified Chinese) and
>> > Sen (on java.net).  I also spotted a Korean analyzer in the wild a few months
>> > back.
>> >
>> > Otis
>> > --
>> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> >
>> >
>> >
>> > ----- Original Message ----
>> >> From: Peter Wolanin
>> >> To: solr-user@lucene.apache.org
>> >> Sent: Tue, November 10, 2009 4:06:52 PM
>> >> Subject: any docs on solr.EdgeNGramFilterFactory?
>> >>
>> >> This fairly recent blog post:
>> >>
>> >> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>> >>
>> >> describes the use of solr.EdgeNGramFilterFactory as an analysis
>> >> filter at index time.  I don't see any mention of that filter on the
>> >> Solr wiki - is it just waiting to be added, or is there documentation
>> >> anywhere besides the blog post?  In particular, there was a thread
>> >> last year about using an N-gram tokenizer to enable reasonable (if
>> >> not ideal) searching of CJK text, so I'd be curious to know how
>> >> people are configuring their schema (with this filter?) for that use
>> >> case.
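>> >>
>> >> From the blog post, I'd guess a field type along these lines
>> >> (untested on my end; the field type name and gram sizes are made up):
>> >>
>> >> <fieldType name="autocomplete" class="solr.TextField">
>> >>   <analyzer type="index">
>> >>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>     <filter class="solr.LowerCaseFilterFactory"/>
>> >>     <!-- index every leading prefix of the field value -->
>> >>     <filter class="solr.EdgeNGramFilterFactory"
>> >>             minGramSize="1" maxGramSize="25"/>
>> >>   </analyzer>
>> >>   <analyzer type="query">
>> >>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>> >>     <filter class="solr.LowerCaseFilterFactory"/>
>> >>   </analyzer>
>> >> </fieldType>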
>> >>
>> >> Thanks,
>> >>
>> >> Peter
>> >>
>> >> --
>> >> Peter M. Wolanin, Ph.D.
>> >> Momentum Specialist, Acquia, Inc.
>> >> peter.wola...@acquia.com
>> >
>> >
>>
>>
>>
>> --
>> Peter M. Wolanin, Ph.D.
>> Momentum Specialist, Acquia, Inc.
>> peter.wola...@acquia.com
>
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wola...@acquia.com
