Hi Shawn,
I may still be missing your point. Below is an example where the
ICUTokenizer splits
Now, I'm beginning to wonder if I really understand what those flags on the
CJKBigramFilter do.
The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them
back together into bigrams.
I thought that if you set han=true and hiragana=true, you would get this
kind of result, where the third bigram is composed of one hiragana and one
han character:
いろは革命歌 => “いろ” “ろは” “は革” “革命” “命歌”
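To make that expectation concrete, here is a minimal Python sketch of plain
overlapping bigramming over a list of unigrams. It is only an illustration
of the result I expected, not the CJKBigramFilter implementation: the
function name bigrams is made up, and it ignores the script-type flags
(han, hiragana, etc.) entirely.

```python
def bigrams(unigrams):
    """Join each pair of adjacent unigrams into an overlapping bigram."""
    return [a + b for a, b in zip(unigrams, unigrams[1:])]

# Assumption: one token per character, as in the unigram output for this input.
unigrams = list("いろは革命歌")
print(bigrams(unigrams))  # ['いろ', 'ろは', 'は革', '革命', '命歌']
```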
Hopefully the e-mail hasn't munged the output of the Solr analysis panel
below:
I can see this in our query processing, where outputUnigrams=false:
org.apache.solr.analysis.ICUTokenizerFactory {luceneMatchVersion=LUCENE_36}
Splits into unigrams
term text い ろ は 革 命 歌
org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false,
outputUnigrams=false, katakana=false, han=true, hiragana=true,
luceneMatchVersion=LUCENE_36}
Makes bigrams, including the middle one, which is one hiragana character and
one han character
term text いろ ろは は革 革命 命歌
It appears that if you set outputUnigrams=true (as we both do in the
indexing configuration), this doesn't happen.
org.apache.solr.analysis.CJKBigramFilterFactory {hangul=false,
outputUnigrams=true, katakana=false, han=true, hiragana=true,
luceneMatchVersion=LUCENE_36}
term text い ろ は 革 命 歌 革命 命歌
type <HIRAGANA> <HIRAGANA> <HIRAGANA> <SINGLE> <SINGLE> <SINGLE> <DOUBLE> <DOUBLE>
Not sure what happens for katakana as the ICUTokenizer doesn't convert it
to unigrams and our configuration is set to katakana=false. I'll play
around on the test machine when I have time.
Tom