Speaking from experience: if you are using bigrams for CJK, do not highlight. The results will look very wrong to someone who knows the language.
Even with a dictionary-based tokenizer, you'll need a client dictionary for local terms.

wunder

On Jan 2, 2013, at 10:51 AM, Tom Burton-West wrote:

> Hello all,
>
> What are the best practices for setting up the highlighter to work with CJK?
> We are using the ICUTokenizer with the CJKBigramFilter, so overlapping
> bigrams are what are actually being searched. However the highlighter seems
> to only highlight the first of any two overlapping bigrams. i.e. ABC =>
> searched as AB BC only AB gets highlighted even if the matching string is
> ABC. (Where ABC are chinese characters such as 大亚湾 => searched as 大亚 亚湾,
> but only 大亚 is highlighted rather than 大亚湾)
>
> Is there some highlighting parameter that might fix this?
>
> Tom Burton-West
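
To make the overlap problem in the quoted question concrete, here is a minimal Python sketch. It is not Solr or Lucene code and does not reflect how the Solr highlighter is implemented. The function names (`bigrams`, `merge_spans`, `highlight`) and the `<em>` markers are hypothetical, chosen only for illustration. It splits text into overlapping character bigrams, finds the ones that also occur in the query, and merges overlapping hit spans before wrapping them, so that 大亚 and 亚湾 come out as a single highlighted 大亚湾.

```python
def bigrams(text):
    """Return overlapping character bigrams as (start, end, term) tuples."""
    return [(i, i + 2, text[i:i + 2]) for i in range(len(text) - 1)]


def merge_spans(spans):
    """Merge (start, end) spans that overlap or touch into single spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def highlight(text, query, pre="<em>", post="</em>"):
    """Wrap every run of matching bigrams in pre/post markers."""
    query_terms = {term for _, _, term in bigrams(query)}
    hits = [(s, e) for s, e, term in bigrams(text) if term in query_terms]
    out = []
    pos = 0
    for start, end in merge_spans(hits):
        out.append(text[pos:start])
        out.append(pre + text[start:end] + post)
        pos = end
    out.append(text[pos:])
    return "".join(out)


if __name__ == "__main__":
    print(highlight("广东大亚湾核电站", "大亚湾"))
```

Note that this sketch highlights any single matching bigram and also merges spans that merely touch, so it can mark text that is not a real match for the query phrase. That is the kind of result that looks wrong to a reader who knows the language.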