(resending to solr-usr per Chris H; all Asian characters removed from examples to avoid filter)

I'm getting phrase queries instead of implicit "OR" queries with Asian text. I first noticed it with the Dismax query handler, but it also happens with the Standard query handler.
Of course Asian text is broken up into N-gram pairs; I understand that. But after analysis (via the Web UI) the 2-character "words" still have spaces between them, so I'd expect results similar to an English sentence, which also has spaces.

English: (default field title_en)
    User Query: I need help with my iPod
    Generates: title_en:i title_en:need title_en:help title_en:with title_en:my title_en:ipod

Japanese: (default field title_cjk)
    User Query: iPodC1C2C3C4C5C6C7...
    Generates: PhraseQuery(title_cjk:"ipod C1C2 C2C3 C3C4 C4C5 C5C6 C6C7")

The problem is that the cjk phrase queries are too rigid: everything has to match. Although setting phrase slop helps with proximity, I don't think you can tell it not to require 100% of the bigrams to be present.

What I'd like is just:
    title_cjk:ipod title_cjk:C1C2 title_cjk:C2C3 title_cjk:C3C4 etc...

The only theory I have so far, from looking through the code and mailing list comments, is that this might have something to do with token offsets. Though the start of each token is 1 past the previous one, the tokens do overlap by 1 char each time. I'm not sure that's it, nor what the logic would be. Bumping the increments from 1 to 3 or 4 would make them no longer overlap, if that's all there is to it.

Ideally I'd like the cjk queries to be structured the same as the English ones. It would also be better if this could be done with just schema or config changes, though I realize that's not as likely.

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
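
To make the difference between the two query shapes concrete, here is a minimal Python sketch. It is a toy model only and does not use Solr or Lucene: the helper names (overlapping_bigrams, phrase_query, or_query, phrase_matches, or_matches) are made up for illustration, C1..C7 stand in for the removed Asian characters, and the matching functions only approximate what a phrase query (with no slop) and an implicit-OR query require of a document.

```python
def overlapping_bigrams(chars):
    """Pair each character with its successor: [C1, C2, C3] -> [C1C2, C2C3]."""
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def phrase_query(field, tokens):
    """The shape currently generated for the cjk field: one quoted phrase."""
    return '{}:"{}"'.format(field, " ".join(tokens))


def or_query(field, tokens):
    """The desired shape: one independent optional clause per token."""
    return " ".join("{}:{}".format(field, t) for t in tokens)


def phrase_matches(doc_tokens, query_tokens):
    """Toy phrase match: every query token must appear, adjacent and in order."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


def or_matches(doc_tokens, query_tokens):
    """Toy implicit-OR match: a single shared token is enough."""
    return any(t in doc_tokens for t in query_tokens)


# Placeholders for the seven Asian characters in the example query.
query_chars = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
tokens = ["ipod"] + overlapping_bigrams(query_chars)

# A hypothetical document title that lacks the last character, C7.
doc_tokens = ["ipod"] + overlapping_bigrams(query_chars[:-1])

print(phrase_query("title_cjk", tokens))
print(or_query("title_cjk", tokens))
print("phrase matches:", phrase_matches(doc_tokens, tokens))
print("OR matches:", or_matches(doc_tokens, tokens))
```

In this model the document that is missing a single bigram (C6C7) fails the phrase form but still matches the OR form, which is the behavior I'm after.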