So, with regard to this JIRA (https://issues.apache.org/jira/browse/SOLR-9185), which makes Solr's splitting on whitespace optional:

I want to point out that there's no simple fix for multi-term synonyms, in part because of specific tradeoffs. Splitting on whitespace is *sometimes a good thing*. Not splitting on whitespace (or not enforcing some other cross-field consistent token splitting behavior) actually recreates an old problem that was the reason for creating dismax strategies in the first place. So I'm glad we're leaving the sow option :)

If you're interested, this summarizes a bunch of historical research I did into the Lucene code for my book, on why splitting on whitespace is often a good thing.

Currently the behavior of edismax is intentionally designed to be term-centric. There's a bias towards having more of your query terms in a relevant hit. This comes out of an old problem called "albino elephant" that was the original reason dismax strategies came about.

So if a user searches for

    albino elephant

the original Lucene query parser, searching across fields, would do something like:

    (title:albino OR title:elephant) OR (text:albino OR text:elephant)

With TF*IDF held constant for each term, a document that matches "albino" in two fields has the same value as a document that matches BOTH albino and elephant. Both get 2 "hits" in the OR query above. Most users consider this not good! I want albino elephants, not just albino things nor just elephant things!

So DisjunctionMaxQuery came about because somebody realized that if they took the per-term maximum across fields, they could bias towards results that had more of the user's search terms:

    (title:albino | text:albino) OR (title:elephant | text:elephant)

Here the highest scored result has BOTH search terms. So a result that has both elephant and albino will come to the top, which is what users typically expect.

I call this strategy "term centric": it biases results towards documents with more of the user's search terms.

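To make the albino elephant arithmetic concrete, here is a toy sketch in Python. It is not Lucene's actual scoring: the two documents are made up, every term match is worth a flat 1.0 (TF*IDF held constant, as above), and things like the dismax tie breaker and field norms are ignored.

```python
# Hypothetical toy documents; every term match scores 1.0 (TF*IDF held constant).
DOCS = {
    "albino_only": {"title": "albino rabbit", "text": "an albino rabbit"},
    "albino_elephant": {"title": "albino animals", "text": "a rare elephant"},
}
FIELDS = ("title", "text")


def match_score(doc, field, term):
    """Return 1.0 if the whitespace-split field contains the term, else 0.0."""
    return 1.0 if term in doc[field].split() else 0.0


def boolean_or_score(doc, terms):
    """(title:albino OR title:elephant) OR (text:albino OR text:elephant):
    every field/term match simply adds to the score."""
    return sum(match_score(doc, f, t) for f in FIELDS for t in terms)


def dismax_score(doc, terms):
    """(title:albino | text:albino) OR (title:elephant | text:elephant):
    take the per-term maximum across fields, then sum over terms."""
    return sum(max(match_score(doc, f, t) for f in FIELDS) for t in terms)


if __name__ == "__main__":
    terms = "albino elephant".split()  # split on whitespace
    for name, doc in DOCS.items():
        print(name, boolean_or_score(doc, terms), dismax_score(doc, terms))
```

Under the plain OR query both documents tie, because "albino" matching in two fields counts as much as matching two different terms. Under the per-term maximum, the document with both terms scores higher than the one that only has "albino".
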
I contrast this with "field centric" search, which focuses more on the specific analysis/matching behavior of one field (shingles/synonyms/auto phrasing/taxonomies/whatever).

Term-centric search, by necessity, requires you to have a consistent, global definition of what a "search term" is, independent of fields, either through a common analyzer across fields or by just splitting on whitespace. A common analyzer is what BlendedTermQuery in Lucene enforces (used by ES's cross_fields search).

In other words, splitting on whitespace has *benefits* and *drawbacks*. The drawback is what we experience with Solr multi-term synonyms. If you have one field that breaks up by shingles or some multi-term synonym behavior, and another field that tokenizes on whitespace, you can't easily pick the document with the "most search terms", as there's no consistent definition of search terms.

I don't know where I'm going with this, but I want to point out that there is no silver bullet for fixing multi-term synonyms. People should still expect to be frustrated :). We should all be aware that a simple fix to multi-term synonyms likely recreates another problem.

I think there's value in some strategy that does something like:

- Base relevance with edismax, splitting on whitespace to bias towards more search terms
- Boosts with edismax w/o splitting on whitespace (or some other QP) to layer in the effects you want for multi-term synonyms

How you balance these ranking signals is tricky and domain specific, but I have found this sort of strategy balances both concerns.

OK, this probably should have just been a blog post, but I wanted to use my history degree for something useful for a change...

Best!
-Doug