What triggered me to send this was seeing this:

> When per-field query structures differ, e.g. when one field's analyzer
> removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery
> structure when sow=false differs from that produced when sow=true.
> Briefly, sow=true produces a boolean query containing one dismax query
> per query term, while sow=false produces a dismax query containing one
> boolean query per field. Min-should-match processing does (what I think
> is) the right thing here. See
> TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis()
> for some examples of this.
>
> *Note*: when sow=false and all queried fields' query structure is the
> same, edismax does what it has always done: produce a boolean query
> containing one dismax query per term.

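To make the two shapes in the quoted note concrete, here is a toy Python sketch. It is not Solr code and does not call any Solr API: the function names, the `stopwords` map standing in for per-field analysis, and the printed query syntax are all assumptions made for illustration, and real parsed queries depend on the actual analyzers and on mm handling.

```python
# Toy illustration only: builds strings that mimic the two query shapes
# described above. "|" stands for dismax, a space for a boolean SHOULD clause.

def analyze(field, terms, stopwords):
    """Hypothetical per-field analysis: drop that field's stopwords."""
    return [t for t in terms if t not in stopwords.get(field, set())]

def sow_true_shape(terms, fields, stopwords):
    """Boolean query containing one dismax query per query term."""
    clauses = []
    for t in terms:
        per_field = [f"{f}:{t}" for f in fields if analyze(f, [t], stopwords)]
        if per_field:
            clauses.append("(" + " | ".join(per_field) + ")")
    return " ".join(clauses)

def sow_false_shape(terms, fields, stopwords):
    """Dismax query containing one boolean query per field."""
    clauses = []
    for f in fields:
        kept = analyze(f, terms, stopwords)
        if kept:
            clauses.append("(" + " ".join(f"{f}:{t}" for t in kept) + ")")
    return " | ".join(clauses)

terms = ["the", "elephant"]
fields = ["title", "text"]
stopwords = {"title": {"the"}}  # assumption: only title removes stopwords

print(sow_true_shape(terms, fields, stopwords))
print(sow_false_shape(terms, fields, stopwords))
```

In this toy setup the first shape keeps one clause per query term, while the second groups everything by field, which is the per-field behaviour the rest of this message is about.
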
So just be careful, because this switches edismax towards a per-field
dismax (correct me if I'm wrong here) as opposed to per-term. If I
understand this correctly, you may run into a different set of problems
along the albino elephant spectrum when sow=true.

On Wed, Mar 29, 2017 at 10:45 AM Doug Turnbull <dturnb...@opensourceconnections.com> wrote:

> So with regards to this JIRA
> (https://issues.apache.org/jira/browse/SOLR-9185), which makes Solr
> splitting on whitespace optional.
>
> I want to point out that there's not a simple fix to multi-term synonyms,
> in part because of specific tradeoffs. Splitting on whitespace is
> *sometimes a good thing*. Not splitting on whitespace (or enforcing some
> other cross-field consistent token splitting behavior) actually recreates
> an old problem that was the reason for creating dismax strategies in the
> first place. So I'm glad we're leaving the sow option :)
>
> If you're interested, this summarizes a bunch of historical research I did
> into Lucene code for my book, on why splitting on whitespace is often a
> good thing.
>
> Currently the behavior of edismax is intentionally designed to be
> term-centric. There's a bias towards having more of your query terms in a
> relevant hit. This comes out of an old problem called "albino elephant"
> that was the original reason dismax strategies came about. So if a user
> searches for
>
> albino elephant
>
> the original Lucene query parser for search across fields would do
> something like:
>
> (title:albino OR title:elephant) OR (text:albino OR text:elephant)
>
> With TF*IDF held constant for each term, a document that matches "albino"
> in two fields has the same value as a document that matches BOTH albino
> and elephant. Both get 2 "hits" in the OR query above. Most users consider
> this not good! I want albino elephants, not just albino things nor just
> elephant things!
>
> So DisjunctionMaxQuery came about because somebody realized that if they
> took the per-term maximum across fields, they could bias towards results
> that had more of the user's search terms.
>
> (title:albino | text:albino) OR (title:elephant | text:elephant)
>
> Here the highest scored result has BOTH search terms. So a result that has
> both elephant and albino will come to the top, which is what users
> typically expect.
>
> I call this strategy "term centric" -- it biases results towards documents
> with more of the user's search terms. I contrast this with "field centric"
> search, which focuses more on the specific analysis/matching behavior of
> one field (shingles/synonyms/auto phrasing/taxonomies/whatever).
>
> This strategy by necessity requires you to have a consistent, global
> definition of what's a "search term" independent of fields, either by a
> common analyzer across fields or by just splitting on whitespace. A common
> analyzer is what BlendedTermQuery in Lucene enforces (used by ES's
> cross_fields search).
>
> In other words, splitting on whitespace has *benefits* and *drawbacks*. The
> drawback is what we experience with Solr multiterm synonyms. If you have
> one field that breaks up by shingles/some multi-term synonym behavior and
> another field that tokenizes on whitespace, you can't easily pick the
> document with the "most search terms", as there's no consistent definition
> of search terms.
>
> I don't know where I'm going with this, but I want to point out that
> fixing multiterm synonyms won't have a silver bullet. People should still
> expect to be frustrated :). We should all be aware we likely recreate
> another problem with a simple fix to multiterm synonyms.
> I think there's value in some strategy that does something like
>
> - Base relevance with edismax, splitting on whitespace to bias towards
>   more search terms
> - Boosts with edismax w/o splitting on whitespace (or some other QP) to
>   layer in the effects you want for multiterm synonyms
>
> How you balance these ranking signals is tricky and domain specific, but I
> have found this sort of strategy balances both concerns
>
> Ok this probably should have just been a blog post, but I wanted to just
> use my history degree for something useful for a change...
>
> Best!
> -Doug
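
The albino elephant argument in the quoted message can be checked with a toy model. The sketch below is not Lucene scoring: it assumes every matching (field, term) clause scores exactly 1.0 (the "TF*IDF held constant" simplification), a dismax tie-breaker of 0, and two made-up documents. All names are hypothetical.

```python
# Toy model of the "albino elephant" problem. Each document maps a field
# name to the set of terms that field contains.
FIELDS = ["title", "text"]

def naive_or_score(doc, terms):
    """Plain OR across fields: every matching (field, term) pair adds 1.0."""
    return sum(1.0 for f in FIELDS for t in terms if t in doc.get(f, set()))

def per_term_dismax_score(doc, terms):
    """Term-centric: per term, take the max over fields, then sum over terms."""
    return sum(
        max(1.0 if t in doc.get(f, set()) else 0.0 for f in FIELDS)
        for t in terms
    )

albino_twice = {"title": {"albino"}, "text": {"albino"}}
albino_elephant = {"title": {"albino"}, "text": {"elephant"}}
query = ["albino", "elephant"]

for name, doc in [("albino_twice", albino_twice),
                  ("albino_elephant", albino_elephant)]:
    print(name, naive_or_score(doc, query), per_term_dismax_score(doc, query))
# prints:
# albino_twice 2.0 1.0
# albino_elephant 2.0 2.0
```

Under these assumptions the plain OR query scores both documents the same, while the per-term maximum ranks the document containing both terms higher, which is the term-centric bias described above.
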