Re: The downsides of not splitting on whitespace in edismax (the old albino elephant prob)

Steve Rowe Wed, 29 Mar 2017 08:36:22 -0700

Thanks Doug, excellent analysis!

In implementing the SOLR-9185 changes, I considered a compromise approach to 
the term-centric / field-centric axis you describe in the case of differing 
field analysis pipelines: finding common source-text-offset bounded slices in 
all per-field queries, and then producing dismax queries over these slices; 
this is a generalization of what happens in the sow=true case, where slice 
points are pre-determined by whitespace.  However, it looked really complicated 
to maintain source text offsets with queries (if you’re interested, you can see 
an example of the kind of thing I’m talking about in my initial patch on 
<https://issues.apache.org/jira/browse/LUCENE-7533>, which I ultimately decided 
against committing), so I decided to go with per-field dismax when structural 
differences are encountered in the per-field queries.  While I won’t be doing 
any work on this short term, I still think the above-described approach could 
improve the situation in the sow=false/differing-field-analysis case.  Patches 
welcome!
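
To make the two query shapes concrete (one dismax per term versus one dismax 
over per-field boolean queries), here is a toy Python sketch. Everything in it 
is an assumption for illustration: the two stand-in analyzers, the field names, 
the function names, and the string notation. It does not call Solr and does not 
reproduce the parser's actual output.

```python
# Toy stand-ins for two differing field analysis pipelines (hypothetical).
STOPWORDS = {"the"}

def analyze_plain(text):
    """Lowercase and split on whitespace; keeps stopwords."""
    return text.lower().split()

def analyze_no_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

# Assumed field -> analyzer mapping, for illustration only.
ANALYZERS = {"title": analyze_plain, "text": analyze_no_stopwords}

def per_term_dismax(query):
    """Split on whitespace first, then build one dismax clause per chunk."""
    clauses = []
    for chunk in query.split():
        alts = [f"{field}:{tok}"
                for field, analyze in ANALYZERS.items()
                for tok in analyze(chunk)]
        if alts:
            clauses.append("(" + " | ".join(alts) + ")")
    return " ".join(clauses)

def per_field_dismax(query):
    """Analyze the whole query per field, then take one dismax over the fields."""
    alts = []
    for field, analyze in ANALYZERS.items():
        toks = analyze(query)
        if toks:
            alts.append("(" + " ".join(f"{field}:{t}" for t in toks) + ")")
    return "(" + " | ".join(alts) + ")"

print(per_term_dismax("the albino elephant"))
print(per_field_dismax("the albino elephant"))
```

In the per-term form the stopword only affects its own clause; in the per-field 
form the two fields contribute boolean queries of different lengths, which is 
the structural difference described above.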


One copy/paste-o in your writeup (I think), illustrating term-centric dismax 
queries:

>> (title:albino | title:albino) OR (text:elephant | text:elephant)


This should instead be:

(title:albino | text:albino) OR (title:elephant | text:elephant)  
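
As a rough illustration of why the corrected form matters, here is a toy 
scoring sketch. It assumes every term match scores exactly 1.0 (TF*IDF held 
constant) and a dismax tie-breaker of 0; the documents and function names are 
made up for the example and this is not how Lucene actually computes scores.

```python
# Toy model: each term match scores 1.0, dismax tie-breaker is 0 (assumptions).
FIELDS = ("title", "text")
TERMS = ("albino", "elephant")

def match(doc, field, term):
    """Return 1.0 if the term occurs in the field, else 0.0."""
    return 1.0 if term in doc.get(field, "").split() else 0.0

def boolean_or_score(doc):
    """(title:albino OR title:elephant) OR (text:albino OR text:elephant):
    sum every field/term match."""
    return sum(match(doc, f, t) for f in FIELDS for t in TERMS)

def term_centric_score(doc):
    """(title:albino | text:albino) OR (title:elephant | text:elephant):
    take the max across fields per term, then sum over terms."""
    return sum(max(match(doc, f, t) for f in FIELDS) for t in TERMS)

# Hypothetical documents: one matches "albino" twice, one matches both terms.
albino_twice = {"title": "albino rabbit", "text": "an albino pet"}
both_terms = {"title": "albino elephant", "text": "a zoo report"}

for name, doc in (("albino_twice", albino_twice), ("both_terms", both_terms)):
    print(name, boolean_or_score(doc), term_centric_score(doc))
```

Under these assumptions the plain OR query scores both documents the same, 
while the per-term maximum ranks the document containing both terms higher.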

--
Steve
www.lucidworks.com

> On Mar 29, 2017, at 10:49 AM, Doug Turnbull 
> <dturnb...@opensourceconnections.com> wrote:
> 
> What triggered me to send this was seeing this
> 
>> When per-field query structures differ, e.g. when one field's analyzer
> removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery
> structure when sow=false differs from that produced when sow=true. Briefly,
> sow=true produces a boolean query containing one dismax query per query
> term, while sow=false produces a dismax query containing one boolean query
> per field. Min-should-match processing does (what I think is) the right
> thing here. See
> TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis() for
> some examples of this. *Note*: when sow=false and all queried fields' query
> structure is the same, edismax does what it has always done: produce a
> boolean query containing one dismax query per term.
> 
> So just be careful because this switches edismax towards a per-field dismax
> (correct me if I'm wrong here) as opposed to per-term. If I understand this
> correctly, you may run into a different set of problems along the albino
> elephant spectrum when sow=true
> 
> On Wed, Mar 29, 2017 at 10:45 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> 
>> So with regard to this JIRA (
>> https://issues.apache.org/jira/browse/SOLR-9185), which makes Solr's
>> splitting on whitespace optional:
>> 
>> I want to point out that there's not a simple fix to multi-term synonyms
>> in part because of specific tradeoffs. Splitting on whitespace is *sometimes
>> a good thing*. Not splitting on whitespace (or enforcing some other
>> cross-field consistent token splitting behavior) actually recreates an old
>> problem that was the reason for creating dismax strategies in the first
>> place. So I'm glad we're leaving the sow option :)
>> 
>> If you're interested, this summarizes a bunch of historical research I did
>> into Lucene code for my book for why splitting on whitespace is often a
>> good thing
>> 
>> Currently the behavior of edismax is intentionally designed to be
>> term-centric. There's a bias towards having more of your query terms in a
>> relevant hit. This comes out of an old problem called "albino elephant"
>> that was the original reason dismax strategies came about. So if a user
>> searches for
>> 
>> albino elephant
>> 
>> The original Lucene query parser for search across fields would do
>> something like:
>> 
>> (title:albino OR title:elephant) OR (text:albino OR text:elephant)
>> 
>> TF*IDF held constant for each term, a document that matches "albino" in
>> two fields has the same value as a document that matches BOTH albino and
>> elephant. Both get 2 "hits" in the OR query above. Most users consider this
>> not good! I want albino elephants, not just albino things nor just elephant
>> things!
>> 
>> So DisjunctionMaxQuery came about because somebody realized that if they
>> took the per-term maximum, they could bias towards results that had more of
>> the user's search terms.
>> 
>> (title:albino | title:albino) OR (text:elephant | text:elephant)
>> 
>> Here the highest scored result has BOTH search terms. So a result that has
>> both elephant and albino will come to the top. What users typically expect.
>> 
>> I call this strategy "term centric" -- it biases results towards documents
>> with more of the user's search terms. I contrast this with "field centric"
>> search which focuses more on the specific analysis/matching behavior of one
>> field (shingles/synonyms/auto phrasing/taxonomies/whatever)
>> 
>> This strategy by necessity requires you to have a consistent, global
>> definition of what's a "search term" independent of fields either by a
>> common analyzer across fields or by just splitting on whitespace. A common
>> analyzer is what BlendedTermQuery in Lucene enforces (used by ES's
>> cross_field search)
>> 
>> In other words splitting on whitespace has *benefits* and *drawbacks.* The
>> drawback is what we experience with Solr multiterm synonyms. If you have
>> one field that breaks up by shingles/some multi-term synonym behavior and
>> another field that tokenizes on whitespace, you can't easily pick the
>> document with the "most search terms" as there's no consistent definition
>> of search terms.
>> 
>> I don't know where I'm going with this, but I want to point out that
>> there is no silver bullet for fixing multiterm synonyms. People should still
>> expect to be frustrated :). We should all be aware that a simple fix to
>> multiterm synonyms likely recreates another problem. I think there's
>> value in some strategy that does something like
>> 
>> - Base relevance with edismax, splitting on whitespace to bias towards
>> more search terms
>> - Boosts with edismax w/o splitting on whitespace (or some other QP) to
>> layer in the effects you want for multiterm synonyms
>> 
>> How you balance these ranking signals is tricky and domain specific, but I
>> have found this sort of strategy balances both concerns
>> 
>> Ok this probably should have just been a blog post, but I wanted to just
>> use my history degree for something useful for a change...
>> Best!
>> -Doug
>>
