Re: Want zero results from SOLR when there are no matches for "querystring"

Erick Erickson Fri, 12 Aug 2016 12:22:52 -0700

Maybe rerankqparserplugin?

On Aug 12, 2016 11:54, "John Bickerstaff" <j...@johnbickerstaff.com> wrote:


> @Hossman --  thanks again.
>
> I've made the following change and so far things look good.  I couldn't see
> debug or find results for what I put in for $func, so I just removed it,
> but making modifications as you suggested appears to be working.
>
> Including the actual line from my endpoint XML in case this thread helps
> someone else...
>
> <str name="q">{!boost defType=synonym_edismax qf='title' synonyms='true'
> synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
> v=$q}</str>
>
> On Fri, Aug 12, 2016 at 12:09 PM, John Bickerstaff <
> j...@johnbickerstaff.com
> > wrote:
>
> > Thanks!  I'll check it out.
> >
> > On Fri, Aug 12, 2016 at 12:05 PM, Susheel Kumar <susheel2...@gmail.com>
> > wrote:
> >
> >> Not exactly sure what you are looking from chaining the results but
> >> similar
> >> functionality is available in Streaming expressions where result of
> inner
> >> expressions are passed to outer expressions and so on
> >> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
> >>
> >> HTH
> >> Susheel
> >>
> >> On Fri, Aug 12, 2016 at 1:08 PM, John Bickerstaff <
> >> j...@johnbickerstaff.com>
> >> wrote:
> >>
> >> > Hossman - many thanks again for your comprehensive and very helpful
> >> answer!
> >> >
> >> > All,
> >> >
> >> > I remember (possibly mis-remembering) reading something about being able
> >> > to pass the results of one query to another query...  Essentially
> >> > "chaining" result sets.
> >> >
> >> > I have looked in docs and can't find anything on a quick search -- I
> may
> >> > have been reading about the Re-Ranking feature, which doesn't help me
> (I
> >> > know because I just tried and it seems to return all results anyway,
> >> just
> >> > re-ranking the number specified in the reRankDocs flag...)
> >> >
> >> > Is there a way to (cleanly) send the results of one query to another
> >> query
> >> > for further processing?  Essentially, pass ONLY the results (including
> >> an
> >> > empty set of results) to another query for processing?
> >> >
> >> > thanks...
> >> >
> >> > On Thu, Aug 11, 2016 at 6:19 PM, John Bickerstaff <
> >> > j...@johnbickerstaff.com>
> >> > wrote:
> >> >
> >> > > Thanks!
> >> > >
> >> > > To answer your questions, while I digest the rest of that
> >> information...
> >> > >
> >> > > I'm using the hon-lucene-synonyms.5.0.4.jar from here:
> >> > > https://github.com/healthonnet/hon-lucene-synonyms
> >> > >
> >> > > The config looks like this - and IIRC, is simply a copy from the
> >> > > recommended config on the site mentioned above.
> >> > >
> >> > >  <queryParser name="synonym_edismax" class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
> >> > >     <!-- You can define more than one synonym analyzer in the
> >> following
> >> > > list.
> >> > >          For example, you might have one set of synonyms for
> English,
> >> one
> >> > > for French,
> >> > >          one for Spanish, etc.
> >> > >       -->
> >> > >     <lst name="synonymAnalyzers">
> >> > >       <!-- Name your analyzer something useful, e.g. "analyzer_en",
> >> > > "analyzer_fr", "analyzer_es", etc.
> >> > >            If you only have one, the name doesn't matter (hence
> >> > > "myCoolAnalyzer").
> >> > >         -->
> >> > >       <lst name="myCoolAnalyzer">
> >> > >         <!-- We recommend a PatternTokenizerFactory that tokenizes
> >> based
> >> > > on whitespace and quotes.
> >> > >              This seems to work best with most people's synonym
> files.
> >> > >              For details, read the discussion here:
> >> > > http://github.com/healthonnet/hon-lucene-synonyms/issues/26
> >> > >           -->
> >> > >         <lst name="tokenizer">
> >> > >           <str name="class">solr.PatternTokenizerFactory</str>
> >> > >           <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
> >> > >         </lst>
> >> > >         <!-- The ShingleFilterFactory outputs synonyms of multiple
> >> token
> >> > > lengths (e.g. unigrams, bigrams, trigrams, etc.).
> >> > >              The default here is to assume you don't have any
> synonyms
> >> > > longer than 4 tokens.
> >> > >              You can tweak this depending on what your synonyms look
> >> > like.
> >> > > E.g. if you only have unigrams, you can remove
> >> > >              it entirely, and if your synonyms are up to 7 tokens in
> >> > > length, you should set the maxShingleSize to 7.
> >> > >           -->
> >> > >         <lst name="filter">
> >> > >           <str name="class">solr.ShingleFilterFactory</str>
> >> > >           <str name="outputUnigramsIfNoShingles">true</str>
> >> > >           <str name="outputUnigrams">true</str>
> >> > >           <str name="minShingleSize">2</str>
> >> > >           <str name="maxShingleSize">4</str>
> >> > >         </lst>
> >> > >         <!-- This is where you set your synonym file.  For the unit
> >> tests
> >> > > and "Getting Started" examples, we use example_synonym_file.txt.
> >> > >              This plugin will work best if you keep expand set to
> true
> >> > and
> >> > > have all your synonyms comma-separated (rather than =>-separated).
> >> > >           -->
> >> > >         <lst name="filter">
> >> > >           <str name="class">solr.SynonymFilterFactory</str>
> >> > >           <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
> >> > >           <str name="synonyms">example_synonym_file.txt</str>
> >> > >           <str name="expand">true</str>
> >> > >           <str name="ignoreCase">true</str>
> >> > >         </lst>
> >> > >       </lst>
> >> > >     </lst>
> >> > >   </queryParser>
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <
> >> > hossman_luc...@fucit.org
> >> > > > wrote:
> >> > >
> >> > >>
> >> > >> : First let me say that this is very possibly the "x - y problem"
> so
> >> let
> >> > >> me
> >> > >> : state up front what my ultimate need is -- then I'll ask about
> the
> >> > >> thing I
> >> > >> : imagine might help...  which, of course, is heavily biased in the
> >> > >> direction
> >> > >> : of my experience coding Java and writing SQL...
> >> > >>
> >> > >> Thank you so much for asking your question this way!
> >> > >>
> >> > >> Right off the bat, the background you've provided seems suspicious...
> >> > >>
> >> > >> : I have a piece of a query that calculates a score based on a
> >> > "weighting"
> >> > >>         ...
> >> > >> : The specific line is this:
> >> > >> : <str name="bf">product(field(category_weight),20)</str>
> >> > >> :
> >> > >> : What I just realized is that when I query Solr for a string that
> >> has
> >> > NO
> >> > >> : matches in the entire corpus, I still get a slew of results
> because
> >> > >> EVERY
> >> > >> : doc has the weighting value in the category_weight field - and
> >> > therefore
> >> > >> : every doc gets some score.
> >> > >>
> >> > >> ...that is *NOT* how dismax and edismax normally work.
> >> > >>
> >> > >> While both the "bf" and "bq" params result in "additive" boosting, and
> >> > >> the implementation of that "additive boost" comes from adding new
> >> > >> optional clauses to the top level BooleanQuery that is executed, that
> >> > >> only happens after the "main" query (from your "q" param) is added to
> >> > >> that top level BooleanQuery as a "mandatory" clause.
> >> > >>
> >> > >> So, for example, "bf=true()" and "bq=*:*" should match & boost every
> >> > >> doc, but with the techproducts configs/data these requests still don't
> >> > >> match anything...
> >> > >>
> >> > >> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> >> > >> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
> >> > >>
> >> > >> ...and if you look at the debug output, the parsed queries show that
> >> > >> the "bogus" part of the query is mandatory...
> >> > >>
> >> > >> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
> >> > >> FunctionQuery(const(true))
> >> > >>
> >> > >> (i didn't use "pf" in that example, but the effect is the same, the
> >> > >> "pf" based clauses are optional, while the "qf" based clauses are
> >> > >> mandatory)
> >> > >>
> >> > >> If you compare that example to your debug output, you'll notice a
> >> > >> difference in structure -- it's a bit hard to see in your example, but
> >> > >> if you simplify your qf, pf, and q fields it should be more obvious,
> >> > >> but AFAICT the "main" parts of your query are getting wrapped in an
> >> > >> extra layer of parens (ie: an extra BooleanQuery) which is *not*
> >> > >> mandatory in the top level query ... i don't see *any* mandatory
> >> > >> clauses in your top level BooleanQuery, which is why any match on a bf
> >> > >> or bq function is enough to cause a document to match.
> >> > >>
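The clause semantics described above can be sketched as a toy model in Python. This is only an illustration: the class and field names (ToyBooleanQuery, must, should) are hypothetical and are not Lucene's API. It assumes Lucene's default rule that a BooleanQuery with at least one mandatory clause does not need its optional clauses to match, while a query made only of optional clauses matches as soon as any one of them does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Doc = Dict[str, object]
Clause = Callable[[Doc], bool]


@dataclass
class ToyBooleanQuery:
    """Hypothetical stand-in for a top level BooleanQuery (not Lucene code)."""
    must: List[Clause]    # mandatory clauses, shown with a leading "+" in debug output
    should: List[Clause]  # optional clauses, e.g. those added for bf and bq

    def matches(self, doc: Doc) -> bool:
        if self.must:
            # Optional clauses only add to the score; they cannot rescue a doc.
            return all(clause(doc) for clause in self.must)
        # No mandatory clause at all: any optional clause is enough to match.
        return any(clause(doc) for clause in self.should)


def main_query(doc: Doc) -> bool:
    # Stands in for the user's "q"; the word "bogus" is in no document.
    return "bogus" in str(doc["text"])


def bf_clause(doc: Doc) -> bool:
    # A function query such as product(field(category_weight),20) matches every doc.
    return True


doc = {"text": "an unrelated chapter", "category_weight": 1.5}

# Structure Hoss expects from dismax/edismax: main query is mandatory.
expected = ToyBooleanQuery(must=[main_query], should=[bf_clause])
# Structure seen in the posted debug output: nothing is mandatory.
observed = ToyBooleanQuery(must=[], should=[main_query, bf_clause])

print(expected.matches(doc))  # False
print(observed.matches(doc))  # True
```

With the second structure every document in the collection matches, which is the behaviour reported at the start of the thread.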
> >> > >> I suspect the reason your parsed query structure is so different has
> >> > >> to do with this...
> >> > >>
> >> > >> :        <str name="defType">synonym_edismax</str>>
> >> > >>
> >> > >>
> >> > >> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
> >> > >> 2) what QParserPlugin are you using to implement that?
> >> > >>
> >> > >> I suspect whatever QParserPlugin you are using has a bug in it :)
> >> > >>
> >> > >>
> >> > >> If you can't fix the bug, one possible workaround would be to abandon
> >> > >> the bf and bq params completely, and instead wrap the query it
> >> > >> produces in a {!boost} parser with whatever function you want (using
> >> > >> functions like sum() or product() to combine multiple functions, and
> >> > >> query() to incorporate your current bq param).  Doing this will
> >> > >> require changing how you specify your input (example below) and it
> >> > >> will result in *multiplicative* boosts -- so your scores will be much
> >> > >> different, and you will likely have to adjust your constants, but:
> >> > >> 1) multiplicative boosts are almost always what people *really* want
> >> > >> anyway; 2) it will ensure the boosts are only applied for things
> >> > >> matching your main query, no matter how that query parser works or
> >> > >> what bugs it has.
> >> > >>
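To make the additive versus multiplicative difference concrete, here is a small arithmetic sketch in Python. The function names are hypothetical and the main-query score of 2.0 is an invented value for illustration; only the boost of 1.5 * 20 = 30.0 comes from the explain output posted in this thread. A score of None is used here to mean "the document did not match the main query".

```python
from typing import Optional

# From the thread: product(double(category_weight)=1.5,const(20)) = 30.0
CATEGORY_BOOST = 1.5 * 20


def additive_all_optional(main_score: Optional[float], boost: float) -> float:
    """Models the reported structure: every clause is optional, so the
    boost alone is enough to give a document a score."""
    return (main_score or 0.0) + boost


def multiplicative(main_score: Optional[float], boost: float) -> Optional[float]:
    """Models wrapping the main query in {!boost}: only documents that
    match the main query are scored at all."""
    if main_score is None:
        return None
    return main_score * boost


# A document that matches the main query with a made-up score of 2.0.
print(additive_all_optional(2.0, CATEGORY_BOOST))   # 32.0
print(multiplicative(2.0, CATEGORY_BOOST))          # 60.0

# A document that does not match the main query.
print(additive_all_optional(None, CATEGORY_BOOST))  # 30.0, still returned
print(multiplicative(None, CATEGORY_BOOST))         # None, not returned
```

The different magnitudes (32.0 versus 60.0) are why the constants would likely need retuning after the switch.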
> >> > >> Example of using {!boost} to wrap an arbitrary other parser...
> >> > >>
> >> > >> instead of...
> >> > >>   defType=foofoo
> >> > >>   q=barbarbar
> >> > >>
> >> > >> use...
> >> > >>   q={!boost b=$func defType=foofoo v=$qq}
> >> > >>   qq=barbarbar
> >> > >>   func=sum(something,somethingelse)
> >> > >>
> >> > >> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
> >> > >> https://cwiki.apache.org/confluence/display/solr/Function+Queries
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >> :
> >> > >> : What I would like is to return zero results if there is no match
> >> for
> >> > the
> >> > >> : querystring.  My collection is small enough that I don't care if
> >> the
> >> > >> actual
> >> > >> : calculation runs on each doc (although that's wasteful) -- I just
> >> > don't
> >> > >> : want to see results come back for zero matches to the querystring
> >> > >> :
> >> > >> : (The /select endpoint does this of course, but my custom endpoint
> >> > >> includes
> >> > >> : this "weighting" piece and therefore returns every doc in the
> >> corpus
> >> > >> : because they all have the weighting.
> >> > >> :
> >> > >> : ====================
> >> > >> : Enter my imagined solution...  The potential X-Y problem...
> >> > >> : ====================
> >> > >> :
> >> > >> : So - given that I come from a programming background, I
> immediately
> >> > >> start
> >> > >> : thinking of an if statement ...
> >> > >> :
> >> > >> :      if(some_score_for_the_primary_search_string) {
> >> > >> :           run_the_category_weight_calculation;
> >> > >> :      } else {
> >> > >> :           do_NOT_run_category_weight_calc;
> >> > >> :      }
> >> > >> :
> >> > >> :
> >> > >> : Another way of thinking of it would be something like the "WHERE"
> >> > >> clause in
> >> > >> : SQL...
> >> > >> :
> >> > >> :  run_category_weight_calculation WHERE "searchstring" is found
> in
> >> the
> >> > >> : document, not otherwise.
> >> > >> :
> >> > >> : I'm aware that things could be handled in the client-side of my
> web
> >> > app,
> >> > >> : but if possible, I'd like the interface to SOLR to be as clean as
> >> > >> possible,
> >> > >> : and massage incoming SOLR data as little as possible.
> >> > >> :
> >> > >> : In other words, do NOT return any docs if the querystring (and
> any
> >> > >> : synonyms) match zero docs.
> >> > >> :
> >> > >> : Here is the endpoint XML for the query.  I've highlighted the
> >> specific
> >> > >> line
> >> > >> : that is causing the unintended results...
> >> > >> :
> >> > >> :
> >> > >> :  <requestHandler name="/foo" class="solr.SearchHandler">
> >> > >> :     <!-- default values for query parameters can be specified,
> >> these
> >> > >> :          will be overridden by parameters in the request
> >> > >> :       -->
> >> > >> :      <lst name="defaults">
> >> > >> :        <str name="echoParams">all</str>
> >> > >> :        <int name="rows">20</int>
> >> > >> :        <!-- Query settings -->
> >> > >> :        <str name="df">text</str>
> >> > >> :       <!-- <str name="df">title</str> -->
> >> > >> :        <str name="defType">synonym_edismax</str>>
> >> > >> :        <str name="synonyms">true</str>
> >> > >> :     <!-- The line below balances out the weighting of exact
> >> matches to
> >> > >> the
> >> > >> : synonym phrase entered by the user
> >> > >> :          with the category_weight calculation and the titleQuery
> >> calc.
> >> > >> : These numbers exist in a balance and
> >> > >> :          if one is raised or lowered, the others (probably) need
> to
> >> > >> change
> >> > >> : as well.  It may be better to go with decimals
> >> > >> :          for all of them... .4 instead of 4 and 2 instead of 20
> and
> >> > 2.5
> >> > >> : instead of 25.
> >> > >> :          In the end, I'm not sure it really matters, but don't
> >> change
> >> > >> one
> >> > >> : without changing the others
> >> > >> :          unless you've tested and are sure you want the results
> >> -->
> >> > >> :        <float name="synonyms.originalBoost">1.5</float>
> >> > >> :        <float name="synonyms.synonymBoost">1.1</float>
> >> > >> :        <str name="mm">75%</str>
> >> > >> :        <str name="q.alt">*:*</str>
> >> > >> :        <str name="rows">20</str>
> >> > >> :        <str name="fq">meta_doc_type:chapterDoc</str>
> >> > >> :        <str name="bq">{!synonym_edismax qf='title'
> synonyms='true'
> >> > >> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf=''
> >> bq=''
> >> > >> : v=$q}</str>
> >> > >> :        <str name="fl">id category_weight title category_ss score
> >> > >> : contentType</str>
> >> > >> :        <str name="titleQuery">{!edismax qf='title' bf='' bq=''
> >> > >> v=$q}</str>
> >> > >> : =====================================================
> >> > >> :        *<str name="bf">product(field(category_weight),20)</str>*
> >> > >> : =====================================================
> >> > >> :        <str name="bf">product(query($titleQuery),4)</str>
> >> > >> :        <str name="qf">text contentType^1000</str>
> >> > >> :        <str name="wt">python</str>
> >> > >> :        <str name="debug">true</str>
> >> > >> :        <str name="debug.explain.structured">true</str>
> >> > >> :        <str name="indent">true</str>
> >> > >> :        <str name="echoParams">all</str>
> >> > >> :      </lst>
> >> > >> :   </requestHandler>
> >> > >> :
> >> > >> : And here is the debug output for a query.  (This was a test for
> >> > >> synonyms,
> >> > >> : which you'll see in the output.) The original query string was,
> of
> >> > >> : course, "μ-heavy
> >> > >> : chain disease"
> >> > >> :
> >> > >> : You'll note that although there is no score in the first doc
> >> explain
> >> > for
> >> > >> : the actual querystring, the highlighted section does get a score
> >> for
> >> > >> : product(double(category_weight)=1.5,const(20))
> >> > >> :
> >> > >> : ... which is the thing that is currently causing all the docs in
> >> the
> >> > >> : collection to "match" even though the querystring is not in any
> of
> >> > them.
> >> > >> :
> >> > >> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
> >> > >> : "querystring":"\"μ-heavy
> >> > >> : chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ
> >> heavy
> >> > >> chain
> >> > >> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> >> > >> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" |
> >> > >> (contentType:\"mu
> >> > >> : heavy chain disease\")^1000.0)))/no_coord^1.1)
> >> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> >> > >> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ
> >> heavy
> >> > >> chain
> >> > >> : disease\" | (contentType:\"μ heavy chain
> >> > disease\")^1000.0)))/no_coord^
> >> > >> 1.1)
> >> > >> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> >> > >> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ
> >> > heavy
> >> > >> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy
> >> chain
> >> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> >> > >> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy
> >> chain
> >> > >> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> >> > >> : hcd\")))/no_coord^1.1)))
> >> > >> : FunctionQuery(product(double(category_weight),const(20)))
> >> > >> : FunctionQuery(product(query(+(title:\"μ heavy chain
> >> > >> : disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((tex
> >> t:\"μ
> >> > >> heavy
> >> > >> : chain disease\" | (contentType:\"μ heavy chain
> >> disease\")^1000.0))^1.5
> >> > >> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy
> chain
> >> > >> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
> >> > >> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" |
> >> > >> (contentType:\"μ
> >> > >> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" |
> >> > >> (contentType:\"μ
> >> > >> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
> >> > >> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ
> >> hcd\"))^1.1)
> >> > >> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ
> >> > hcd\"))^1.1)))
> >> > >> : product(double(category_weight),const(20))
> >> product(query(+(title:\"μ
> >> > >> heavy
> >> > >> : chain disease\"),def=0.0),const(4))", "explain":{ "
> >> > >> : 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true,
> >> "value":30.0, "
> >> > >> : description":"sum of:", "details":[{ "match":true, "value":30.0,
> "
> >> > >> : description":"FunctionQuery(product(double(category_weight),
> >> > >> const(20))),
> >> > >> : product of:",
> >> > >> : =====================================================
> >> > >> : *"details":**[{ "match":true, "value":30.0,
> >> > >> : "description":"product(double(category_weight)=1.5,const(20))"},
> >> {*
> >> > >> : =====================================================
> >> > >> :
> >> > >> : "match":true, "value":1.0, "description":"boost"}, {
> "match":true,
> >> > >> "value":
> >> > >> : 1.0, "description":"queryNorm"}]}, {
> >> > >> :
> >> > >>
> >> > >> -Hoss
> >> > >> http://www.lucidworks.com/
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >
> >
>
