Re: Want zero results from SOLR when there are no matches for "querystring"

John Bickerstaff Thu, 11 Aug 2016 17:19:53 -0700

Thanks!

To answer your questions, while I digest the rest of that information...


I'm using the hon-lucene-synonyms.5.0.4.jar from here:
https://github.com/healthonnet/hon-lucene-synonyms

The config looks like this - and IIRC, it is simply a copy of the
recommended config on the site mentioned above.

 <queryParser name="synonym_edismax"
class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
    <!-- You can define more than one synonym analyzer in the following
list.
         For example, you might have one set of synonyms for English, one
for French,
         one for Spanish, etc.
      -->
    <lst name="synonymAnalyzers">
      <!-- Name your analyzer something useful, e.g. "analyzer_en",
"analyzer_fr", "analyzer_es", etc.
           If you only have one, the name doesn't matter (hence
"myCoolAnalyzer").
        -->
      <lst name="myCoolAnalyzer">
        <!-- We recommend a PatternTokenizerFactory that tokenizes based on
whitespace and quotes.
             This seems to work best with most people's synonym files.
             For details, read the discussion here:
http://github.com/healthonnet/hon-lucene-synonyms/issues/26
          -->
        <lst name="tokenizer">
          <str name="class">solr.PatternTokenizerFactory</str>
          <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
        </lst>
        <!-- The ShingleFilterFactory outputs synonyms of multiple token
lengths (e.g. unigrams, bigrams, trigrams, etc.).
             The default here is to assume you don't have any synonyms
longer than 4 tokens.
             You can tweak this depending on what your synonyms look like.
E.g. if you only have unigrams, you can remove
             it entirely, and if your synonyms are up to 7 tokens in
length, you should set the maxShingleSize to 7.
          -->
        <lst name="filter">
          <str name="class">solr.ShingleFilterFactory</str>
          <str name="outputUnigramsIfNoShingles">true</str>
          <str name="outputUnigrams">true</str>
          <str name="minShingleSize">2</str>
          <str name="maxShingleSize">4</str>
        </lst>
        <!-- This is where you set your synonym file.  For the unit tests
and "Getting Started" examples, we use example_synonym_file.txt.
             This plugin will work best if you keep expand set to true and
have all your synonyms comma-separated (rather than =>-separated).
          -->
        <lst name="filter">
          <str name="class">solr.SynonymFilterFactory</str>
          <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
          <str name="synonyms">example_synonym_file.txt</str>
          <str name="expand">true</str>
          <str name="ignoreCase">true</str>
        </lst>
      </lst>
    </lst>
  </queryParser>



On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <hossman_luc...@fucit.org>
wrote:

>
> : First let me say that this is very possibly the "x - y problem" so let me
> : state up front what my ultimate need is -- then I'll ask about the thing I
> : imagine might help...  which, of course, is heavily biased in the direction
> : of my experience coding Java and writing SQL...
>
> Thank you so much for asking your question this way!
>
> Right off the bat, the background you've provided seems suspicious...
>
> : I have a piece of a query that calculates a score based on a "weighting"
>         ...
> : The specific line is this:
> : <str name="bf">product(field(category_weight),20)</str>
> :
> : What I just realized is that when I query Solr for a string that has NO
> : matches in the entire corpus, I still get a slew of results because EVERY
> : doc has the weighting value in the category_weight field - and therefore
> : every doc gets some score.
>
> ...that is *NOT* how dismax and edismax normally work.
>
> While both the "bf" and "bq" params result in "additive" boosting, and the
> implementation of that "additive boost" comes from adding new optional
> clauses to the top level BooleanQuery that is executed, that only happens
> after the "main" query (from your "q" param) is added to that top level
> BooleanQuery as a "mandatory" clause.
>
> So, for example, "bf=true()" and "bq=*:*" should match & boost every doc,
> but with the techproducts configs/data these requests still don't match
> anything...
>
> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
>
> ...and if you look at the debug output, the parsed query shows that the
> "bogus" part of the query is mandatory...
>
> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
> FunctionQuery(const(true))
>
> (i didn't use "pf" in that example, but the effect is the same, the "pf"
> based clauses are optional, while the "qf" based clauses are mandatory)
>
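The clause semantics described above can be sketched as a toy model. This is not Lucene code: `Clause`, `boolean_query_matches`, and the sample document are hypothetical names invented for illustration, and the model assumes default settings (no minimum-should-match, no prohibited clauses).

```python
from dataclasses import dataclass
from typing import Callable, List

Doc = dict  # toy document: field name -> value


@dataclass
class Clause:
    matches: Callable[[Doc], bool]  # does this clause match the document?
    required: bool                  # True models a '+' (mandatory) clause


def boolean_query_matches(clauses: List[Clause], doc: Doc) -> bool:
    """Toy model of how a top-level boolean query decides whether a doc matches."""
    required = [c for c in clauses if c.required]
    optional = [c for c in clauses if not c.required]
    if required:
        # Every mandatory clause must match; optional clauses only add score.
        return all(c.matches(doc) for c in required)
    # With no mandatory clause, a single matching optional clause is enough.
    return any(c.matches(doc) for c in optional)


doc = {"text": "heavy chain disease", "category_weight": 1.5}


def main_query(d: Doc) -> bool:
    # Stands in for q=bogus against the "text" field.
    return "bogus" in d["text"].split()


def boost_function(d: Doc) -> bool:
    # Stands in for bf=product(field(category_weight),20): it matches any
    # document that has a category_weight value.
    return "category_weight" in d


# Shape produced by stock dismax/edismax: main query mandatory, boost optional.
expected_shape = [Clause(main_query, True), Clause(boost_function, False)]
# Shape reported in this thread: nothing at the top level is mandatory.
observed_shape = [Clause(main_query, False), Clause(boost_function, False)]

print(boolean_query_matches(expected_shape, doc))  # False
print(boolean_query_matches(observed_shape, doc))  # True
```

With the mandatory flag on the main query the document is rejected; without it, the boost function alone is enough to produce a match, which is the behaviour reported in this thread.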
> If you compare that example to your debug output, you'll notice a
> difference in structure -- it's a bit hard to see in your example, but if
> you simplify your qf, pf, and q fields it should be more obvious, but
> AFAICT the "main" parts of your query are getting wrapped in an extra
> layer of parens (ie: an extra BooleanQuery) which is *not* mandatory in
> the top level query ... i don't see *any* mandatory clauses in your top
> level BooleanQuery, which is why any match on a bf or bq function is
> enough to cause a document to match.
>
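One rough way to check a "parsedquery" string for a top-level mandatory clause is to split it on whitespace that sits outside parentheses and quotes, then look for a leading "+". The helper names below are hypothetical, the second sample string is an abbreviated stand-in for the real debug output, and the sketch assumes the string has already been decoded (plain quotes, not escaped ones).

```python
from typing import List


def top_level_clauses(parsed_query: str) -> List[str]:
    """Split a parsedquery string on whitespace outside parentheses and quotes."""
    clauses: List[str] = []
    current: List[str] = []
    depth = 0
    in_quotes = False
    for ch in parsed_query:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and ch == "(":
            depth += 1
        elif not in_quotes and ch == ")":
            depth -= 1
        if ch.isspace() and depth == 0 and not in_quotes:
            # A top-level separator: close the clause collected so far.
            if current:
                clauses.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        clauses.append("".join(current))
    return clauses


def has_mandatory_clause(parsed_query: str) -> bool:
    """True if any top-level clause carries the '+' (mandatory) prefix."""
    return any(c.startswith("+") for c in top_level_clauses(parsed_query))


# Parsed query from the stock edismax example in this thread.
stock = ("+DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*) "
         "FunctionQuery(const(true))")

# Abbreviated stand-in for the synonym_edismax output: the '+' signs are
# buried inside an extra wrapping group, not at the top level.
wrapped = ('(DisjunctionMaxQuery((text:"mu hcd"))^1.5 '
           '((+DisjunctionMaxQuery((text:"mu heavy chain disease")))^1.1)) '
           'FunctionQuery(product(double(category_weight),const(20)))')

print(top_level_clauses(stock))
print(has_mandatory_clause(stock))    # True
print(has_mandatory_clause(wrapped))  # False
```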
> I suspect the reason your parsed query structure is so diff has to do with
> this...
>
> :        <str name="defType">synonym_edismax</str>>
>
>
> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
> 2) what QParserPlugin are you using to implement that?
>
> I suspect whatever QParserPlugin you are using has a bug in it :)
>
>
> If you can't fix the bug, one possible workaround would be to abandon bf
> and bq params completely, and instead wrap the query it produces in a
> {!boost} parser with whatever function you want (using functions like
> sum() or product() to combine multiple functions, and query() to incorporate
> your current bq param).  Doing this will require changing how you specify
> your input (example below) and it will result in *multiplicative* boosts --
> so your scores will be much diff, and you will likely have to adjust your
> constants, but: 1) multiplicative boosts are almost always what people
> *really* want anyway; 2) it will ensure the boosts are only applied for
> things matching your main query, no matter how that query parser works or
> what bugs it has.
>
> Example of using {!boost} to wrap an arbitrary other parser...
>
> instead of...
>   defType=foofoo
>   q=barbarbar
>
> use...
>   q={!boost b=$func defType=foofoo v=$qq}
>   qq=barbarbar
>   func=sum(something,somethingelse)
>
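Applied to the setup in this thread, the {!boost} wrapper could be assembled on the client roughly as follows. This is only a sketch: `boosted_params` is a hypothetical helper, the `/foo` handler and the `category_weight` function are taken from the config quoted below, and it assumes the `bf` and `bq` defaults have been removed from that handler. Because the boost becomes multiplicative, the constant 20 would probably need retuning.

```python
from urllib.parse import parse_qs, urlencode


def boosted_params(user_query: str, boost_function: str, parser: str) -> dict:
    """Build request params that wrap `parser` in a {!boost} query."""
    return {
        # The local params select the boost parser; $func and $qq are
        # dereferenced from the request params of the same names.
        "q": "{!boost b=$func defType=" + parser + " v=$qq}",
        "qq": user_query,
        "func": boost_function,
    }


params = boosted_params(
    user_query='"μ-heavy chain disease"',
    boost_function="product(field(category_weight),20)",
    parser="synonym_edismax",
)

# urlencode percent-encodes the braces, quotes and non-ASCII characters.
query_string = urlencode(params)
print("/foo?" + query_string)
```

Keeping the user's text in its own `qq` parameter means it never has to be escaped into the local-params string.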
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
> https://cwiki.apache.org/confluence/display/solr/Function+Queries
>
>
>
>
> :
> : What I would like is to return zero results if there is no match for the
> : querystring.  My collection is small enough that I don't care if the actual
> : calculation runs on each doc (although that's wasteful) -- I just don't
> : want to see results come back for zero matches to the querystring
> :
> : (The /select endpoint does this of course, but my custom endpoint includes
> : this "weighting" piece and therefore returns every doc in the corpus
> : because they all have the weighting.)
> :
> : ====================
> : Enter my imagined solution...  The potential X-Y problem...
> : ====================
> :
> : So - given that I come from a programming background, I immediately start
> : thinking of an if statement ...
> :
> :      if(some_score_for_the_primary_search_string) {
> :           run_the_category_weight_calculation;
> :      } else {
> :           do_NOT_run_category_weight_calc;
> :      }
> :
> :
> : Another way of thinking of it would be something like the "WHERE" clause in
> : SQL...
> :
> :  run_category_weight_calculation WHERE "searchstring" is found in the
> : document, not otherwise.
> :
> : I'm aware that things could be handled in the client-side of my web app,
> : but if possible, I'd like the interface to SOLR to be as clean as possible,
> : and massage incoming SOLR data as little as possible.
> :
> : In other words, do NOT return any docs if the querystring (and any
> : synonyms) match zero docs.
> :
> : Here is the endpoint XML for the query.  I've highlighted the specific line
> : that is causing the unintended results...
> :
> :
> :  <requestHandler name="/foo" class="solr.SearchHandler">
> :     <!-- default values for query parameters can be specified, these
> :          will be overridden by parameters in the request
> :       -->
> :      <lst name="defaults">
> :        <str name="echoParams">all</str>
> :        <int name="rows">20</int>
> :        <!-- Query settings -->
> :        <str name="df">text</str>
> :       <!-- <str name="df">title</str> -->
> :        <str name="defType">synonym_edismax</str>>
> :        <str name="synonyms">true</str>
> :     <!-- The line below balances out the weighting of exact matches to the
> :          synonym phrase entered by the user with the category_weight
> :          calculation and the titleQuery calc.  These numbers exist in a
> :          balance and if one is raised or lowered, the others (probably) need
> :          to change as well.  It may be better to go with decimals for all of
> :          them... .4 instead of 4 and 2 instead of 20 and 2.5 instead of 25.
> :          In the end, I'm not sure it really matters, but don't change one
> :          without changing the others unless you've tested and are sure you
> :          want the results  -->
> :        <float name="synonyms.originalBoost">1.5</float>
> :        <float name="synonyms.synonymBoost">1.1</float>
> :        <str name="mm">75%</str>
> :        <str name="q.alt">*:*</str>
> :        <str name="rows">20</str>
> :        <str name="fq">meta_doc_type:chapterDoc</str>
> :        <str name="bq">{!synonym_edismax qf='title' synonyms='true'
> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
> : v=$q}</str>
> :        <str name="fl">id category_weight title category_ss score
> : contentType</str>
> :        <str name="titleQuery">{!edismax qf='title' bf='' bq='' v=$q}</str>
> : =====================================================
> :        *<str name="bf">product(field(category_weight),20)</str>*
> : =====================================================
> :        <str name="bf">product(query($titleQuery),4)</str>
> :        <str name="qf">text contentType^1000</str>
> :        <str name="wt">python</str>
> :        <str name="debug">true</str>
> :        <str name="debug.explain.structured">true</str>
> :        <str name="indent">true</str>
> :        <str name="echoParams">all</str>
> :      </lst>
> :   </requestHandler>
> :
> : And here is the debug output for a query.  (This was a test for synonyms,
> : which you'll see in the output.) The original query string was, of
> : course, "μ-heavy
> : chain disease"
> :
> : You'll note that although there is no score in the first doc explain for
> : the actual querystring, the highlighted section does get a score for
> : product(double(category_weight)=1.5,const(20))
> :
> : ... which is the thing that is currently causing all the docs in the
> : collection to "match" even though the querystring is not in any of them.
> :
> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
> : "querystring":"\"μ-heavy
> : chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ heavy
> chain
> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" |
> (contentType:\"mu
> : heavy chain disease\")^1000.0)))/no_coord^1.1)
> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ heavy
> chain
> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0)))/no_coord^
> 1.1)
> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ heavy
> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy chain
> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy chain
> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> : hcd\")))/no_coord^1.1)))
> : FunctionQuery(product(double(category_weight),const(20)))
> : FunctionQuery(product(query(+(title:\"μ heavy chain
> : disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((text:\"μ
> heavy
> : chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain
> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" |
> (contentType:\"μ
> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" |
> (contentType:\"μ
> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)
> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)))
> : product(double(category_weight),const(20)) product(query(+(title:\"μ
> heavy
> : chain disease\"),def=0.0),const(4))", "explain":{ "
> : 33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true, "value":30.0, "
> : description":"sum of:", "details":[{ "match":true, "value":30.0, "
> : description":"FunctionQuery(product(double(category_weight),const(20))),
> : product of:",
> : =====================================================
> : *"details":**[{ "match":true, "value":30.0,
> : "description":"product(double(category_weight)=1.5,const(20))"}, {*
> : =====================================================
> :
> : "match":true, "value":1.0, "description":"boost"}, { "match":true,
> "value":
> : 1.0, "description":"queryNorm"}]}, {
> :
>
> -Hoss
> http://www.lucidworks.com/
