Thanks! To answer your questions, while I digest the rest of that information...
I'm using the hon-lucene-synonyms.5.0.4.jar from here:
https://github.com/healthonnet/hon-lucene-synonyms

The config looks like this, and IIRC it is simply a copy of the recommended
config from the site mentioned above.

<queryParser name="synonym_edismax" class="com.github.healthonnet.search.SynonymExpandingExtendedDismaxQParserPlugin">
  <!-- You can define more than one synonym analyzer in the following list.
       For example, you might have one set of synonyms for English, one for
       French, one for Spanish, etc. -->
  <lst name="synonymAnalyzers">
    <!-- Name your analyzer something useful, e.g. "analyzer_en",
         "analyzer_fr", "analyzer_es", etc. If you only have one, the name
         doesn't matter (hence "myCoolAnalyzer"). -->
    <lst name="myCoolAnalyzer">
      <!-- We recommend a PatternTokenizerFactory that tokenizes based on
           whitespace and quotes. This seems to work best with most people's
           synonym files. For details, read the discussion here:
           http://github.com/healthonnet/hon-lucene-synonyms/issues/26 -->
      <lst name="tokenizer">
        <str name="class">solr.PatternTokenizerFactory</str>
        <str name="pattern"><![CDATA[(?:\s|\")+]]></str>
      </lst>
      <!-- The ShingleFilterFactory outputs synonyms of multiple token lengths
           (e.g. unigrams, bigrams, trigrams, etc.). The default here is to
           assume you don't have any synonyms longer than 4 tokens. You can
           tweak this depending on what your synonyms look like. E.g. if you
           only have unigrams, you can remove it entirely, and if your
           synonyms are up to 7 tokens in length, you should set the
           maxShingleSize to 7. -->
      <lst name="filter">
        <str name="class">solr.ShingleFilterFactory</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <!-- This is where you set your synonym file. For the unit tests and
           "Getting Started" examples, we use example_synonym_file.txt.
           This plugin will work best if you keep expand set to true and have
           all your synonyms comma-separated (rather than =>-separated). -->
      <lst name="filter">
        <str name="class">solr.SynonymFilterFactory</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">example_synonym_file.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>

On Thu, Aug 11, 2016 at 6:01 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

>
> : First let me say that this is very possibly the "x - y problem" so let me
> : state up front what my ultimate need is -- then I'll ask about the thing I
> : imagine might help... which, of course, is heavily biased in the direction
> : of my experience coding Java and writing SQL...
>
> Thank you so much for asking your question this way!
>
> Right off the bat, the background you've provided seems suspicious...
>
> : I have a piece of a query that calculates a score based on a "weighting"
> ...
> : The specific line is this:
> : <str name="bf">product(field(category_weight),20)</str>
> :
> : What I just realized is that when I query Solr for a string that has NO
> : matches in the entire corpus, I still get a slew of results because EVERY
> : doc has the weighting value in the category_weight field - and therefore
> : every doc gets some score.
>
> ...that is *NOT* how dismax and edismax normally work.
>
> While both the "bf" and "bq" params result in "additive" boosting, and the
> implementation of that "additive boost" comes from adding new optional
> clauses to the top level BooleanQuery that is executed, that only happens
> after the "main" query (from your "q" param) is added to that top level
> BooleanQuery as a "mandatory" clause.
>
> So, for example, "bf=true()" and "bq=*:*" should match & boost every doc,
> but with the techproducts configs/data these requests still don't match
> anything...
>
> /select?defType=edismax&q=bogus&bf=true()&bq=*:*&debug=query
> /select?defType=dismax&q=bogus&bf=true()&bq=*:*&debug=query
>
> ...and if you look at the debug output, the parsed queries show that the
> "bogus" part of the query is mandatory...
>
> +DisjunctionMaxQuery((text:bogus)) MatchAllDocsQuery(*:*)
> FunctionQuery(const(true))
>
> (i didn't use "pf" in that example, but the effect is the same, the "pf"
> based clauses are optional, while the "qf" based clauses are mandatory)
>
> If you compare that example to your debug output, you'll notice a
> difference in structure -- it's a bit hard to see in your example, but if
> you simplify your qf, pf, and q fields it should be more obvious, but
> AFAICT the "main" parts of your query are getting wrapped in an extra
> layer of parens (ie: an extra BooleanQuery) which is *not* mandatory in
> the top level query ... i don't see *any* mandatory clauses in your top
> level BooleanQuery, which is why any match on a bf or bq function is
> enough to cause a document to match.
>
> I suspect the reason your parsed query structure is so different has to do
> with this...
>
> : <str name="defType">synonym_edismax</str>>
>
>
> 1) how exactly is "synonym_edismax" defined in your solrconfig.xml?
> 2) what QParserPlugin are you using to implement that?
>
> I suspect whatever QParserPlugin you are using has a bug in it :)
>
>
> If you can't fix the bug, one possible workaround would be to abandon bf
> and bq params completely, and instead wrap the query it produces in a
> {!boost} parser with whatever function you want (using functions like
> sum() or product() to combine multiple functions, and query() to incorporate
> your current bq param). Doing this will require changing how you specify
> your input (example below) and it will result in *multiplicative* boosts --
> so your scores will be much different, and you will likely have to adjust
> your constants, but: 1) multiplicative boosts are almost always what people
> *really* want anyway; 2) it will ensure the boosts are only applied for
> things matching your main query, no matter how that query parser works or
> what bugs it has.
>
> Example of using {!boost} to wrap an arbitrary other parser...
>
> instead of...
> defType=foofoo
> q=barbarbar
>
> use...
> q={!boost b=$func defType=foofoo v=$qq}
> qq=barbarbar
> func=sum(something,somethingelse)
>
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers
> https://cwiki.apache.org/confluence/display/solr/Function+Queries
>
>
>
>
> :
> : What I would like is to return zero results if there is no match for the
> : querystring. My collection is small enough that I don't care if the actual
> : calculation runs on each doc (although that's wasteful) -- I just don't
> : want to see results come back for zero matches to the querystring
> :
> : (The /select endpoint does this of course, but my custom endpoint includes
> : this "weighting" piece and therefore returns every doc in the corpus
> : because they all have the weighting.)
> :
> : ====================
> : Enter my imagined solution... The potential X-Y problem...
> : ====================
> :
> : So - given that I come from a programming background, I immediately start
> : thinking of an if statement ...
> :
> : if(some_score_for_the_primary_search_string) {
> : run_the_category_weight_calculation;
> : } else {
> : do_NOT_run_category_weight_calc;
> : }
> :
> :
> : Another way of thinking of it would be something like the "WHERE" clause in
> : SQL...
> :
> : run_category_weight_calculation WHERE "searchstring" is found in the
> : document, not otherwise.
> :
> : I'm aware that things could be handled in the client-side of my web app,
> : but if possible, I'd like the interface to SOLR to be as clean as possible,
> : and massage incoming SOLR data as little as possible.
> :
> : In other words, do NOT return any docs if the querystring (and any
> : synonyms) match zero docs.
> :
> : Here is the endpoint XML for the query. I've highlighted the specific line
> : that is causing the unintended results...
> :
> :
> : <requestHandler name="/foo" class="solr.SearchHandler">
> : <!-- default values for query parameters can be specified, these
> : will be overridden by parameters in the request
> : -->
> : <lst name="defaults">
> : <str name="echoParams">all</str>
> : <int name="rows">20</int>
> : <!-- Query settings -->
> : <str name="df">text</str>
> : <!-- <str name="df">title</str> -->
> : <str name="defType">synonym_edismax</str>>
> : <str name="synonyms">true</str>
> : <!-- The line below balances out the weighting of exact matches to the
> : synonym phrase entered by the user
> : with the category_weight calculation and the titleQuery calc.
> : These numbers exist in a balance and
> : if one is raised or lowered, the others (probably) need to change
> : as well. It may be better to go with decimals
> : for all of them... .4 instead of 4 and 2 instead of 20 and 2.5
> : instead of 25.
> : In the end, I'm not sure it really matters, but don't change one
> : without changing the others
> : unless you've tested and are sure you want the results -->
> : <float name="synonyms.originalBoost">1.5</float>
> : <float name="synonyms.synonymBoost">1.1</float>
> : <str name="mm">75%</str>
> : <str name="q.alt">*:*</str>
> : <str name="rows">20</str>
> : <str name="fq">meta_doc_type:chapterDoc</str>
> : <str name="bq">{!synonym_edismax qf='title' synonyms='true'
> : synonyms.originalBoost='2.5' synonyms.synonymBoost='1.1' bf='' bq=''
> : v=$q}</str>
> : <str name="fl">id category_weight title category_ss score
> : contentType</str>
> : <str name="titleQuery">{!edismax qf='title' bf='' bq='' v=$q}</str>
> : =====================================================
> : *<str name="bf">product(field(category_weight),20)</str>*
> : =====================================================
> : <str name="bf">product(query($titleQuery),4)</str>
> : <str name="qf">text contentType^1000</str>
> : <str name="wt">python</str>
> : <str name="debug">true</str>
> : <str name="debug.explain.structured">true</str>
> : <str name="indent">true</str>
> : <str name="echoParams">all</str>
> : </lst>
> : </requestHandler>
> :
> : And here is the debug output for a query. (This was a test for synonyms,
> : which you'll see in the output.) The original query string was, of
> : course, "μ-heavy
> : chain disease"
> :
> : You'll note that although there is no score in the first doc explain for
> : the actual querystring, the highlighted section does get a score for
> : product(double(category_weight)=1.5,const(20))
> :
> : ... which is the thing that is currently causing all the docs in the
> : collection to "match" even though the querystring is not in any of them.
> :
> : "debug":{ "rawquerystring":"\"μ-heavy chain disease\"",
> : "querystring":"\"μ-heavy
> : chain disease\"", "parsedquery":"(DisjunctionMaxQuery((text:\"μ heavy chain
> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> : ((+DisjunctionMaxQuery((text:\"mu heavy chain disease\" | (contentType:\"mu
> : heavy chain disease\")^1000.0)))/no_coord^1.1)
> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> : hcd\")^1000.0)))/no_coord^1.1) ((+DisjunctionMaxQuery((text:\"μ heavy chain
> : disease\" | (contentType:\"μ heavy chain disease\")^1000.0)))/no_coord^1.1)
> : ((+DisjunctionMaxQuery((text:\"μ hcd\" | (contentType:\"μ
> : hcd\")^1000.0)))/no_coord^1.1)) ((DisjunctionMaxQuery((title:\"μ heavy
> : chain disease\"))^2.5 ((+DisjunctionMaxQuery((title:\"mu heavy chain
> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> : hcd\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ heavy chain
> : disease\")))/no_coord^1.1) ((+DisjunctionMaxQuery((title:\"μ
> : hcd\")))/no_coord^1.1)))
> : FunctionQuery(product(double(category_weight),const(20)))
> : FunctionQuery(product(query(+(title:\"μ heavy chain
> : disease\"),def=0.0),const(4)))", "parsedquery_toString":"(((text:\"μ heavy
> : chain disease\" | (contentType:\"μ heavy chain disease\")^1000.0))^1.5
> : ((+(text:\"mu heavy chain disease\" | (contentType:\"mu heavy chain
> : disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
> : hcd\")^1000.0))^1.1) ((+(text:\"μ heavy chain disease\" | (contentType:\"μ
> : heavy chain disease\")^1000.0))^1.1) ((+(text:\"μ hcd\" | (contentType:\"μ
> : hcd\")^1000.0))^1.1)) ((((title:\"μ heavy chain disease\"))^2.5
> : ((+(title:\"mu heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)
> : ((+(title:\"μ heavy chain disease\"))^1.1) ((+(title:\"μ hcd\"))^1.1)))
> : product(double(category_weight),const(20)) product(query(+(title:\"μ heavy
> : chain disease\"),def=0.0),const(4))", "explain":{
> : "33d808fe-6ccf-4305-a643-48e94de34d18":{ "match":true, "value":30.0,
> : "description":"sum of:", "details":[{ "match":true, "value":30.0,
> : "description":"FunctionQuery(product(double(category_weight),const(20))),
> : product of:",
> : =====================================================
> : *"details":**[{ "match":true, "value":30.0,
> : "description":"product(double(category_weight)=1.5,const(20))"}, {*
> : =====================================================
> :
> : "match":true, "value":1.0, "description":"boost"}, { "match":true, "value":
> : 1.0, "description":"queryNorm"}]}, {
> :
>
> -Hoss
> http://www.lucidworks.com/
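
A minimal sketch of the {!boost} workaround described in the quoted reply, written in Python only to show how the request parameters fit together. It assumes the /foo handler and the synonym_edismax parser from the config above, and that the bf and bq defaults have been removed from that handler. The helper name boost_wrapped_params is hypothetical, and reusing product(field(category_weight),20) as the multiplicative boost is an untested assumption; the constants would likely need retuning.

```python
from urllib.parse import urlencode, parse_qs


def boost_wrapped_params(user_query, func, inner_parser="synonym_edismax"):
    """Build request params that wrap the main query in a {!boost} parser.

    Hypothetical helper: it mirrors the q / qq / func layout from the quoted
    example, so the boost function only applies to docs matching the main query.
    """
    return {
        # The outer parser is {!boost}; the inner parser is chosen via defType.
        "q": "{!boost b=$func defType=%s v=$qq}" % inner_parser,
        # The user's actual search string, referenced above as $qq.
        "qq": user_query,
        # The boost function, referenced above as $func (assumed value).
        "func": func,
    }


params = boost_wrapped_params(
    '"μ-heavy chain disease"',
    "product(field(category_weight),20)",
)

# URL-encode the params; "/foo" is the custom handler from the config above.
query_string = urlencode(params)
print("/foo?" + query_string)
```

Because the boost is multiplicative here, a document that does not match the wrapped query is not returned at all, whatever its category_weight.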