The default for sow changed from true to false between 6.x and 7.x, which
is why the problem has appeared for you, and why setting sow=true in your
defaults solves it.
With sow=true, I would expect your query to be broken into three parts,
and then tokenised:
ABC4856.21
AND
-field1:ABC4856.21
With sow=false, the whole query will be tokenised in one go, so one of
the query analysers on the fields being searched is behaving differently
depending on the string passed.
Does the parsed query (in the debugQuery=true output) give any
indication of the differences between the two versions? What analysis is
done on the fields being queried?
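
One way to compare the two behaviours on the same 7.4 instance is to send the query twice with sow set explicitly and debugQuery=true, then look at the parsed query each time. A minimal Python sketch that only builds the two URLs, assuming the host and collection name from your original query (build_debug_url is a hypothetical helper name, not part of Solr):

```python
from urllib.parse import urlencode

# Host and collection are taken from the original query; adjust as needed.
BASE_URL = "http://localhost:8983/solr/collection/select"
QUERY = "ABC4856.21 AND -field1:ABC4856.21"


def build_debug_url(sow: bool) -> str:
    """Return a select URL that sets sow explicitly and requests debug output."""
    params = {
        "q": QUERY,
        "rows": 0,
        "wt": "json",
        "indent": "on",
        "sow": "true" if sow else "false",  # overrides the handler default
        "debugQuery": "true",               # adds the parsed query to the response
    }
    # urlencode percent-encodes the query string (spaces become '+').
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    for sow in (True, False):
        print(build_debug_url(sow))
```

Fetching each URL and comparing the parsedquery entries in the debug section should show where the two settings diverge.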
Thanks,
Matt
On 01/02/2019 16:55, Oakley, Craig (NIH/NLM/NCBI) [C] wrote:
We had a problem when upgrading from Solr 6.6 to Solr 7.4 in that a query
ceased to work.
The query was of the form
http://localhost:8983/solr/collection/select?indent=on&q=ABC4856.21%20AND%20-field1:ABC4856.21&wt=json&rows=0
Basically finding a count of those records where there is some field which has
"ABC4856.21", but where the field field1 does not have that string (in other words, where
there is some field other than field1 which has "ABC4856.21")
For this particular collection, running the query against Solr 6.6 resulted in "response":{"numFound":0
(which was correct), but running it against Solr 7.4 resulted in "response":{"numFound":21322074
After some investigation, it seemed to be a problem with the initial "ABC4856.21" being tokenized
as "ABC4856" and "21"
We found various work-arounds such as putting quotation marks around the string or adding
"*:" after the "q="; but the user wanted the exact same query to work in Solr
7.4 as it had in Solr 6.6
Eventually, we found a solution by adding "<str name="sow">true</str>" to the Select
handler in solrconfig.xml (for "Split On Whitespace").
This solution seems to be sufficient; but we would like to be sure we
understand the solution.
Looking at lucene.apache.org/solr/guide/7_4/tokenizers.html#standard-tokenizer
it would seem that the period should not split the string into two tokens.
Could someone clarify how we can know which Tokenizer is used when, and which
definition of whitespace is used when?
Thanks
--
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk