Phrase query no hits when stopwords and FlattenGraphFilterFactory used

Edward Turner Fri, 06 Nov 2020 05:59:08 -0800

Hi all,

We are experiencing some unexpected behaviour for phrase queries which we
believe might be related to the FlattenGraphFilterFactory and stopwords.


Brief description: when performing phrase queries, we see the following:
"Molecular cloning and evolution of the" => we get the expected hits
"Molecular cloning and evolution of the genes" => we get no hits
(unexpected behaviour)

I think it's worthwhile adding the analyzers we use to help you see what
we're doing:
------------ Analyzers ----------------
<fieldType name="full_ci" class="solr.TextField"
   sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
         pattern="[- /()]+" />
      <filter class="solr.StopFilterFactory" words="stopwords.txt"
         ignoreCase="true" />
      <filter class="solr.ASCIIFoldingFilterFactory"
         preserveOriginal="false" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.WordDelimiterGraphFilterFactory"
         generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
         splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
         catenateNumbers="0" catenateWords="1" catenateAll="1" />
      <filter class="solr.FlattenGraphFilterFactory" />
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.SimplePatternSplitTokenizerFactory"
         pattern="[- /()]+" />
      <filter class="solr.StopFilterFactory" words="stopwords.txt"
         ignoreCase="true" />
      <filter class="solr.ASCIIFoldingFilterFactory"
         preserveOriginal="false" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.WordDelimiterGraphFilterFactory"
         generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
         splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
         catenateNumbers="0" catenateWords="0" catenateAll="0" />
   </analyzer>
</fieldType>
------------ End of Analyzers ----------------

------------ Stopwords ----------------
We use the following stopwords:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
of, on, or, such, that, the, their, then, there, these, they, this, to,
was, will, with, which
------------ End of Stopwords ----------------

------------ Analysis Admin page output ---------------
To show what happens at index and query time, I created a gist with an
image of the (non-verbose) output of the analysis admin page, using
"Molecular cloning and evolution of the genes" as both the index data and
the query:
https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png

Hopefully this link works, and you can see that the resulting terms and
positions are identical until the FlattenGraphFilterFactory step in the
"index" phase.

Final stage of index analysis:
(1)molecular (2)cloning (3) (4)evolution (5) (6)genes

Final stage of query analysis:
(1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes

The empty positions are presumably left by the removed stopwords.
------------ End of Analysis Admin page output ---------------
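
To make the mismatch concrete, here is a minimal sketch in Python of why a
one-position shift breaks an exact phrase match. It is a toy model and not
Lucene code: analyze() and phrase_matches() are hypothetical helpers, the
index positions are copied by hand from the admin page output above, and only
the tokenizer split, lowercasing and stopword removal are simulated.

```python
import re

# Stopwords as listed in this post.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with", "which",
}

def analyze(text):
    """Split on the tokenizer pattern, lowercase, and drop stopwords.

    Positions are 1-based and removed stopwords leave gaps, which mimics
    the query-side output shown on the analysis admin page.
    """
    raw_tokens = [t for t in re.split(r"[- /()]+", text) if t]
    tokens = []
    for pos, raw in enumerate(raw_tokens, start=1):
        term = raw.lower()
        if term not in STOPWORDS:
            tokens.append((pos, term))
    return tokens

def phrase_matches(index_tokens, query_tokens):
    """Exact (slop 0) match: every query term must occur in the index
    at the same offset from the first query term."""
    if not query_tokens:
        return False
    indexed = {}
    for pos, term in index_tokens:
        indexed.setdefault(term, set()).add(pos)
    first_pos, first_term = query_tokens[0]
    offsets = [(pos - first_pos, term) for pos, term in query_tokens]
    for start in indexed.get(first_term, set()):
        if all(start + off in indexed.get(term, set()) for off, term in offsets):
            return True
    return False

# Index-side positions as reported after FlattenGraphFilterFactory
# ("genes" at position 6 instead of 7).
INDEX_TOKENS = [(1, "molecular"), (2, "cloning"), (4, "evolution"), (6, "genes")]

short_query = analyze("Molecular cloning and evolution of the")
full_query = analyze("Molecular cloning and evolution of the genes")

print(full_query)
# [(1, 'molecular'), (2, 'cloning'), (4, 'evolution'), (7, 'genes')]
print(phrase_matches(INDEX_TOKENS, short_query))  # True
print(phrase_matches(INDEX_TOKENS, full_query))   # False
```

In this model the shorter phrase matches because its last surviving term is
"evolution" at position 4 on both sides, while the longer phrase fails
because the query expects "genes" three positions after "evolution" and the
index has it only two positions after.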

Main question:
Could someone explain why the FlattenGraphFilterFactory changes the
position of the "genes" token? From what we have seen, this happens after
the stopword "the" (but we have not checked exhaustively and are continuing
to test).

Perhaps we are doing something wrong in our analysis setup?

Any help would be much appreciated -- getting phrase queries to work is an
important use-case of ours.

Kind regards and thank you in advance,
Edd
--------------------
Edward Turner
