Questions about filters and scoring

Reece Mon, 18 Feb 2008 12:57:31 -0800

Hello Everyone,

First off, sorry about the thread hijack earlier, it was not intentional.


Back to the point, though: I'm having some issues getting
Solr to work with our dataset.  I'm using it to index ticket data for
our technical support department.  Below are a few of the problems
I've been having; the wiki hasn't had much to say about them.

1) As an example, searching for "binarydata_groupdocument_fk" returns
nothing, while searching for "BinaryData_GroupDocument_FK" returns
results.  I have LowerCaseFilterFactory applied to both the index
and query analyzers.  Does this not actually lowercase everything?
The wiki at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters says
"Creates tokens by lowercasing all letters and dropping non-letters",
but that does not seem to be happening here.  Am I forgetting to
configure something?

2) Some of our documents are a single sentence.  Some are over 5 MB of
text.  When searching for a term, the one-sentence documents come back
first because the fieldNorm values are so different (0.4 for one, 0.002
for others).  Is there a way to disable fieldNorm in the score
calculation?  An alternative I tried was posting parts of the data
as separate values of the field (so having multiple tags with that
field name in the add XML post), but that appeared to have zero effect
on the results - even the query debug output showed the exact same
calculation for the search.  Does anyone know how to disable the
fieldNorm, or how to have the score computed by adding the scores from
each value of a multivalued field?

3) I discovered that searching for '"certificate not found"' (using
the double quotes for a phrase here) did not return any results, even
though the phrase did exist (and was lower case originally too, so
this differs from my first issue).  I found it was because of the
stopword "not", but the same StopFilterFactory is applied to both the
index and query analyzers.  Am I doing something wrong there?  As a
workaround I'm having PHP manually remove stopwords from the
query string, which is a real pain.  I suspect my filters aren't being
applied correctly, since this is similar to issue #1 but with a different
filter.
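
Regarding question 2, one setting that may be relevant is the omitNorms
attribute on a field declaration in schema.xml, which turns off length
normalization (and index-time boosts) for that field.  A minimal sketch,
assuming a hypothetical field named "body" that uses the "text" type from
this message; the field name and the other attribute values are
illustrative only, and the data would need to be reindexed after the change:

```xml
<!-- hypothetical field; omitNorms="true" is the only point of this sketch -->
<field name="body" type="text" indexed="true" stored="true"
       multiValued="true" omitNorms="true"/>
```

With norms omitted, a 5 MB document and a one-sentence document are no
longer scaled by their length, so this removes the short-document
preference entirely rather than softening it.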

Here is the fieldType I run the actual searches against:

   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <!-- in this example, we will only use synonyms at query time
       <filter class="solr.SynonymFilterFactory"
               synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
       -->
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory"
               generateWordParts="1" generateNumberParts="1" catenateWords="1"
               catenateNumbers="1" catenateAll="0"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory"
               protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.SynonymFilterFactory"
               synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true"
               words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory"
               generateWordParts="1" generateNumberParts="1" catenateWords="0"
               catenateNumbers="0" catenateAll="0"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EnglishPorterFilterFactory"
               protected="protwords.txt"/>
       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     </analyzer>
   </fieldType>
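
Regarding question 1, a possible explanation is that
WordDelimiterFilterFactory runs before LowerCaseFilterFactory in this
chain and splits on case changes as well as on underscores, so the
mixed-case and all-lowercase forms may be broken into different tokens
before any lowercasing happens.  A sketch of one thing to try, assuming
that is the cause: lowercase first, then split.  Only the index analyzer is
shown; the query analyzer would need the same reordering (keeping its own
catenate settings), and existing data would need to be reindexed.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
  <!-- moved up: lowercase before splitting, so case changes no longer
       create token boundaries and only the underscores do -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0"/>
  <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

The trade-off is that a search for "data" alone would no longer match
"BinaryData", because the case change would stop acting as a word boundary.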

Any help or advice would be greatly appreciated, thanks!

-Reece
