A question on WordDelimiterFilterFactory

yandong yao Tue, 14 Sep 2010 02:40:58 -0700

Hi Guys,

I ran into a problem after enabling WordDelimiterFilterFactory for both the
index and query analyzers (the relevant part of schema.xml is pasted at the
bottom of this email).


*1. Steps to reproduce:*
    1.1 The indexed sample document contains only one sentence: "This is a
TechNote."
    1.2 The query is: q=TechNote
    1.3 Result: no matches are returned, even though the sentence above
clearly contains the word 'TechNote'.

*2. Output when enabling debugQuery*

Turning on debugQuery with the request
http://localhost:7111/solr/test/select?indent=on&version=2.2&q=TechNote&fq=&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=id%3A001&hl.fl=
returns the following information:

<str name="rawquerystring">TechNote</str>
<str name="querystring">TechNote</str>
<str name="parsedquery">PhraseQuery(all:"tech note")</str>
<str name="parsedquery_toString">all:"tech note"</str>
<lst name="explain"/>
<str name="otherQuery">id:001</str>
<lst name="explainOther">
<str name="001">
0.0 = fieldWeight(all:"tech note" in 0), product of:
  0.0 = tf(phraseFreq=0.0)
  0.61370564 = idf(all: tech=1 note=1)
  0.25 = fieldNorm(field=all, doc=0)
</str>
</lst>

It seems that the raw query string is converted to the phrase query
"tech note", and that the phrase frequency of this phrase in the document is
0, so nothing matches.
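
For anyone who wants to repeat the request, here is a minimal Python sketch
that assembles the same kind of debug URL. The host, port, and core path are
copied from the URL above; the helper name build_debug_url is made up for
illustration, and only the parameters that matter for the explanation are
included.

```python
from urllib.parse import urlencode

# Host, port and core path are taken from the URL quoted in this email.
BASE_URL = "http://localhost:7111/solr/test/select"


def build_debug_url(query, other_doc_id):
    """Build a select URL that asks Solr to explain a document that did not match.

    build_debug_url is a hypothetical helper, not part of Solr.
    """
    params = {
        "q": query,
        "rows": 0,                             # only the debug section is needed
        "fl": "*,score",
        "debugQuery": "on",                    # include parsedquery and explain output
        "explainOther": "id:" + other_doc_id,  # explain this document even if it is not a hit
    }
    return BASE_URL + "?" + urlencode(params)


url = build_debug_url("TechNote", "001")
print(url)
```

The explainOther parameter is what produces the "explainOther" block for
id:001 even though the document is not in the result list.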

*3. Result from admin/analysis.jsp page*

From analysis.jsp, it seems that the query 'TechNote' does match the input
document; see the terms highlighted in red below (rendered here as *tech* and
*note*).

Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position     1      2      3      4
  term text         This   is     a      TechNote.
  term type         word   word   word   word
  source start,end  0,4    5,7    8,9    10,19
  payload

org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=true, ignoreCase=true}
  term position     1      2      3      4
  term text         This   is     a      TechNote.
  term type         word   word   word   word
  source start,end  0,4    5,7    8,9    10,19
  payload

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=0,
catenateNumbers=1}
  term position     1      2      3      4      5
  term text         This   is     a      Tech   Note
                                                TechNote
  term type         word   word   word   word   word
                                                word
  source start,end  0,4    5,7    8,9    10,14  14,18
                                                10,18
  payload

org.apache.solr.analysis.LowerCaseFilterFactory {}
  term position     1      2      3      4      5
  term text         this   is     a      tech   note
                                                technote
  term type         word   word   word   word   word
                                                word
  source start,end  0,4    5,7    8,9    10,14  14,18
                                                10,18
  payload

org.apache.solr.analysis.SnowballPorterFilterFactory
{protected=protwords.txt, language=English}
  term position     1      2      3      4      5
  term text         this   is     a      *tech* *note*
                                                technot
  term type         word   word   word   word   word
                                                word
  source start,end  0,4    5,7    8,9    10,14  14,18
                                                10,18
  payload

Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
  term position     1
  term text         TechNote
  term type         word
  source start,end  0,8
  payload

org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=true, ignoreCase=true}
  term position     1
  term text         TechNote
  term type         word
  source start,end  0,8
  payload

org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1,
generateNumberParts=1, catenateWords=0, generateWordParts=1, catenateAll=0,
catenateNumbers=0}
  term position     1      2
  term text         Tech   Note
  term type         word   word
  source start,end  0,4    4,8
  payload

org.apache.solr.analysis.LowerCaseFilterFactory {}
  term position     1      2
  term text         tech   note
  term type         word   word
  source start,end  0,4    4,8
  payload

org.apache.solr.analysis.SnowballPorterFilterFactory
{protected=protwords.txt, language=English}
  term position     1      2
  term text         tech   note
  term type         word   word
  source start,end  0,4    4,8
  payload


*4. My questions are:*
    4.1: Why do debugQuery and analysis.jsp give different results?
    4.2: From my understanding, during indexing the word 'TechNote' will be
converted to 1) 'technote' and 2) 'tech note' according to my config in
schema.xml. At query time, 'TechNote' will be converted to 'tech note',
so it SHOULD match. Am I right?
    4.3: Why is the phrase frequency of 'tech note' 0 in the debugQuery
output (0.0 = tf(phraseFreq=0.0))?
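
To make the expectation in 4.2 concrete, here is a toy Python model of the
two analyzer chains and of positional phrase matching. It is only a sketch of
my understanding, not Solr's actual code: the function names are made up,
stemming and synonyms are left out, and the placement of the catenated term at
the position of the last word part is an assumption taken from the
analysis.jsp output in section 3.

```python
import re


def split_on_case_change(token):
    """Split at case changes and drop punctuation: 'TechNote.' -> ['Tech', 'Note']."""
    return re.findall(r"[A-Z]+[a-z]*|[a-z]+", token)


def analyze(text, catenate_words):
    """Return (term, position) pairs; positions start at 1 as on analysis.jsp."""
    tokens = []
    position = 0
    for raw in text.split():  # whitespace tokenizer
        parts = split_on_case_change(raw)
        for part in parts:
            position += 1
            tokens.append((part.lower(), position))
        if catenate_words and len(parts) > 1:
            # Assumption: the catenated term shares the position of the last
            # part, as the analysis.jsp output shows.
            tokens.append(("".join(parts).lower(), position))
    return tokens


def phrase_freq(index_tokens, query_terms):
    """Count start positions where the query terms occur at consecutive positions."""
    positions = {}
    for term, pos in index_tokens:
        positions.setdefault(term, set()).add(pos)
    starts = positions.get(query_terms[0], set())
    return sum(
        all(start + offset in positions.get(term, set())
            for offset, term in enumerate(query_terms))
        for start in starts
    )


index_tokens = analyze("This is a TechNote.", catenate_words=True)
query_terms = [term for term, _ in analyze("TechNote", catenate_words=False)]
print(index_tokens)
print(query_terms, phrase_freq(index_tokens, query_terms))
```

Under this model the phrase "tech note" is found once in the sample
document, which is why I expected a match. The real index evidently behaves
differently, and that difference is what I am asking about.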

Any suggestions or comments are very welcome!


*5. fieldType definition in schema.xml*

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0"
                splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
      </analyzer>
    </fieldType>


Thanks very much!
