WordDelimiterFilter => MultiPhraseQuery?

jOhn Wed, 19 Aug 2009 12:23:48 -0700

My issue is with the use of WordDelimiterFilter and how the QueryParser
(Dismax) converts the query into a MultiPhraseQuery.


This is on Solr 1.3 / Lucene 2.4.1.

For example:

1. yuma -> 3:10 to Yuma
2. yUma -> no results

For #2, the query gets split into y + uma and becomes a MultiPhraseQuery
that requires both terms, hence no results. I would rather it required
either term with a preference for both (or a preference for the joined
term, or at least an OR query).

1. joker-man -> Joker-Man Goes For Gold
2. joKerman -> no results
3. jo-kerman -> no results

1. prom night -> Prom Night
2. PromNight -> Prom Night
3. promnight -> no results
4. pRomnIght -> no results

Is there a way to configure this behavior?  I need to support all of the
above use cases.

I have a brute-force solution: a copyField into a field whose analyzer does
not use WordDelimiterFilter (whitespace tokenizer, lowercase, pattern
replace of punctuation, edge n-gram), and basically a 2nd field for this
(titleNameSubstring2) dropped into solrconfig.xml.  The two fields combined
are pretty much what I need, but that costs a memory hit + performance hit,
whereas some tuning to avoid the MultiPhraseQuery would be a better fit.
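
For reference, the brute-force field described above would look roughly
like the sketch below. This is an illustration rather than my exact config:
the fieldType name (textSubstringPlain), the copyField source (titleName)
and the punctuation regex are placeholders; only titleNameSubstring2 is the
real field name.

```xml
<!-- Sketch only: fieldType name, source field and regex are assumed. -->
<fieldType name="textSubstringPlain" class="solr.TextField"
           positionIncrementGap="100" omitNorms="true">
    <analyzer type="index">
        <!-- Split on whitespace only; no WordDelimiterFilter here -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Strip punctuation so "joker-man" is indexed as "jokerman" -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="\p{Punct}" replacement="" replace="all"/>
        <!-- Index leading substrings of each token -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
                maxGramSize="12"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="\p{Punct}" replacement="" replace="all"/>
    </analyzer>
</fieldType>

<field name="titleNameSubstring2" type="textSubstringPlain"
       indexed="true" stored="false"/>

<!-- "titleName" stands in for whatever field holds the raw title -->
<copyField source="titleName" dest="titleNameSubstring2"/>
```

The second field then gets listed in qf next to titleNameSubstring, which
is where the extra index size and query cost come from.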

Here are the schema.xml + solrconfig.xml bits that are not working.

[schema.xml]

        <fieldType name="textSubstring" class="solr.TextField"
                   positionIncrementGap="100" omitNorms="true">
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="1" generateNumberParts="1"
                        catenateWords="1" catenateNumbers="1" catenateAll="0"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                        words="stopwords.txt"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.SynonymFilterFactory"
                        synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.ISOLatin1AccentFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
                        maxGramSize="12"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                        generateWordParts="1" generateNumberParts="1"
                        catenateWords="1" catenateNumbers="1" catenateAll="0"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                        words="stopwords.txt"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.SynonymFilterFactory"
                        synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.ISOLatin1AccentFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>

[solrconfig.xml]

    <requestHandler name="stuff_title" class="solr.SearchHandler" >
        <lst name="invariants">
            <str name="defType">dismax</str>
            <str name="echoParams">explicit</str>
            <str name="sort">score desc</str>
            <str name="qf">
                titleNameSubstring^200.0
            </str>
            <str name="pf">
                titleNameSubstring^2.0
            </str>
            <str name="bf">
                product(releaseYear,0.1)
            </str>
            <str name="mm">1</str>
        </lst>
        <lst name="appends">
            <str name="fq">searchable:true</str>
        </lst>
    </requestHandler>

Any ideas?

-netcam
