Re: Need help with troublesome wildcard query

Erick Erickson Fri, 08 Jul 2011 10:00:09 -0700

Yeah, the analysis page takes a bit of getting used to, but it's well
worth the time. Be sure to check the "verbose" box. Taking some time
to understand what it's telling you is one of the best investments
you'll make.


Your "parts of words" requirement is the issue. One approach is to use
ngrams or edgengrams. Here's a writeup about edgengrams from Lucid:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

It's written for autosuggest, but you get the idea. If the partial match
might not start at the beginning of a word, then ngrams are a possibility.
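
As a rough sketch of the edge n-gram idea (not taken from the writeup
above): the field type below indexes every prefix of each word, so a plain
query for "techn" can match "technocrane" without any wildcards. The names
text_prefix and title_prefix and the gram sizes are assumptions chosen for
illustration only; adjust them to your own schema.

```xml
<!-- Hypothetical field type: the name "text_prefix" and the gram sizes are
     illustrative assumptions, not taken from the schema in this thread. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits t, te, tec, ... for each token. Swap in solr.NGramFilterFactory
         if matches must also be able to start in the middle of a word. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gram filter at query time: the partial word typed by the user
         is matched as-is against the indexed prefixes. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field and copyField; "title" is the existing field. -->
<field name="title_prefix" type="text_prefix" indexed="true" stored="false"/>
<copyField source="title" dest="title_prefix"/>
```

With something like this in place (and after a re-index), a query such as
title_prefix:(super AND techn) should be able to find "Super Technocrane 30".
The trade-off is a larger index, since every prefix becomes its own term.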

Your problem is one of those
conceptually-simple-but-annoyingly-difficult-to-implement
ones that takes far longer to fully understand/implement than
it seems like it should.

Best
Erick

On Fri, Jul 8, 2011 at 12:44 PM, Christopher Cato
<christopher.c...@minimedia.se> wrote:
> Hi Briggs, thanks for being patient with me!
>
> Yeah, I saw I had a typo there in the OR clause. Fixed it but still no 
> perfect results.
> I'm looking at the analysis.jsp page and can't really figure it out. Feeling 
> a bit overwhelmed by all the output. I also don't know how to check if 
> stemming is used for the title field.
>
> Theoretically, given the field type I'm using and also given that "super 
> technocrane 30" is the title of one of the docs - how would one write the 
> query so that it finds that doc if the user searches for "super techn" or 
> "super technocrane"? Right now it stops matching in the middle of the word 
> "technocrane" or rather after the "r".
>
> Darnit, I just want to return all docs that contain the search terms either 
> as whole words or parts of words.
> Is it possible?
>
> Regards,
> Christopher
>
> On 8 Jul 2011, at 16:57, Briggs Thompson wrote:
>
>> Hey Chris,
>> Removing the ORs in each query might help narrow down the problem, but I
>> suggest you run this through the query analyzer in order to see where it is
>> dropping out. It is a great tool for troubleshooting issues like these.
>>
>> I see a few things here.
>>
>>   - For leading wildcard queries, you should include the
>>   ReversedWildcardFilterFactory. Check out the documentation here:
>>   
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ReversedWildcardFilterFactory
>>   - Your result might get dropped because you are trying to do wildcard
>>   searches on a stemmed field. Wildcard searches on a stemmed field are
>>   counter-intuitive: if you index "computers", it may be stemmed to
>>   "comput", in which case a wildcard query of "computer*" would not match.
>>      - If you want to support stemming and wildcard searches, I suggest
>>      creating a copy field with an un-stemmed field type definition.
>>
>> Don't forget if you modify your field type definition, you need to
>> re-index.
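>>
>> A minimal sketch of the copy-field idea, under stated assumptions: the
>> names "text_unstemmed" and "title_unstemmed" are made up for illustration,
>> and the ReversedWildcardFilterFactory settings mirror the ones commonly
>> seen in the stock example schema rather than anything tuned for this index.
>> In this thread the stemmer presumably reduces "technocrane" to
>> "technocran", which would explain why adding the final "e" stops the match.
>>
>> ```xml
>> <!-- Hypothetical unstemmed type: like "text" but without the Snowball
>>      filter, so indexed terms keep their full spelling. -->
>> <fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <!-- Optional: also index reversed tokens to support leading wildcards. -->
>>     <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
>>             maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> <!-- Hypothetical field fed from the existing "title" field. -->
>> <field name="title_unstemmed" type="text_unstemmed" indexed="true" stored="false"/>
>> <copyField source="title" dest="title_unstemmed"/>
>> ```
>>
>> Wildcard queries would then target the unstemmed copy, for example
>> title_unstemmed:technocrane*, while ordinary searches stay on "title".
>> Wildcard terms are typically not run through the analyzer in Solr 3.x, so
>> the client probably needs to lowercase them itself.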
>>
>> In response to your question about text_ws, this is just a different field
>> type definition that essentially splits on whitespace. You should use that
>> if that is what the desired search logic is, but it probably isn't. Check
>> out the documentation on each of the tokenizers and filter factories in your
>> "text" field type and see what you need and what you don't to satisfy your
>> use cases.
>>
>> Hope that helps,
>> Briggs Thompson
>>
>>
>> On Fri, Jul 8, 2011 at 9:03 AM, Christopher Cato <
>> christopher.c...@minimedia.se> wrote:
>>
>>> Hi Briggs. Thanks for taking the time. I have the query nearly working now,
>>> currently this is how it looks when it matches on the title "Super
>>> Technocrane 30" and others with similar names:
>>>
>>> INFO: [] webapp=/solr path=/select/
>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocran*)+OR+(title:*super*+AND+*technocran)&qt=standard&fq=type:product+AND+language:sv}
>>> hits=3 status=0 QTime=1
>>>
>>> Adding another letter stops it matching:
>>>
>>> INFO: [] webapp=/solr path=/select/
>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocrane*)+OR+(title:*super*+AND+*technocrane)&qt=standard&fq=type:product+AND+language:sv}
>>> hits=0 status=0 QTime=0
>>>
>>> The field type definitions are as follows:
>>>
>>> <field name="title" type="text" indexed="true" stored="true"
>>> termVectors="true" omitNorms="true"/>
>>>
>>>   <fieldType name="text" class="solr.TextField"
>>> positionIncrementGap="100">
>>>     <analyzer type="index">
>>>       <charFilter class="solr.MappingCharFilterFactory"
>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <!-- in this example, we will only use synonyms at query time
>>>       <filter class="solr.SynonymFilterFactory"
>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>       -->
>>>       <!-- Case insensitive stop word removal.
>>>         add enablePositionIncrements=true in both the index and query
>>>         analyzers to leave a 'gap' for more accurate phrase queries.
>>>       -->
>>>       <filter class="solr.StopFilterFactory"
>>>               ignoreCase="true"
>>>               words="stopwords.txt"
>>>               enablePositionIncrements="true"
>>>               />
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>               generateWordParts="1"
>>>               generateNumberParts="1"
>>>               catenateWords="1"
>>>               catenateNumbers="1"
>>>               catenateAll="0"
>>>               splitOnCaseChange="1"
>>>               preserveOriginal="1"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>>> protected="protwords.txt"/>
>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>     </analyzer>
>>>     <analyzer type="query">
>>>       <charFilter class="solr.MappingCharFilterFactory"
>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>> ignoreCase="true" expand="true"/>
>>>       <filter class="solr.StopFilterFactory"
>>>               ignoreCase="true"
>>>               words="stopwords.txt"
>>>               enablePositionIncrements="true"
>>>               />
>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>               generateWordParts="1"
>>>               generateNumberParts="1"
>>>               catenateWords="0"
>>>               catenateNumbers="0"
>>>               catenateAll="0"
>>>               splitOnCaseChange="1"
>>>               preserveOriginal="1"/>
>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>>> protected="protwords.txt"/>
>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>     </analyzer>
>>>   </fieldType>
>>>
>>>
>>> There is also a type definition that is called text_ws, should I use that
>>> instead and change text to text_ws in the field definition for title?
>>>
>>>   <!-- A text field that only splits on whitespace for exact matching of
>>> words -->
>>>   <fieldType name="text_ws" class="solr.TextField"
>>> positionIncrementGap="100">
>>>     <analyzer>
>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     </analyzer>
>>>   </fieldType>
>>>
>>>
>>>
>>>
>>> Best regards,
>>>
>>> Christopher Cato
>>> Technical Manager
>>> -----------------------------------
>>> MiniMedia
>>> Phone: +46761927603
>>> www.minimedia.se
>>>
>>> On 7 Jul 2011, at 23:16, Briggs Thompson wrote:
>>>
>>>> Hello Christopher,
>>>>
>>>> Can you provide the exact query sent to Solr for the one word query and also
>>>> the two word query? The field type definition for your title field would be
>>>> useful too.
>>>>
>>>> From what I understand, Solr should be able to handle your use case. I am
>>>> guessing it is a problem with how the field is defined, assuming the query is
>>>> correct.
>>>>
>>>> Briggs Thompson
>>>>
>>>> On Thu, Jul 7, 2011 at 12:22 PM, Christopher Cato <
>>>> christopher.c...@minimedia.se> wrote:
>>>>
>>>>> Hi, I'm running Solr 3.2 with edismax under Tomcat 6 via Drupal.
>>>>>
>>>>> I'm having some problems writing a query that matches a specific field on
>>>>> several words. I have implemented an AJAX search that basically takes
>>>>> whatever is in a form field and attempts to match documents. I'm not having
>>>>> much luck though. The first word always matches correctly, but as soon as I
>>>>> enter the second word I'm losing matches, and the third word doesn't give
>>>>> any matches at all.
>>>>>
>>>>> The title field that I'm searching contains a product name that may or may
>>>>> not have several words.
>>>>>
>>>>> The requirement is that the search should be progressive, i.e. as the user
>>>>> inputs words I should always return results that contain all of the words
>>>>> entered. I also have to correct bad input like an erroneous space in the
>>>>> product name, e.g. "product name" instead of "productname".
>>>>>
>>>>> I'm wondering if there isn't an easier way to query Solr? Ideally I'd want
>>>>> to say "give me all docs that have the following text in their titles". Is
>>>>> that possible?
>>>>>
>>>>>
>>>>> I'd really appreciate any help!
>>>>>
>>>>>
>>>>> Regards,
>>>>> Christopher Cato
>>>
>>>
>
>
