Re: Need help with troublesome wildcard query

Erick Erickson Fri, 08 Jul 2011 12:16:39 -0700

Nope, that should do it (although I haven't tried that
exact set of steps). But you do have to reindex
from scratch....
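
As a rough sketch of those steps (not the declaration from the blog post),
the schema.xml additions might look something like the following. The names
text_edge and title_edge are hypothetical placeholders, and the gram sizes
are illustrative guesses rather than tuned values:

```xml
<!-- Hypothetical field type: text_edge is a placeholder name. -->
<fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index the prefixes of each word: "technocrane" -> "te", "tec", "tech", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gramming at query time: the typed text is matched as-is. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field, filled from the existing title field. -->
<field name="title_edge" type="text_edge" indexed="true" stored="false"/>
<copyField source="title" dest="title_edge"/>
```

With something like that in place, a plain query such as
title_edge:(super AND techn) should match "super technocrane 30" without
any wildcards, because both "super" and "techn" were indexed as prefixes.
One caveat, as far as I know: tokens shorter than minGramSize produce no
grams, so a one-character word would not be searchable in this field.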



Best
Erick

On Fri, Jul 8, 2011 at 1:36 PM, Christopher Cato
<christopher.c...@minimedia.se> wrote:
> Thanks for that pointer, that's really more what I want to do. And actually, 
> EdgeNGrams is stuck somewhere in the back of my head :) Yes, simple at first 
> thought but not as easy to implement as I have discovered.
>
> Well, so how do I implement something like this? I took the fieldtype 
> declaration from that blog post, added it to my schema.xml within the 
> fieldtypes part.
>
> So, if I get it all correctly, all I have to do now is to add a new field 
> with newly added fieldtype, a copyfield from the original title field, change 
> the query to use the new field and restart / reindex. Or am I missing 
> something?
>
> //Christopher
>
>
> On 8 Jul 2011, at 18:59, Erick Erickson wrote:
>
>> Yeah, the analysis page takes a bit of getting used to, but it's well
>> worth the time. Be sure to check the "verbose" box. Taking some time
>> to understand what it's telling you is one of the best investments
>> you'll make.
>>
>> Your "parts of words" is the issue. One approach is to use ngrams or
>> edgengrams. Here's a writeup about edgengrams from Lucid:
>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>>
>> It's written for autosuggest, but you get the idea. If the partial match
>> may fall somewhere other than the start of a word, then ngrams are a possibility....
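>>
>> A minimal sketch of the ngram variant, assuming a hypothetical field type
>> name (text_ngram) and illustrative gram sizes:
>>
>> ```xml
>> <!-- Hypothetical field type: name and gram sizes are placeholders. -->
>> <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <!-- Index every substring of 3 to 15 characters, not just prefixes. -->
>>     <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>> ```
>>
>> The trade-off is index size: full ngrams produce many more terms per word
>> than edge ngrams. Query terms longer than maxGramSize would also fail to
>> match in this setup, since no gram that long is indexed.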
>>
>> Your problem is one of those
>> conceptually-simple-but-annoyingly-difficult-to-implement
>> ones that takes far longer to fully understand/implement than
>> it seems like it should.
>>
>> Best
>> Erick
>>
>> On Fri, Jul 8, 2011 at 12:44 PM, Christopher Cato
>> <christopher.c...@minimedia.se> wrote:
>>> Hi Briggs, thanks for being patient with me!
>>>
>>> Yeah, I saw I had a typo there in the OR clause. Fixed it but still no 
>>> perfect results.
>>> I'm looking at the analysis.jsp page and can't really figure it out. 
>>> Feeling a bit overwhelmed by all the output. I also don't know how to check 
>>> if stemming is used for the title field.
>>>
>>> Theoretically, given the field type I'm using and also given that "super 
>>> technocrane 30" is the title of one of the docs - how would one write the 
>>> query so that it finds that doc if the user searches for "super techn" or 
>>> "super technocrane"? Right now it stops matching in the middle of the word 
>>> "technocrane" or rather after the "r".
>>>
>>> Darnit, I just want to return all docs that contain the search terms either 
>>> as whole words or parts of words.
>>> Is it possible?
>>>
>>> Regards,
>>> Christopher
>>>
>>> On 8 Jul 2011, at 16:57, Briggs Thompson wrote:
>>>
>>>> Hey Chris,
>>>> Removing the ORs in each query might help narrow down the problem, but I
>>>> suggest you run this through the query analyzer in order to see where it is
>>>> dropping out. It is a great tool for troubleshooting issues like these.
>>>>
>>>> I see a few things here.
>>>>
>>>>   - For leading wildcard queries, you should include the
>>>>   ReversedWildcardFilterFactory. Check out the documentation here:
>>>>   http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ReversedWildcardFilterFactory
>>>>   - Your results might get dropped because you are trying to do wildcard
>>>>   searches on a stemmed field. Wildcard searches on a stemmed field are
>>>>   counter-intuitive: wildcard terms are not analyzed, so if you index
>>>>   "computers" it may stem to "comput", in which case a wildcard query of
>>>>   "computer*" would not match.
>>>>      - If you want to support stemming and wildcard searches, I suggest
>>>>      creating a copy field with an un-stemmed field type definition.
>>>>
>>>> Don't forget if you modify your field type definition, you need to
>>>> re-index.
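>>>>
>>>> A minimal sketch of such an un-stemmed copy field, assuming hypothetical
>>>> names (text_unstemmed, title_unstemmed). The ReversedWildcardFilterFactory
>>>> attribute values are the ones used in the Solr example schema, not values
>>>> tuned for this index:
>>>>
>>>> ```xml
>>>> <!-- Hypothetical field type: no stemmer, so indexed terms keep their full form. -->
>>>> <fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
>>>>   <analyzer type="index">
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <!-- Also index reversed tokens to support leading-wildcard queries. -->
>>>>     <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
>>>>             maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> <!-- Hypothetical field, filled from the existing stemmed title field. -->
>>>> <field name="title_unstemmed" type="text_unstemmed" indexed="true" stored="false"/>
>>>> <copyField source="title" dest="title_unstemmed"/>
>>>> ```
>>>>
>>>> Wildcard queries would then target title_unstemmed while regular searches
>>>> keep using title. Since wildcard terms bypass the analyzer, the client
>>>> would need to lowercase them itself.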
>>>>
>>>> In response to your question about text_ws, this is just a different field
>>>> type definition that essentially splits on whitespace. You should use that
>>>> if that is the desired search logic, but it probably isn't. Check
>>>> out the documentation on each of the tokenizers and filter factories in
>>>> your "text" field type and see what you need and what you don't to satisfy
>>>> your use cases.
>>>>
>>>> Hope that helps,
>>>> Briggs Thompson
>>>>
>>>>
>>>> On Fri, Jul 8, 2011 at 9:03 AM, Christopher Cato <
>>>> christopher.c...@minimedia.se> wrote:
>>>>
>>>>> Hi Briggs. Thanks for taking the time. I have the query nearly working now;
>>>>> currently this is how it looks when it matches on the title "Super
>>>>> Technocrane 30" and others with similar names:
>>>>>
>>>>> INFO: [] webapp=/solr path=/select/
>>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocran*)+OR+(title:*super*+AND+*technocran)&qt=standard&fq=type:product+AND+language:sv}
>>>>> hits=3 status=0 QTime=1
>>>>>
>>>>> Adding another letter stops it matching:
>>>>>
>>>>> INFO: [] webapp=/solr path=/select/
>>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocrane*)+OR+(title:*super*+AND+*technocrane)&qt=standard&fq=type:product+AND+language:sv}
>>>>> hits=0 status=0 QTime=0
>>>>>
>>>>> The field type definitions are as follows:
>>>>>
>>>>> <field name="title" type="text" indexed="true" stored="true"
>>>>> termVectors="true" omitNorms="true"/>
>>>>>
>>>>>   <fieldType name="text" class="solr.TextField"
>>>>> positionIncrementGap="100">
>>>>>     <analyzer type="index">
>>>>>       <charFilter class="solr.MappingCharFilterFactory"
>>>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>       <!-- in this example, we will only use synonyms at query time
>>>>>       <filter class="solr.SynonymFilterFactory"
>>>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>>       -->
>>>>>       <!-- Case insensitive stop word removal.
>>>>>         add enablePositionIncrements=true in both the index and query
>>>>>         analyzers to leave a 'gap' for more accurate phrase queries.
>>>>>       -->
>>>>>       <filter class="solr.StopFilterFactory"
>>>>>               ignoreCase="true"
>>>>>               words="stopwords.txt"
>>>>>               enablePositionIncrements="true"
>>>>>               />
>>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>>>               generateWordParts="1"
>>>>>               generateNumberParts="1"
>>>>>               catenateWords="1"
>>>>>               catenateNumbers="1"
>>>>>               catenateAll="0"
>>>>>               splitOnCaseChange="1"
>>>>>               preserveOriginal="1"/>
>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>> protected="protwords.txt"/>
>>>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>     </analyzer>
>>>>>     <analyzer type="query">
>>>>>       <charFilter class="solr.MappingCharFilterFactory"
>>>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>> ignoreCase="true" expand="true"/>
>>>>>       <filter class="solr.StopFilterFactory"
>>>>>               ignoreCase="true"
>>>>>               words="stopwords.txt"
>>>>>               enablePositionIncrements="true"
>>>>>               />
>>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>>>               generateWordParts="1"
>>>>>               generateNumberParts="1"
>>>>>               catenateWords="0"
>>>>>               catenateNumbers="0"
>>>>>               catenateAll="0"
>>>>>               splitOnCaseChange="1"
>>>>>               preserveOriginal="1"/>
>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>> protected="protwords.txt"/>
>>>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>     </analyzer>
>>>>>   </fieldType>
>>>>>
>>>>>
>>>>> There is also a type definition that is called text_ws, should I use that
>>>>> instead and change text to text_ws in the field definition for title?
>>>>>
>>>>>   <!-- A text field that only splits on whitespace for exact matching of
>>>>> words -->
>>>>>   <fieldType name="text_ws" class="solr.TextField"
>>>>> positionIncrementGap="100">
>>>>>     <analyzer>
>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>     </analyzer>
>>>>>   </fieldType>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Christopher Cato
>>>>> Head of Technology
>>>>> -----------------------------------
>>>>> MiniMedia
>>>>> Phone: +46761927603
>>>>> www.minimedia.se
>>>>>
>>>>> On 7 Jul 2011, at 23:16, Briggs Thompson wrote:
>>>>>
>>>>>> Hello Christopher,
>>>>>>
>>>>>> Can you provide the exact query sent to Solr for the one word query and
>>>>>> also the two word query? The field type definition for your title field
>>>>>> would be useful too.
>>>>>>
>>>>>> From what I understand, Solr should be able to handle your use case. I am
>>>>>> guessing it is a problem with how the field is defined, assuming the query
>>>>>> is correct.
>>>>>>
>>>>>> Briggs Thompson
>>>>>>
>>>>>> On Thu, Jul 7, 2011 at 12:22 PM, Christopher Cato <
>>>>>> christopher.c...@minimedia.se> wrote:
>>>>>>
>>>>>>> Hi, I'm running Solr 3.2 with edismax under Tomcat 6 via Drupal.
>>>>>>>
>>>>>>> I'm having some problems writing a query that matches a specific field on
>>>>>>> several words. I have implemented an AJAX search that basically takes
>>>>>>> whatever is in a form field and attempts to match documents. I'm not having
>>>>>>> much luck though. The first word always matches correctly, but as soon as I
>>>>>>> enter the second word I'm losing matches, and the third word doesn't give
>>>>>>> any matches at all.
>>>>>>>
>>>>>>> The title field that I'm searching contains a product name that may or may
>>>>>>> not have several words.
>>>>>>>
>>>>>>> The requirement is that the search should be progressive, i.e. as the user
>>>>>>> inputs words I should always return results that contain all of the words
>>>>>>> entered. I also have to correct bad input like an erroneous space in the
>>>>>>> product name, e.g. "product name" instead of "productname".
>>>>>>>
>>>>>>> I'm wondering if there isn't an easier way to query Solr? Ideally I'd want
>>>>>>> to say "give me all docs that have the following text in their titles". Is
>>>>>>> that possible?
>>>>>>>
>>>>>>>
>>>>>>> I'd really appreciate any help!
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christopher Cato
>>>>>
>>>>>
>>>
>>>
>
>
