Re: Need help with troublesome wildcard query

Christopher Cato Fri, 08 Jul 2011 10:37:01 -0700

Thanks for that pointer; that's much closer to what I want to do. And actually, 
EdgeNGrams was stuck somewhere in the back of my head :) Yes, simple at first 
thought, but not as easy to implement, as I have discovered.


Well, so how do I implement something like this? I took the fieldtype 
declaration from that blog post, added it to my schema.xml within the 
fieldtypes part.
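
For reference, an edge n-gram field type along those lines might look roughly 
like the sketch below. This is not the exact declaration from the blog post; 
the type name "text_prefix" and the gram sizes are assumptions chosen for 
illustration. The grams are produced at index time only, so a query term such 
as "techn" is compared as-is against the indexed prefixes, and there is no 
stemmer in the chain to alter either side.

  <!-- Hypothetical type name; gram sizes are illustrative assumptions -->
  <fieldType name="text_prefix" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- Index the leading prefixes of each word: te, tec, tech, ... -->
      <filter class="solr.EdgeNGramFilterFactory"
              minGramSize="2" maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With these sizes, a query term shorter than 2 or longer than 15 characters 
would have no indexed gram to match, so the limits should be tuned to the 
actual product names.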

So, if I understand it all correctly, all I have to do now is add a new field 
with the newly added fieldtype, add a copyField from the original title field, 
change the query to use the new field, and restart / reindex. Or am I missing 
something?
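
Those steps might look roughly like this in schema.xml, assuming the new type 
is called "text_prefix" and the new field "title_prefix" (both names are 
hypothetical, not taken from the blog post):

  <!-- Hypothetical field name; not stored, since it is only searched -->
  <field name="title_prefix" type="text_prefix" indexed="true"
         stored="false"/>

  <!-- Copy the raw title text into the new field at index time -->
  <copyField source="title" dest="title_prefix"/>

The query would then target the new field with plain terms instead of 
wildcards, for example q=title_prefix:(super AND techn), after a restart and a 
full reindex so that existing documents get the new field populated.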

//Christopher


On 8 Jul 2011, at 18:59, Erick Erickson wrote:

> Yeah, the analysis page takes a bit of getting used to, but it's well
> worth the time. Be sure to check the "verbose" box. Taking some time
> to understand what it's telling you is one of the best investments
> you'll make.
> 
> Your "parts of words" is the issue. One approach is to use ngrams or
> edgengrams. Here's a writeup about edgengrams from Lucid:
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
> 
> It's written for autosuggest, but you get the idea. If the "partial" words
> might not be at the start of a word, then ngrams are a possibility....
> 
> Your problem is one of those
> conceptually-simple-but-annoyingly-difficult-to-implement
> ones that takes far longer to fully understand/implement than
> it seems like it should.
> 
> Best
> Erick
> 
> On Fri, Jul 8, 2011 at 12:44 PM, Christopher Cato
> <[email protected]> wrote:
>> Hi Briggs, thanks for being patient with me!
>> 
>> Yeah, I saw I had a typo there in the OR clause. Fixed it but still no 
>> perfect results.
>> I'm looking at the analysis.jsp page and can't really figure it out. Feeling 
>> a bit overwhelmed by all the output. I also don't know how to check if 
>> stemming is used for the title field.
>> 
>> Theoretically, given the field type I'm using and also given that "super 
>> technocrane 30" is the title of one of the docs - how would one write the 
>> query so that it finds that doc if the user searches for "super techn" or 
>> "super technocrane"? Right now it stops matching in the middle of the word 
>> "technocrane" or rather after the "r".
>> 
>> Darnit, I just want to return all docs that contain the search terms either 
>> as whole words or parts of words.
>> Is it possible?
>> 
>> Regards,
>> Christopher
>> 
>> On 8 Jul 2011, at 16:57, Briggs Thompson wrote:
>> 
>>> Hey Chris,
>>> Removing the ORs in each query might help narrow down the problem, but I
>>> suggest you run this through the query analyzer in order to see where it is
>>> dropping out. It is a great tool for troubleshooting issues like these.
>>> 
>>> I see a few things here.
>>> 
>>>   - For leading wildcard queries, you should include the
>>>   ReversedWildcardFilterFactory. Check out the documentation here:
>>>   http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ReversedWildcardFilterFactory
>>>   - Your result might get dropped because you are trying to do wildcard
>>>   searches on a stemmed field. Wildcard searches on a stemmed field are
>>>   counter-intuitive: if you index "computers", it may be stemmed to "comput",
>>>   in which case a wildcard query of "computer*" would not match.
>>>      - If you want to support stemming and wildcard searches, I suggest
>>>      creating a copy field with an un-stemmed field type definition.
>>> 
>>> Don't forget if you modify your field type definition, you need to
>>> re-index.
>>> 
>>> In response to your question about text_ws, this is just a different field
>>> type definition that essentially splits on whitespace. You should use that
>>> if that is what the desired search logic is, but it probably isn't. Check
>>> out the documentation on each of the tokenizers and filter factories in your
>>> "text" field type and see what you need and what you don't to satisfy your
>>> use cases.
>>> 
>>> Hope that helps,
>>> Briggs Thompson
>>> 
>>> 
>>> On Fri, Jul 8, 2011 at 9:03 AM, Christopher Cato <
>>> [email protected]> wrote:
>>> 
>>>> Hi Briggs. Thanks for taking the time. I have the query nearly working now,
>>>> currently this is how it looks when it matches on the title "Super
>>>> Technocrane 30" and others with similar names:
>>>> 
>>>> INFO: [] webapp=/solr path=/select/
>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocran*)+OR+(title:*super*+AND+*technocran)&qt=standard&fq=type:product+AND+language:sv}
>>>> hits=3 status=0 QTime=1
>>>> 
>>>> Adding another letter stops it matching:
>>>> 
>>>> INFO: [] webapp=/solr path=/select/
>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocrane*)+OR+(title:*super*+AND+*technocrane)&qt=standard&fq=type:product+AND+language:sv}
>>>> hits=0 status=0 QTime=0
>>>> 
>>>> The field type definitions are as follows:
>>>> 
>>>> <field name="title" type="text" indexed="true" stored="true"
>>>> termVectors="true" omitNorms="true"/>
>>>> 
>>>>   <fieldType name="text" class="solr.TextField"
>>>> positionIncrementGap="100">
>>>>     <analyzer type="index">
>>>>       <charFilter class="solr.MappingCharFilterFactory"
>>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>       <!-- in this example, we will only use synonyms at query time
>>>>       <filter class="solr.SynonymFilterFactory"
>>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>       -->
>>>>       <!-- Case insensitive stop word removal.
>>>>         add enablePositionIncrements=true in both the index and query
>>>>         analyzers to leave a 'gap' for more accurate phrase queries.
>>>>       -->
>>>>       <filter class="solr.StopFilterFactory"
>>>>               ignoreCase="true"
>>>>               words="stopwords.txt"
>>>>               enablePositionIncrements="true"
>>>>               />
>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>>               generateWordParts="1"
>>>>               generateNumberParts="1"
>>>>               catenateWords="1"
>>>>               catenateNumbers="1"
>>>>               catenateAll="0"
>>>>               splitOnCaseChange="1"
>>>>               preserveOriginal="1"/>
>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>> protected="protwords.txt"/>
>>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>     </analyzer>
>>>>     <analyzer type="query">
>>>>       <charFilter class="solr.MappingCharFilterFactory"
>>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>> ignoreCase="true" expand="true"/>
>>>>       <filter class="solr.StopFilterFactory"
>>>>               ignoreCase="true"
>>>>               words="stopwords.txt"
>>>>               enablePositionIncrements="true"
>>>>               />
>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>>               generateWordParts="1"
>>>>               generateNumberParts="1"
>>>>               catenateWords="0"
>>>>               catenateNumbers="0"
>>>>               catenateAll="0"
>>>>               splitOnCaseChange="1"
>>>>               preserveOriginal="1"/>
>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>> protected="protwords.txt"/>
>>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>     </analyzer>
>>>>   </fieldType>
>>>> 
>>>> 
>>>> There is also a type definition that is called text_ws, should I use that
>>>> instead and change text to text_ws in the field definition for title?
>>>> 
>>>>   <!-- A text field that only splits on whitespace for exact matching of
>>>> words -->
>>>>   <fieldType name="text_ws" class="solr.TextField"
>>>> positionIncrementGap="100">
>>>>     <analyzer>
>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     </analyzer>
>>>>   </fieldType>
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Best regards,
>>>> 
>>>> Christopher Cato
>>>> Technical Manager
>>>> -----------------------------------
>>>> MiniMedia
>>>> Phone: +46761927603
>>>> www.minimedia.se
>>>> 
>>>> On 7 Jul 2011, at 23:16, Briggs Thompson wrote:
>>>> 
>>>>> Hello Christopher,
>>>>> 
>>>>> Can you provide the exact query sent to Solr for the one word query and also
>>>>> the two word query? The field type definition for your title field would be
>>>>> useful too.
>>>>> 
>>>>> From what I understand, Solr should be able to handle your use case. I am
>>>>> guessing it is a problem with how the field is defined, assuming the query is
>>>>> correct.
>>>>> 
>>>>> Briggs Thompson
>>>>> 
>>>>> On Thu, Jul 7, 2011 at 12:22 PM, Christopher Cato <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> Hi, I'm running Solr 3.2 with edismax under Tomcat 6 via Drupal.
>>>>>> 
>>>>>> I'm having some problems writing a query that matches a specific field on
>>>>>> several words. I have implemented an AJAX search that basically takes
>>>>>> whatever is in a form field and attempts to match documents. I'm not having
>>>>>> much luck though. The first word always matches correctly, but as soon as I
>>>>>> enter the second word I'm losing matches, and the third word doesn't give
>>>>>> any matches at all.
>>>>>> 
>>>>>> The title field that I'm searching contains a product name that may or may
>>>>>> not have several words.
>>>>>> 
>>>>>> The requirement is that the search should be progressive, i.e. as the user
>>>>>> inputs words I should always return results that contain all of the words
>>>>>> entered. I also have to correct bad input like an erroneous space in the
>>>>>> product name, e.g. "product name" instead of "productname".
>>>>>> 
>>>>>> I'm wondering if there isn't an easier way to query Solr? Ideally I'd want
>>>>>> to say "give me all docs that have the following text in their titles". Is
>>>>>> that possible?
>>>>>> 
>>>>>> 
>>>>>> I'd really appreciate any help!
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Christopher Cato
>>>> 
>>>> 
>> 
>>
