Re: Need help with troublesome wildcard query

Christopher Cato Fri, 08 Jul 2011 13:20:22 -0700

And don't you know, that EdgeNGram analyzer did the trick. Added the fieldtype, 
added a new field based on it, copyfielded the old title to it, reindexed and 
hey - it works brilliantly :)
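
For anyone finding this in the archives, here is a minimal sketch of what those steps might look like in schema.xml. The names text_edge and title_edge and the gram sizes are placeholders assumed for illustration; this is not the exact declaration from the blog post.

```xml
<!-- Hypothetical field type: edge n-grams at index time, plain tokens at query time -->
<fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "technocrane" is indexed as te, tec, tech, ... technocrane -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field, populated from the existing title field -->
<field name="title_edge" type="text_edge" indexed="true" stored="false"/>
<copyField source="title" dest="title_edge"/>
```

With a field like this the query needs no wildcards: something like title_edge:(super AND techn) should match "Super Technocrane 30", because both "super" and "techn" exist as indexed grams. The new field stays empty until everything is reindexed.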


And you were right, the analysis output does make sense once it actually 
matches something :D

Thanks a million!


Best regards

Christopher Cato
Technical Manager
-----------------------------------
MiniMedia
Phone: +46761927603
www.minimedia.se

On 8 Jul 2011, at 21:16, Erick Erickson wrote:

> Nope, that should do it (although I haven't tried that
> exact set of steps). But you do have to reindex
> from scratch....
> 
> 
> Best
> Erick
> 
> On Fri, Jul 8, 2011 at 1:36 PM, Christopher Cato
> <christopher.c...@minimedia.se> wrote:
>> Thanks for that pointer, that's really more what I want to do. And actually, 
>> EdgeNGrams is stuck somewhere in the back of my head :) Yes, simple at first 
>> thought but not as easy to implement as I have discovered.
>> 
>> Well, so how do I implement something like this? I took the fieldtype 
>> declaration from that blog post, added it to my schema.xml within the 
>> fieldtypes part.
>> 
>> So, if I get it all correctly, all I have to do now is to add a new field 
>> with the newly added fieldtype, add a copyfield from the original title field, 
>> change the query to use the new field and restart / reindex. Or am I missing 
>> something?
>> 
>> //Christopher
>> 
>> 
>> On 8 Jul 2011, at 18:59, Erick Erickson wrote:
>> 
>>> Yeah, the analysis page takes a bit of getting used to, but it's well
>>> worth the time. Be sure to check the "verbose" box. Taking some time
>>> to understand what it's telling you is one of the best investments
>>> you'll make.
>>> 
>>> Your "parts of words" is the issue. One approach is to use ngrams or
>>> edgengrams. Here's a writeup about edgengrams from Lucid:
>>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>>> 
>>> It's written for autosuggest, but you get the idea. If the partial match
>>> can fall somewhere other than the start of a word, then ngrams are a
>>> possibility....
>>> 
>>> Your problem is one of those
>>> conceptually-simple-but-annoyingly-difficult-to-implement
>>> ones that takes far longer to fully understand/implement than
>>> it seems like it should.
>>> 
>>> Best
>>> Erick
>>> 
>>> On Fri, Jul 8, 2011 at 12:44 PM, Christopher Cato
>>> <christopher.c...@minimedia.se> wrote:
>>>> Hi Briggs, thanks for being patient with me!
>>>> 
>>>> Yeah, I saw I had a typo there in the OR clause. Fixed it but still no 
>>>> perfect results.
>>>> I'm looking at the analysis.jsp page and can't really figure it out. 
>>>> Feeling a bit overwhelmed by all the output. I also don't know how to 
>>>> check if stemming is used for the title field.
>>>> 
>>>> Theoretically, given the field type I'm using and also given that "super 
>>>> technocrane 30" is the title of one of the docs - how would one write the 
>>>> query so that it finds that doc if the user searches for "super techn" or 
>>>> "super technocrane"? Right now it stops matching in the middle of the word 
>>>> "technocrane" or rather after the "r".
>>>> 
>>>> Darnit, I just want to return all docs that contain the search terms 
>>>> either as whole words or parts of words.
>>>> Is it possible?
>>>> 
>>>> Regards,
>>>> Christopher
>>>> 
>>>> On 8 Jul 2011, at 16:57, Briggs Thompson wrote:
>>>> 
>>>>> Hey Chris,
>>>>> Removing the ORs in each query might help narrow down the problem, but I
>>>>> suggest you run this through the query analyzer in order to see where it
>>>>> is dropping out. It is a great tool for troubleshooting issues like these.
>>>>> 
>>>>> I see a few things here.
>>>>> 
>>>>>   - For leading wildcard queries, you should include the
>>>>>   ReversedWildcardFilterFactory. Check out the documentation here:
>>>>>   http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ReversedWildcardFilterFactory
>>>>>   - Your result might get dropped because you are trying to do wildcard
>>>>>   searches on a stemmed field. Wildcard searches on a stemmed field are
>>>>>   counter-intuitive: if you index "computers", it may stem to "comput",
>>>>>   in which case a wildcard query of "computer*" would not match.
>>>>>      - If you want to support stemming and wildcard searches, I suggest
>>>>>      creating a copy field with an un-stemmed field type definition.
>>>>> 
>>>>> Don't forget if you modify your field type definition, you need to
>>>>> re-index.
>>>>> 
>>>>> In response to your question about text_ws, this is just a different field
>>>>> type definition that essentially splits on whitespace. You should use that
>>>>> if that is what the desired search logic is, but it probably isn't. Check
>>>>> out the documentation on each of the tokenizers and filter factories in your
>>>>> "text" field type and see what you need and what you don't to satisfy your
>>>>> use cases.
>>>>> 
>>>>> Hope that helps,
>>>>> Briggs Thompson
>>>>> 
>>>>> 
>>>>> On Fri, Jul 8, 2011 at 9:03 AM, Christopher Cato <
>>>>> christopher.c...@minimedia.se> wrote:
>>>>> 
>>>>>> Hi Briggs. Thanks for taking the time. I have the query nearly working now,
>>>>>> currently this is how it looks when it matches on the title "Super
>>>>>> Technocrane 30" and others with similar names:
>>>>>> 
>>>>>> INFO: [] webapp=/solr path=/select/
>>>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocran*)+OR+(title:*super*+AND+*technocran)&qt=standard&fq=type:product+AND+language:sv}
>>>>>> hits=3 status=0 QTime=1
>>>>>> 
>>>>>> Adding another letter stops it matching:
>>>>>> 
>>>>>> INFO: [] webapp=/solr path=/select/
>>>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocrane*)+OR+(title:*super*+AND+*technocrane)&qt=standard&fq=type:product+AND+language:sv}
>>>>>> hits=0 status=0 QTime=0
>>>>>> 
>>>>>> The field type definitions are as follows:
>>>>>> 
>>>>>> <field name="title" type="text" indexed="true" stored="true"
>>>>>> termVectors="true" omitNorms="true"/>
>>>>>> 
>>>>>>   <fieldType name="text" class="solr.TextField"
>>>>>> positionIncrementGap="100">
>>>>>>     <analyzer type="index">
>>>>>>       <charFilter class="solr.MappingCharFilterFactory"
>>>>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>       <!-- in this example, we will only use synonyms at query time
>>>>>>       <filter class="solr.SynonymFilterFactory"
>>>>>> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>>>       -->
>>>>>>       <!-- Case insensitive stop word removal.
>>>>>>         add enablePositionIncrements=true in both the index and query
>>>>>>         analyzers to leave a 'gap' for more accurate phrase queries.
>>>>>>       -->
>>>>>>       <filter class="solr.StopFilterFactory"
>>>>>>               ignoreCase="true"
>>>>>>               words="stopwords.txt"
>>>>>>               enablePositionIncrements="true"
>>>>>>               />
>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>>>>               generateWordParts="1"
>>>>>>               generateNumberParts="1"
>>>>>>               catenateWords="1"
>>>>>>               catenateNumbers="1"
>>>>>>               catenateAll="0"
>>>>>>               splitOnCaseChange="1"
>>>>>>               preserveOriginal="1"/>
>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>> protected="protwords.txt"/>
>>>>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>>     </analyzer>
>>>>>>     <analyzer type="query">
>>>>>>       <charFilter class="solr.MappingCharFilterFactory"
>>>>>> mapping="mapping-ISOLatin1Accent.txt"/>
>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>> ignoreCase="true" expand="true"/>
>>>>>>       <filter class="solr.StopFilterFactory"
>>>>>>               ignoreCase="true"
>>>>>>               words="stopwords.txt"
>>>>>>               enablePositionIncrements="true"
>>>>>>               />
>>>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>>>>               generateWordParts="1"
>>>>>>               generateNumberParts="1"
>>>>>>               catenateWords="0"
>>>>>>               catenateNumbers="0"
>>>>>>               catenateAll="0"
>>>>>>               splitOnCaseChange="1"
>>>>>>               preserveOriginal="1"/>
>>>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>> protected="protwords.txt"/>
>>>>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>>     </analyzer>
>>>>>>   </fieldType>
>>>>>> 
>>>>>> 
>>>>>> There is also a type definition that is called text_ws, should I use that
>>>>>> instead and change text to text_ws in the field definition for title?
>>>>>> 
>>>>>>   <!-- A text field that only splits on whitespace for exact matching of
>>>>>> words -->
>>>>>>   <fieldType name="text_ws" class="solr.TextField"
>>>>>> positionIncrementGap="100">
>>>>>>     <analyzer>
>>>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>     </analyzer>
>>>>>>   </fieldType>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Best regards
>>>>>> 
>>>>>> Christopher Cato
>>>>>> Technical Manager
>>>>>> -----------------------------------
>>>>>> MiniMedia
>>>>>> Phone: +46761927603
>>>>>> www.minimedia.se
>>>>>> 
>>>>>> On 7 Jul 2011, at 23:16, Briggs Thompson wrote:
>>>>>> 
>>>>>>> Hello Christopher,
>>>>>>> 
>>>>>>> Can you provide the exact query sent to Solr for the one word query and also
>>>>>>> the two word query? The field type definition for your title field would be
>>>>>>> useful too.
>>>>>>> 
>>>>>>> From what I understand, Solr should be able to handle your use case. I am
>>>>>>> guessing it is a problem with how the field is defined, assuming the query is
>>>>>>> correct.
>>>>>>> 
>>>>>>> Briggs Thompson
>>>>>>> 
>>>>>>> On Thu, Jul 7, 2011 at 12:22 PM, Christopher Cato <
>>>>>>> christopher.c...@minimedia.se> wrote:
>>>>>>> 
>>>>>>>> Hi, I'm running Solr 3.2 with edismax under Tomcat 6 via Drupal.
>>>>>>>> 
>>>>>>>> I'm having some problems writing a query that matches a specific field on
>>>>>>>> several words. I have implemented an AJAX search that basically takes
>>>>>>>> whatever is in a form field and attempts to match documents. I'm not having
>>>>>>>> much luck though. The first word always matches correctly, but as soon as I
>>>>>>>> enter the second word I'm losing matches, and the third word doesn't give any
>>>>>>>> matches at all.
>>>>>>>> 
>>>>>>>> The title field that I'm searching contains a product name that may or may
>>>>>>>> not have several words.
>>>>>>>> 
>>>>>>>> The requirement is that the search should be progressive, i.e. as the user
>>>>>>>> inputs words I should always return results that contain all of the words
>>>>>>>> entered. I also have to correct bad input like an erroneous space in the
>>>>>>>> product name, e.g. "product name" instead of "productname".
>>>>>>>> 
>>>>>>>> I'm wondering if there isn't an easier way to query Solr? Ideally I'd want
>>>>>>>> to say "give me all docs that have the following text in their titles". Is
>>>>>>>> that possible?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I'd really appreciate any help!
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Christopher Cato
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>>
