Thanks for that pointer; that's much closer to what I want to do. And actually, EdgeNGrams had been stuck somewhere in the back of my head :) Yes, it seems simple at first thought, but it's not as easy to implement, as I have discovered.
Well, so how do I implement something like this? I took the fieldtype declaration from that blog post and added it to the fieldtypes part of my schema.xml. So, if I understand it all correctly, all I have to do now is add a new field with the newly added fieldtype, add a copyField from the original title field, change the query to use the new field, and restart / reindex. Or am I missing something?

//Christopher

On 8 Jul 2011, at 18:59, Erick Erickson wrote:

> Yeah, the analysis page takes a bit of getting used to, but it's well
> worth the time. Be sure to check the "verbose" box. Taking some time
> to understand what it's telling you is one of the best investments
> you'll make.
>
> Your "parts of words" is the issue. One approach is to use ngrams or
> edgengrams. Here's a writeup about edgengrams from Lucid:
> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>
> It's written for autosuggest, but you get the idea. If "partial" words
> could be not at the start then ngrams are a possibility....
>
> Your problem is one of those
> conceptually-simple-but-annoyingly-difficult-to-implement
> ones that takes far longer to fully understand/implement than
> it seems like it should.
>
> Best
> Erick
>
> On Fri, Jul 8, 2011 at 12:44 PM, Christopher Cato
> <[email protected]> wrote:
>> Hi Briggs, thanks for being patient with me!
>>
>> Yeah, I saw I had a typo there in the OR clause. Fixed it but still no
>> perfect results.
>> I'm looking at the analysis.jsp page and can't really figure it out. Feeling
>> a bit overwhelmed by all the output. I also don't know how to check if
>> stemming is used for the title field.
>>
>> Theoretically, given the field type I'm using and also given that "super
>> technocrane 30" is the title of one of the docs - how would one write the
>> query so that it finds that doc if the user searches for "super techn" or
>> "super technocrane"? Right now it stops matching in the middle of the word
>> "technocrane" or rather after the "r".
>>
>> Darnit, I just want to return all docs that contain the search terms either
>> as whole words or parts of words.
>> Is it possible?
>>
>> Regards,
>> Christopher
>>
>> On 8 Jul 2011, at 16:57, Briggs Thompson wrote:
>>
>>> Hey Chris,
>>> Removing the ORs in each query might help narrow down the problem, but I
>>> suggest you run this through the query analyzer in order to see where it is
>>> dropping out. It is a great tool for troubleshooting issues like these.
>>>
>>> I see a few things here.
>>>
>>>   - For leading wildcard queries, you should include the
>>>     ReversedWildcardFilterFactory. Check out the documentation here:
>>>     http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ReversedWildcardFilterFactory
>>>   - Your result might get dropped out because you are trying to do wildcard
>>>     searches on a stemmed field. Wildcard searches on a stemmed field are
>>>     counter-intuitive because if you index "computers", it may stem to
>>>     "comput", in which case a wildcard query of "computer*" would not match.
>>>   - If you want to support stemming and wildcard searches, I suggest
>>>     creating a copy field with an un-stemmed field type definition.
>>>
>>> Don't forget if you modify your field type definition, you need to
>>> re-index.
>>>
>>> In response to your question about text_ws, this is just a different field
>>> type definition that essentially splits on whitespace. You should use that
>>> if that is what the desired search logic is, but it probably isn't. Check
>>> out the documentation on each of the tokenizers and filter factories in your
>>> "text" field type and see what you need and what you don't to satisfy your
>>> use cases.
>>>
>>> Hope that helps,
>>> Briggs Thompson
>>>
>>>
>>> On Fri, Jul 8, 2011 at 9:03 AM, Christopher Cato
>>> <[email protected]> wrote:
>>>
>>>> Hi Briggs. Thanks for taking the time. I have the query nearly working now,
>>>> currently this is how it looks when it matches on the title "Super
>>>> Technocrane 30" and others with similar names:
>>>>
>>>> INFO: [] webapp=/solr path=/select/
>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocran*)+OR+(title:*super*+AND+*technocran)&qt=standard&fq=type:product+AND+language:sv}
>>>> hits=3 status=0 QTime=1
>>>>
>>>> Adding another letter stops it matching:
>>>>
>>>> INFO: [] webapp=/solr path=/select/
>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocrane*)+OR+(title:*super*+AND+*technocrane)&qt=standard&fq=type:product+AND+language:sv}
>>>> hits=0 status=0 QTime=0
>>>>
>>>> The field type definitions are as follows:
>>>>
>>>> <field name="title" type="text" indexed="true" stored="true"
>>>>        termVectors="true" omitNorms="true"/>
>>>>
>>>> <fieldType name="text" class="solr.TextField"
>>>>            positionIncrementGap="100">
>>>>   <analyzer type="index">
>>>>     <charFilter class="solr.MappingCharFilterFactory"
>>>>                 mapping="mapping-ISOLatin1Accent.txt"/>
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <!-- in this example, we will only use synonyms at query time
>>>>     <filter class="solr.SynonymFilterFactory"
>>>>             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>     -->
>>>>     <!-- Case insensitive stop word removal.
>>>>          add enablePositionIncrements=true in both the index and query
>>>>          analyzers to leave a 'gap' for more accurate phrase queries.
>>>>     -->
>>>>     <filter class="solr.StopFilterFactory"
>>>>             ignoreCase="true"
>>>>             words="stopwords.txt"
>>>>             enablePositionIncrements="true"
>>>>             />
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             generateWordParts="1"
>>>>             generateNumberParts="1"
>>>>             catenateWords="1"
>>>>             catenateNumbers="1"
>>>>             catenateAll="0"
>>>>             splitOnCaseChange="1"
>>>>             preserveOriginal="1"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <charFilter class="solr.MappingCharFilterFactory"
>>>>                 mapping="mapping-ISOLatin1Accent.txt"/>
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>             ignoreCase="true" expand="true"/>
>>>>     <filter class="solr.StopFilterFactory"
>>>>             ignoreCase="true"
>>>>             words="stopwords.txt"
>>>>             enablePositionIncrements="true"
>>>>             />
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             generateWordParts="1"
>>>>             generateNumberParts="1"
>>>>             catenateWords="0"
>>>>             catenateNumbers="0"
>>>>             catenateAll="0"
>>>>             splitOnCaseChange="1"
>>>>             preserveOriginal="1"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>>
>>>> There is also a type definition that is called text_ws, should I use that
>>>> instead and change text to text_ws in the field definition for title?
>>>>
>>>> <!-- A text field that only splits on whitespace for exact matching of
>>>>      words -->
>>>> <fieldType name="text_ws" class="solr.TextField"
>>>>            positionIncrementGap="100">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Christopher Cato
>>>> Technical Director
>>>> -----------------------------------
>>>> MiniMedia
>>>> Phone: +46761927603
>>>> www.minimedia.se
>>>>
>>>> On 7 Jul 2011, at 23:16, Briggs Thompson wrote:
>>>>
>>>>> Hello Christopher,
>>>>>
>>>>> Can you provide the exact query sent to Solr for the one word query and
>>>>> also the two word query? The field type definition for your title field
>>>>> would be useful too.
>>>>>
>>>>> From what I understand, Solr should be able to handle your use case. I am
>>>>> guessing it is a problem with how the field is defined assuming the query
>>>>> is correct.
>>>>>
>>>>> Briggs Thompson
>>>>>
>>>>> On Thu, Jul 7, 2011 at 12:22 PM, Christopher Cato
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi, I'm running Solr 3.2 with edismax under Tomcat 6 via Drupal.
>>>>>>
>>>>>> I'm having some problems writing a query that matches a specific field
>>>>>> on several words. I have implemented an AJAX search that basically takes
>>>>>> whatever is in a form field and attempts to match documents. I'm not
>>>>>> having much luck though. First word always matches correctly but as soon
>>>>>> as I enter the second word I'm losing matches, the third word doesn't
>>>>>> give any matches at all.
>>>>>>
>>>>>> The title field that I'm searching contains a product name that may or
>>>>>> may not have several words.
>>>>>>
>>>>>> The requirement is that the search should be progressive, i.e. as the
>>>>>> user inputs words I should always return results that contain all of the
>>>>>> words entered. I also have to correct bad input like an erroneous space
>>>>>> in the product name, e.g. "product name" instead of "productname".
>>>>>>
>>>>>> I'm wondering if there isn't an easier way to query Solr? Ideally I'd
>>>>>> want to say "give me all docs that have the following text in its
>>>>>> titles". Is that possible?
>>>>>>
>>>>>>
>>>>>> I'd really appreciate any help!
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christopher Cato
>>>>
>>>>
>>
>>

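For reference, the plan described at the top of this message (a new field type, a new field using it, and a copyField from title) might look roughly like the following in schema.xml. This is only a sketch under assumptions: the names "text_edge" and "title_edge" and the gram sizes are placeholders invented for illustration, and the analyzer chain is not copied from the blog post. The three elements would go in the field types, fields, and top-level parts of the schema respectively.

```xml
<!-- Hypothetical field type; the name "text_edge" is a placeholder. -->
<fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index the leading prefixes of each word (t, te, tec, ...).
         The gram sizes here are assumptions, not tuned values. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gramming at query time: the typed prefix is matched as-is. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field; "title_edge" is a placeholder name. -->
<field name="title_edge" type="text_edge" indexed="true" stored="false"/>

<!-- Copy the raw title text into the new field at index time. -->
<copyField source="title" dest="title_edge"/>
```

With something like this in place, and after a restart and full reindex, a query along the lines of title_edge:(super AND techn) would be expected to match "Super Technocrane 30" without any wildcards, because the prefixes are already in the index. No stemmer is included in the sketch, since stemming was the suspected reason the wildcard queries dropped out.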