And don't you know, that EdgeNGram analyzer did the trick. Added the fieldtype, added a new field based on it, copyfielded the old title to it, reindexed and hey - it works brilliantly :)
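
For anyone reading along later, here is a rough sketch of what those steps might look like in schema.xml. It is not the exact declaration from the blog post. The names "text_edge" and "title_edge" are placeholders of mine, the gram sizes are assumed values, and the whitespace tokenizer is my own choice so that each word in the title is indexed by its prefixes.

```xml
<!-- Goes inside <types>. "text_edge" is a hypothetical name, not from the thread. -->
<fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every leading prefix of each word; the sizes are assumptions. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams and no stemming at query time: the typed prefix is matched as-is. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Goes inside <fields>. "title_edge" is a hypothetical name. -->
<field name="title_edge" type="text_edge" indexed="true" stored="false"/>

<!-- Goes after </fields>: fill the new field from the existing title. -->
<copyField source="title" dest="title_edge"/>
```

With something like this in place and a full reindex, a query along the lines of q=title_edge:(super AND techn) should match "Super Technocrane 30" without any wildcards, because the prefixes are already in the index.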
And you were right, the analysis output does make sense once it actually matches something :D

Thanks a million!

Best regards,

Christopher Cato
Head of Technology
-----------------------------------
MiniMedia
Phone: +46761927603
www.minimedia.se

On 8 Jul 2011, at 21:16, Erick Erickson wrote:

> Nope, that should do it (although I haven't tried that
> exact set of steps). But you do have to reindex
> from scratch....
>
> Best
> Erick
>
> On Fri, Jul 8, 2011 at 1:36 PM, Christopher Cato
> <christopher.c...@minimedia.se> wrote:
>> Thanks for that pointer, that's really more what I want to do. And actually,
>> EdgeNGrams is stuck somewhere in the back of my head :) Yes, simple at first
>> thought but not as easy to implement as I have discovered.
>>
>> Well, so how do I implement something like this? I took the fieldtype
>> declaration from that blog post, added it to my schema.xml within the
>> fieldtypes part.
>>
>> So, if I get it all correctly, all I have to do now is to add a new field
>> with the newly added fieldtype, a copyfield from the original title field,
>> change the query to use the new field and restart / reindex. Or am I missing
>> something?
>>
>> //Christopher
>>
>>
>> On 8 Jul 2011, at 18:59, Erick Erickson wrote:
>>
>>> Yeah, the analysis page takes a bit of getting used to, but it's well
>>> worth the time. Be sure to check the "verbose" box. Taking some time
>>> to understand what it's telling you is one of the best investments
>>> you'll make.
>>>
>>> Your "parts of words" is the issue. One approach is to use ngrams or
>>> edgengrams. Here's a writeup about edgengrams from Lucid:
>>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>>>
>>> It's written for autosuggest, but you get the idea. If the "partial" words
>>> might not be at the start of a word, then ngrams are a possibility....
>>>
>>> Your problem is one of those
>>> conceptually-simple-but-annoyingly-difficult-to-implement
>>> ones that takes far longer to fully understand/implement than
>>> it seems like it should.
>>>
>>> Best
>>> Erick
>>>
>>> On Fri, Jul 8, 2011 at 12:44 PM, Christopher Cato
>>> <christopher.c...@minimedia.se> wrote:
>>>> Hi Briggs, thanks for being patient with me!
>>>>
>>>> Yeah, I saw I had a typo there in the OR clause. Fixed it but still no
>>>> perfect results.
>>>> I'm looking at the analysis.jsp page and can't really figure it out.
>>>> Feeling a bit overwhelmed by all the output. I also don't know how to
>>>> check if stemming is used for the title field.
>>>>
>>>> Theoretically, given the field type I'm using and also given that "super
>>>> technocrane 30" is the title of one of the docs - how would one write the
>>>> query so that it finds that doc if the user searches for "super techn" or
>>>> "super technocrane"? Right now it stops matching in the middle of the word
>>>> "technocrane" or rather after the "r".
>>>>
>>>> Darnit, I just want to return all docs that contain the search terms
>>>> either as whole words or parts of words.
>>>> Is it possible?
>>>>
>>>> Regards,
>>>> Christopher
>>>>
>>>> On 8 Jul 2011, at 16:57, Briggs Thompson wrote:
>>>>
>>>>> Hey Chris,
>>>>> Removing the ORs in each query might help narrow down the problem, but I
>>>>> suggest you run this through the query analyzer in order to see where it
>>>>> is dropping out. It is a great tool for troubleshooting issues like these.
>>>>>
>>>>> I see a few things here.
>>>>>
>>>>> - For leading wildcard queries, you should include the
>>>>> ReversedWildcardFilterFactory. Check out the documentation here:
>>>>>
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ReversedWildcardFilterFactory
>>>>> - Your result might get dropped out because you are trying to do
>>>>> wildcard searches on a stemmed field.
>>>>> Wildcard searches on a stemmed field are counter-intuitive because if
>>>>> you index "computers", it may stem to "comput", in which case a
>>>>> wildcard query of "computer*" would not match.
>>>>> - If you want to support stemming and wildcard searches, I suggest
>>>>> creating a copy field with an un-stemmed field type definition.
>>>>>
>>>>> Don't forget if you modify your field type definition, you need to
>>>>> re-index.
>>>>>
>>>>> In response to your question about text_ws, this is just a different field
>>>>> type definition that essentially splits on whitespace. You should use
>>>>> that if that is what the desired search logic is, but it probably isn't.
>>>>> Check out the documentation on each of the tokenizers and filter factories
>>>>> in your "text" field type and see what you need and what you don't to
>>>>> satisfy your use cases.
>>>>>
>>>>> Hope that helps,
>>>>> Briggs Thompson
>>>>>
>>>>>
>>>>> On Fri, Jul 8, 2011 at 9:03 AM, Christopher Cato <
>>>>> christopher.c...@minimedia.se> wrote:
>>>>>
>>>>>> Hi Briggs. Thanks for taking the time.
>>>>>> I have the query nearly working now,
>>>>>> currently this is how it looks when it matches on the title "Super
>>>>>> Technocrane 30" and others with similar names:
>>>>>>
>>>>>> INFO: [] webapp=/solr path=/select/
>>>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocran*)+OR+(title:*super*+AND+*technocran)&qt=standard&fq=type:product+AND+language:sv}
>>>>>> hits=3 status=0 QTime=1
>>>>>>
>>>>>> Adding another letter stops it matching:
>>>>>>
>>>>>> INFO: [] webapp=/solr path=/select/
>>>>>> params={qf=title^40.0&hl.fl=title&wt=json&rows=10&fl=*,score&start=0&q=(title:*super*+AND+*technocrane*)+OR+(title:*super*+AND+*technocrane)&qt=standard&fq=type:product+AND+language:sv}
>>>>>> hits=0 status=0 QTime=0
>>>>>>
>>>>>> The field type definitions are as follows:
>>>>>>
>>>>>> <field name="title" type="text" indexed="true" stored="true"
>>>>>>        termVectors="true" omitNorms="true"/>
>>>>>>
>>>>>> <fieldType name="text" class="solr.TextField"
>>>>>>            positionIncrementGap="100">
>>>>>>   <analyzer type="index">
>>>>>>     <charFilter class="solr.MappingCharFilterFactory"
>>>>>>                 mapping="mapping-ISOLatin1Accent.txt"/>
>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>     <!-- in this example, we will only use synonyms at query time
>>>>>>     <filter class="solr.SynonymFilterFactory"
>>>>>>             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>>>>>     -->
>>>>>>     <!-- Case insensitive stop word removal.
>>>>>>          add enablePositionIncrements=true in both the index and query
>>>>>>          analyzers to leave a 'gap' for more accurate phrase queries.
>>>>>>     -->
>>>>>>     <filter class="solr.StopFilterFactory"
>>>>>>             ignoreCase="true"
>>>>>>             words="stopwords.txt"
>>>>>>             enablePositionIncrements="true"
>>>>>>             />
>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>             generateWordParts="1"
>>>>>>             generateNumberParts="1"
>>>>>>             catenateWords="1"
>>>>>>             catenateNumbers="1"
>>>>>>             catenateAll="0"
>>>>>>             splitOnCaseChange="1"
>>>>>>             preserveOriginal="1"/>
>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>>             protected="protwords.txt"/>
>>>>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>>   </analyzer>
>>>>>>   <analyzer type="query">
>>>>>>     <charFilter class="solr.MappingCharFilterFactory"
>>>>>>                 mapping="mapping-ISOLatin1Accent.txt"/>
>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>             ignoreCase="true" expand="true"/>
>>>>>>     <filter class="solr.StopFilterFactory"
>>>>>>             ignoreCase="true"
>>>>>>             words="stopwords.txt"
>>>>>>             enablePositionIncrements="true"
>>>>>>             />
>>>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>>>             generateWordParts="1"
>>>>>>             generateNumberParts="1"
>>>>>>             catenateWords="0"
>>>>>>             catenateNumbers="0"
>>>>>>             catenateAll="0"
>>>>>>             splitOnCaseChange="1"
>>>>>>             preserveOriginal="1"/>
>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>>             protected="protwords.txt"/>
>>>>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>>   </analyzer>
>>>>>> </fieldType>
>>>>>>
>>>>>>
>>>>>> There is also a type definition that is called text_ws, should I use that
>>>>>> instead and change text to text_ws in the field definition for title?
>>>>>>
>>>>>> <!-- A text field that only splits on whitespace for exact matching of
>>>>>>      words -->
>>>>>> <fieldType name="text_ws" class="solr.TextField"
>>>>>>            positionIncrementGap="100">
>>>>>>   <analyzer>
>>>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>   </analyzer>
>>>>>> </fieldType>
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Christopher Cato
>>>>>> Head of Technology
>>>>>> -----------------------------------
>>>>>> MiniMedia
>>>>>> Phone: +46761927603
>>>>>> www.minimedia.se
>>>>>>
>>>>>> On 7 Jul 2011, at 23:16, Briggs Thompson wrote:
>>>>>>
>>>>>>> Hello Christopher,
>>>>>>>
>>>>>>> Can you provide the exact query sent to Solr for the one word query and
>>>>>>> also the two word query? The field type definition for your title field
>>>>>>> would be useful too.
>>>>>>>
>>>>>>> From what I understand, Solr should be able to handle your use case. I
>>>>>>> am guessing it is a problem with how the field is defined assuming the
>>>>>>> query is correct.
>>>>>>>
>>>>>>> Briggs Thompson
>>>>>>>
>>>>>>> On Thu, Jul 7, 2011 at 12:22 PM, Christopher Cato <
>>>>>>> christopher.c...@minimedia.se> wrote:
>>>>>>>
>>>>>>>> Hi, I'm running Solr 3.2 with edismax under Tomcat 6 via Drupal.
>>>>>>>>
>>>>>>>> I'm having some problems writing a query that matches a specific field
>>>>>>>> on several words. I have implemented an AJAX search that basically takes
>>>>>>>> whatever is in a form field and attempts to match documents. I'm not
>>>>>>>> having much luck though. First word always matches correctly but as soon
>>>>>>>> as I enter the second word I'm losing matches, the third word doesn't
>>>>>>>> give any matches at all.
>>>>>>>>
>>>>>>>> The title field that I'm searching contains a product name that may or
>>>>>>>> may not have several words.
>>>>>>>>
>>>>>>>> The requirement is that the search should be progressive i.e.
>>>>>>>> as the user inputs words I should always return results that contain
>>>>>>>> all of the words entered. I also have to correct bad input like an
>>>>>>>> erroneous space in the product name, e.g. "product name" instead of
>>>>>>>> "productname".
>>>>>>>>
>>>>>>>> I'm wondering if there isn't an easier way to query Solr? Ideally I'd
>>>>>>>> want to say "give me all docs that have the following text in its
>>>>>>>> titles". Is that possible?
>>>>>>>>
>>>>>>>>
>>>>>>>> I'd really appreciate any help!
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christopher Cato