Speaking of which, there is a spellchecker in JIRA that will detect word-break errors like this. See "WordBreakSpellChecker" at https://issues.apache.org/jira/browse/LUCENE-3523 .
To use it with Solr, you'd also need to apply SOLR-2993 (https://issues.apache.org/jira/browse/SOLR-2993). This Solr piece will take the results of your "normal" spellchecker and integrate them with the results from the WordBreakSpellChecker.

These patches are for Trunk/4.x, and you'd have to apply them as described here: http://wiki.apache.org/solr/HowToContribute#Review.2BAC8-Improve_Existing_Patches

I would appreciate it if you tried these out and provided feedback on the JIRA issues as to how it works for you and how it can be improved.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday, March 01, 2012 9:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Need tokenization that finds part of stringvalue

I once used a spell checker to break up compound words. It was slow, but worked pretty well.

wunder

On Mar 1, 2012, at 5:53 AM, Erick Erickson wrote:

> Right, there's nothing in Solr that I know of that'll help here. How would
> a tokenizer understand that "smartphone" should be "smart" "phone"?
> There's no general solution for this issue.
>
> You can do domain-specific solutions with synonyms for instance, or
> some other word list that contains terms you're interested in, entries
> like smartphone => smart phone
> but that has the obvious drawback of requiring that you know all the
> terms that might be smashed together.
>
> You *might* be able to do something with shingles, but I'm a little unclear
> on how.
>
> Best
> Erick
>
> On Tue, Feb 28, 2012 at 4:05 PM, PeterKerk <vettepa...@hotmail.com> wrote:
>> I have the following in my schema.xml
>>
>> <field name="title" type="text_ws" indexed="true" stored="true"/>
>> <field name="title_search" type="text" indexed="true" stored="true"/>
>>
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>       words="stopwords_dutch.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>       generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>       catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>       ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>       words="stopwords_dutch.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>       generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>       catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>>
>> I want to search on field "title".
>> Now my field title holds the value "great smartphone".
>> If I search on "smartphone" the item is found. But I want the item also to
>> be found on "great" or "phone", and that doesn't work.
>> I have been playing around with the tokenizer test function, but have failed
>> to find the definition for the "text" fieldtype I need.
>> Help? :)
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Need-tokenization-that-finds-part-of-stringvalue-tp3785366p3785366.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

--
Walter Underwood
wun...@wunderwood.org
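
For illustration only, the word-break idea discussed in this thread (splitting "smartphone" into "smart" and "phone" by checking candidate splits against known terms) can be sketched in a few lines of Python. This is not the algorithm in LUCENE-3523 or SOLR-2993, and it does not use any Solr or Lucene API. The function names, the `min_part` parameter, and the tiny vocabulary are all hypothetical; in practice the vocabulary would probably come from the indexed terms or a word list.

```python
from typing import List, Optional, Set


def split_compound(word: str, vocabulary: Set[str], min_part: int = 3) -> Optional[List[str]]:
    """Return the first two-part split whose halves are both known terms, else None.

    Hypothetical helper: a plain dictionary lookup, not Lucene's WordBreakSpellChecker.
    """
    lowered = word.lower()
    # Try every split point that leaves at least `min_part` characters on each side.
    for i in range(min_part, len(lowered) - min_part + 1):
        left, right = lowered[:i], lowered[i:]
        if left in vocabulary and right in vocabulary:
            return [left, right]
    return None


def expand_tokens(text: str, vocabulary: Set[str]) -> List[str]:
    """Whitespace-tokenize `text`, keeping each token and appending its parts if it splits."""
    tokens: List[str] = []
    for token in text.split():
        tokens.append(token.lower())
        parts = split_compound(token, vocabulary)
        if parts:
            tokens.extend(parts)
    return tokens


# Assumed vocabulary for the example from the thread.
VOCABULARY = {"great", "smart", "phone"}

print(expand_tokens("great smartphone", VOCABULARY))
```

As Erick notes, this only works for parts that are already in the word list, and trying every split point for every token is one reason such an approach can be slow on a large vocabulary.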