Re: Need tokenization that finds part of stringvalue

Walter Underwood Thu, 01 Mar 2012 07:59:43 -0800

I once used a spell checker to break up compound words. It was slow, but worked 
pretty well.
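
A minimal sketch of that idea, assuming a plain word list stands in for the spell checker's dictionary. The word list, the `split_compound` name, and the minimum part length are all made up for illustration; this is not the code that was actually used. It tries every way to cut the token into known words and keeps the split with the fewest parts:

```python
from functools import lru_cache

# Hypothetical word list standing in for a spell checker's dictionary.
WORDS = frozenset({"smart", "phone", "great", "art", "one"})
MIN_PART = 3  # assumed minimum length of a compound part


def split_compound(token, words=WORDS, min_part=MIN_PART):
    """Split token into the fewest dictionary words, or return [token] if no split exists."""
    token = token.lower()

    @lru_cache(maxsize=None)
    def best(start):
        # Returns a tuple of words covering token[start:], or None if impossible.
        if start == len(token):
            return ()
        result = None
        for end in range(start + min_part, len(token) + 1):
            piece = token[start:end]
            if piece not in words:
                continue
            rest = best(end)
            if rest is None:
                continue
            candidate = (piece,) + rest
            if result is None or len(candidate) < len(result):
                result = candidate
        return result

    parts = best(0)
    return list(parts) if parts else [token]


print(split_compound("smartphone"))
print(split_compound("tablet"))
```

The parts could then be indexed alongside the original token, so that a query for "phone" also matches "smartphone". A real dictionary will produce more false splits than this toy list, which is one reason to prefer the split with the fewest parts.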


wunder

On Mar 1, 2012, at 5:53 AM, Erick Erickson wrote:

> Right, there's nothing in Solr that I know of that'll help here. How would
> a tokenizer understand that "smartphone" should be "smart" "phone"?
> There's no general solution for this issue.
> 
> You can do domain-specific solutions with synonyms for instance, or
> some other word list that contains terms you're interested in, entries
> like smartphone => smart phone
> but that has the obvious drawback of requiring that you know all the
> terms that might be smashed together.
> 
> You *might* be able to do something with shingles, but I'm a little unclear
> on how.
> 
> Best
> Erick
> 
> On Tue, Feb 28, 2012 at 4:05 PM, PeterKerk <vettepa...@hotmail.com> wrote:
>> I have the following in my schema.xml
>> 
>> <field name="title" type="text_ws" indexed="true" stored="true"/>
>> <field name="title_search" type="text" indexed="true" stored="true"/>
>> 
>> 
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords_dutch.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords_dutch.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>> 
>> 
>> I want to search on field "title".
>> Now my field title holds the value "great smartphone".
>> If I search on "smartphone" the item is found. But I also want the item to
>> be found on "great" or "phone", and that doesn't work.
>> I have been playing around with the tokenizer test function, but have failed
>> to find the definition for the "text" fieldtype I need.
>> Help? :)

--
Walter Underwood
wun...@wunderwood.org
