RE: Need tokenization that finds part of stringvalue

Dyer, James Thu, 01 Mar 2012 08:07:43 -0800

Speaking of which, there is a spellchecker patch in JIRA that will detect 
word-break errors like this.  See "WordBreakSpellChecker" at 
https://issues.apache.org/jira/browse/LUCENE-3523 .


To use it with Solr, you'd also need to apply SOLR-2993 
(https://issues.apache.org/jira/browse/SOLR-2993).  This Solr piece will take 
the results of your "normal" spellchecker and integrate them with the results 
from the WordBreakSpellChecker.  
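
For illustration, a rough sketch of how this combination might look in 
solrconfig.xml.  The class name solr.WordBreakSolrSpellChecker and its 
parameter names follow the form documented for later Solr 4.x releases, so 
treat them as assumptions with respect to the patches as posted.  The field 
name title_search is taken from the schema quoted below, and 
solr.DirectSolrSpellChecker merely stands in for whatever "normal" 
spellchecker you already use.

```xml
<!-- Sketch only: names follow later Solr 4.x documentation and may not
     match the SOLR-2993 patch exactly. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- The "normal" spellchecker (assumed here: DirectSolrSpellChecker) -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="field">title_search</str>
  </lst>
  <!-- Word-break spellchecker: tries splitting and joining query terms -->
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">title_search</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>
```

With a setup along these lines, both dictionaries would be named on the 
request (spellcheck.dictionary=default&spellcheck.dictionary=wordbreak) so 
that their suggestions are merged.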

These patches are for Trunk/4.x, and you'd have to apply them as described 
here:  
http://wiki.apache.org/solr/HowToContribute#Review.2BAC8-Improve_Existing_Patches

I would appreciate it if you tried these out and provided feedback on the JIRA 
issues as to how well they work for you and how they can be improved.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Thursday, March 01, 2012 9:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Need tokenization that finds part of stringvalue

I once used a spell checker to break up compound words. It was slow, but worked 
pretty well.

wunder

On Mar 1, 2012, at 5:53 AM, Erick Erickson wrote:

> Right, there's nothing in Solr that I know of that'll help here. How would
> a tokenizer understand that "smartphone" should be "smart" "phone"?
> There's no general solution for this issue.
> 
> You can do domain-specific solutions with synonyms for instance, or
> some other word list that contains terms you're interested in, entries
> like smartphone => smart phone
> but that has the obvious drawback of requiring that you know all the
> terms that might be smashed together.
> 
> You *might* be able to do something with shingles, but I'm a little unclear
> on how.
> 
> Best
> Erick
> 
> On Tue, Feb 28, 2012 at 4:05 PM, PeterKerk <vettepa...@hotmail.com> wrote:
>> I have the following in my schema.xml
>> 
>> <field name="title" type="text_ws" indexed="true" stored="true"/>
>> <field name="title_search" type="text" indexed="true" stored="true"/>
>> 
>> 
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>  <analyzer type="index">
>>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_dutch.txt"/>
>>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> catenateAll="0" splitOnCaseChange="1"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>  </analyzer>
>>  <analyzer type="query">
>>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> ignoreCase="true" expand="true"/>
>>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_dutch.txt"/>
>>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>> generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="1"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>> 
>>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>  </analyzer>
>> </fieldType>
>> 
>> 
>> I want to search on field "title".
>> Now my field title holds the value "great smartphone".
>> If I search on "smartphone" the item is found. But I also want the item to
>> be found on "great" or "phone", and that doesn't work.
>> I have been playing around with the tokenizer test function, but have failed
>> to find the definition for the "text" fieldtype I need.
>> Help? :)
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Need-tokenization-that-finds-part-of-stringvalue-tp3785366p3785366.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

--
Walter Underwood
wun...@wunderwood.org
