Re: WordDelimiterFilterFactory and StandardTokenizer

Ahmet Arslan Tue, 20 May 2014 08:27:37 -0700

Hi Diego,

Did you miss Shawn's response? His ICUTokenizerFactory solution is better than 
mine.


By the way, what Solr version are you using? Does StandardTokenizer set the type 
attribute for CJK words?

To filter out given types, you do not need a custom filter. The Type Token filter 
serves exactly that purpose.
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-TypeTokenFilter
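
A minimal sketch of what that could look like in schema.xml. The field type 
name and the stoptypes.txt file name are made up for illustration, and it 
assumes the tokenizer in your version tags CJK tokens with types such as 
<IDEOGRAPHIC>, so please check the actual type names in the analysis page first:

```xml
<!-- Hypothetical field type; the name and the stoptypes.txt file are assumptions. -->
<fieldType name="text_no_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- stoptypes.txt lists one token type per line, for example:
           <IDEOGRAPHIC>
           <HIRAGANA>
           <KATAKANA>
           <HANGUL>
         With useWhitelist="false", tokens carrying one of these types are dropped. -->
    <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt" useWhitelist="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Setting useWhitelist="true" instead would keep only the listed types.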



On Tuesday, May 20, 2014 5:50 PM, Diego Fernandez <difer...@redhat.com> wrote:
Great, thanks for the information!  Right now we're using the StandardTokenizer 
types to filter out CJK characters with a custom filter.  I'll test using 
MappingCharFilters, although I'm a little concerned with possible adverse 
scenarios.  

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics



----- Original Message -----
> Hi Aiguofer,
> 
> You mean ClassicTokenizer? Because StandardTokenizer does not set token types
> (e-mail, url, etc).
> 
> 
> I wouldn't go with the JFlex edit, mainly because of maintenance costs. It will
> be a burden to maintain a custom tokenizer.
> 
> MappingCharFilters could be used to manipulate tokenizer behavior.
> 
> Just an example: if you don't want your tokenizer to break on hyphens,
> replace the hyphen with a character that your tokenizer does not break on,
> for example an underscore.
> 
> "-" => "_"
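> 
> In schema.xml this could look roughly like the sketch below. The field type
> name and the mapping file name are made up for illustration, and the WDF
> options are only one possible combination. The char filter has to be declared
> before the tokenizer, since it rewrites the text the tokenizer sees.
> 
> ```xml
> <!-- Hypothetical field type; names and file name are assumptions. -->
> <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <!-- mapping-hyphen.txt contains the single rule:
>            "-" => "_"
>     -->
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-hyphen.txt"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <!-- WDF treats the underscore as a delimiter, so it can emit the parts
>          and, with catenateWords, the joined form as well. -->
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" catenateWords="1"/>
>   </analyzer>
> </fieldType>
> ```
> 
> Note that text which already contains underscores goes through WDF the same
> way, so it is worth checking the result in the analysis page.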
> 
> 
> 
> Plus, WDF can be customized too. Please see the types attribute:
> 
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
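> 
> A small sketch of how such a types file is wired in. The file name is made up
> for illustration, and the rules are only examples of the "character => TYPE"
> syntax, not a recommendation for your data.
> 
> ```xml
> <!-- wdftypes-custom.txt (hypothetical file), one rule per line:
>        $ => DIGIT
>        % => DIGIT
>        @ => ALPHA
>      Valid types include LOWER, UPPER, ALPHA, DIGIT, ALPHANUM and SUBWORD_DELIM. -->
> <filter class="solr.WordDelimiterFilterFactory" types="wdftypes-custom.txt"
>         generateWordParts="1" generateNumberParts="1" catenateWords="1"/>
> ```
> 
> Keep in mind that the types file only changes what WDF does with the tokens
> it receives; characters the tokenizer has already split on never reach it.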
> 
>  
> Ahmet
> 
> 
> On Friday, May 16, 2014 6:24 PM, aiguofer <difer...@redhat.com> wrote:
> Jack Krupansky-2 wrote
> 
> > Typically the white space tokenizer is the best choice when the word
> > delimiter filter will be used.
> > 
> > -- Jack Krupansky
> 
> If we wanted to keep the StandardTokenizer (because we make use of the token
> types) but wanted to use the WDFF to get combinations of words that are
> split with certain characters (mainly - and /, but possibly others as well),
> what is the suggested way of accomplishing this? Would we just have to
> extend the JFlex file for the tokenizer and re-compile it?
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/WordDelimiterFilterFactory-and-StandardTokenizer-tp4131628p4136146.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
>
