Great, thanks for the information! Right now we're using the StandardTokenizer token types to filter out CJK characters with a custom filter. I'll test using MappingCharFilters, although I'm a little concerned about possible adverse side effects.
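
For reference, here's roughly the field type I'm planning to test with; the type name and the mapping file name are placeholders, and the WDF options would need tuning for our schema:

    <fieldType name="text_mapped" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- hypothetical mapping file, e.g. containing:  "-" => "_"  -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- per-character type overrides for WDF via the types attribute -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" splitOnCaseChange="1"
                types="wdftypes.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>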
Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics

----- Original Message -----
> Hi Aiguofer,
>
> You mean ClassicTokenizer? Because StandardTokenizer does not set token types
> (e-mail, url, etc).
>
> I wouldn't go with the JFlex edit, mainly because of maintenance costs. It will
> be a burden to maintain a custom tokenizer.
>
> MappingCharFilters could be used to manipulate tokenizer behavior.
>
> Just an example: if you don't want your tokenizer to break on hyphens,
> replace the hyphen with something your tokenizer does not break on, for
> example an underscore.
>
> "-" => "_"
>
> Plus, WDF can be customized too. Please see the types attribute:
>
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
>
> Ahmet
>
>
> On Friday, May 16, 2014 6:24 PM, aiguofer <[email protected]> wrote:
> Jack Krupansky-2 wrote
> > Typically the white space tokenizer is the best choice when the word
> > delimiter filter will be used.
> >
> > -- Jack Krupansky
>
> If we wanted to keep the StandardTokenizer (because we make use of the token
> types) but wanted to use the WDFF to get combinations of words that are
> split on certain characters (mainly - and /, but possibly others as well),
> what is the suggested way of accomplishing this? Would we just have to
> extend the JFlex file for the tokenizer and re-compile it?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/WordDelimiterFilterFactory-and-StandardTokenizer-tp4131628p4136146.html
> Sent from the Solr - User mailing list archive at Nabble.com.
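
P.S. If we end up going the WDF route, my understanding is that the types file Ahmet links to maps individual characters to WDF types (ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM, etc.), so entries along these lines should keep "-" and "/" from being treated as delimiters (illustrative only, untested on our data):

    # treat hyphen and slash as ordinary letters so WDF does not split on them
    - => ALPHA
    / => ALPHA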
