Great, thanks for the information! Right now we're using the StandardTokenizer token types to filter out CJK characters with a custom filter. I'll test using MappingCharFilters, although I'm a little concerned about possible adverse side effects.
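
For reference, here's roughly the field type I'm planning to test with; the type name and the mapping file name are placeholders, and the WDF options would need tuning for our schema:

    <fieldType name="text_mapped" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- hypothetical mapping file, e.g. containing:  "-" => "_"  -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- per-character type overrides for WDF via the types attribute -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" splitOnCaseChange="1"
                types="wdftypes.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>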
Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics

----- Original Message -----
> Hi Aiguofer,
>
> You mean ClassicTokenizer? Because StandardTokenizer does not set token types
> (e-mail, url, etc).
>
> I wouldn't go with the JFlex edit, mainly because of maintenance costs. It will
> be a burden to maintain a custom tokenizer.
>
> MappingCharFilters could be used to manipulate tokenizer behavior.
>
> Just an example: if you don't want your tokenizer to break on hyphens,
> replace the hyphen with something your tokenizer does not break on, for
> example an underscore.
>
> "-" => "_"
>
> Plus, WDF can be customized too. Please see the types attribute:
>
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
>
> Ahmet
>
>
> On Friday, May 16, 2014 6:24 PM, aiguofer <[email protected]> wrote:
> Jack Krupansky-2 wrote
> > Typically the white space tokenizer is the best choice when the word
> > delimiter filter will be used.
> >
> > -- Jack Krupansky
>
> If we wanted to keep the StandardTokenizer (because we make use of the token
> types) but wanted to use the WDFF to get combinations of words that are
> split on certain characters (mainly - and /, but possibly others as well),
> what is the suggested way of accomplishing this? Would we just have to
> extend the JFlex file for the tokenizer and re-compile it?
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/WordDelimiterFilterFactory-and-StandardTokenizer-tp4131628p4136146.html
> Sent from the Solr - User mailing list archive at Nabble.com.
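
P.S. If we end up going the WDF route, my understanding is that the types file Ahmet links to maps individual characters to WDF types (ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM, etc.), so entries along these lines should keep "-" and "/" from being treated as delimiters (illustrative only, untested on our data):

    # treat hyphen and slash as ordinary letters so WDF does not split on them
    - => ALPHA
    / => ALPHA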
