Hey Ahmet,

Yeah, I had missed Shawn's response; I'll have to give that a try as well. As for the version, we're using Solr 4.4.

StandardTokenizer does set the type attribute for HANGUL, HIRAGANA, IDEOGRAPHIC, KATAKANA, and SOUTHEAST_ASIAN tokens, and you're right, we're using TypeTokenFilter to remove those.
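For reference, here's roughly what our current chain looks like (a minimal sketch, not our exact schema; the fieldType name and types file name are made up):

    <fieldType name="text_no_cjk" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- blacklist mode: drop any token whose type is listed in the file -->
        <filter class="solr.TypeTokenFilterFactory" types="cjk-types.txt" useWhitelist="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

where cjk-types.txt just lists the StandardTokenizer type names, angle brackets included:

    <HANGUL>
    <HIRAGANA>
    <IDEOGRAPHIC>
    <KATAKANA>
    <SOUTHEAST_ASIAN>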
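And here's the MappingCharFilter setup I'm planning to test, going by your earlier suggestion (again just a sketch; the mapping file name is made up):

    <analyzer>
      <!-- runs before the tokenizer, so StandardTokenizer never sees the raw - or / -->
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delims.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- WDFF then splits on the underscore and glues the parts back together -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    </analyzer>

with mapping-delims.txt rewriting our problem characters to underscore, which StandardTokenizer doesn't break on:

    "-" => "_"
    "/" => "_"

The adverse scenarios I'm worried about come from the mapping being applied to the whole input before tokenization, so it also rewrites text where - and / are meaningful (dates written like 2014/05/20, for instance).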
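As I read the wdftypes.txt you linked, the types attribute would also let us fine-tune that WDFF if the defaults don't work out:

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"
            types="wdftypes.txt"/>

where a per-character override like

    \u002D => ALPHA

would make WDFF treat the hyphen as a letter instead of a delimiter (the allowed types being LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM).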
Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics

----- Original Message -----
> Hi Diego,
>
> Did you miss Shawn's response? His ICUTokenizerFactory solution is better
> than mine.
>
> By the way, what Solr version are you using? Does StandardTokenizer set the
> type attribute for CJK words?
>
> To filter out given types, you do not need a custom filter. TypeTokenFilter
> serves exactly that purpose.
> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-TypeTokenFilter
>
>
> On Tuesday, May 20, 2014 5:50 PM, Diego Fernandez <difer...@redhat.com>
> wrote:
> Great, thanks for the information! Right now we're using the
> StandardTokenizer types to filter out CJK characters with a custom filter.
> I'll test using MappingCharFilters, although I'm a little concerned with
> possible adverse scenarios.
>
> Diego Fernandez - 爱国
> Software Engineer
> US GSS Supportability - Diagnostics
>
>
> ----- Original Message -----
> > Hi Aiguofer,
> >
> > You mean ClassicTokenizer? Because StandardTokenizer does not set token
> > types (e-mail, url, etc).
> >
> > I wouldn't go with the JFlex edit, mainly because of maintenance costs.
> > It will be a burden to maintain a custom tokenizer.
> >
> > MappingCharFilters could be used to manipulate tokenizer behavior.
> >
> > Just an example: if you don't want your tokenizer to break on hyphens,
> > replace the hyphen with something your tokenizer does not break on, for
> > example an underscore.
> >
> > "-" => "_"
> >
> > Plus, WDF can be customized too. Please see the types attribute:
> > http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
> >
> > Ahmet
> >
> >
> > On Friday, May 16, 2014 6:24 PM, aiguofer <difer...@redhat.com> wrote:
> > Jack Krupansky-2 wrote
> > > Typically the white space tokenizer is the best choice when the word
> > > delimiter filter will be used.
> > >
> > > -- Jack Krupansky
> >
> > If we wanted to keep the StandardTokenizer (because we make use of the
> > token types) but wanted to use the WDFF to get combinations of words
> > that are split with certain characters (mainly - and /, but possibly
> > others as well), what is the suggested way of accomplishing this? Would
> > we just have to extend the JFlex file for the tokenizer and re-compile
> > it?
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/WordDelimiterFilterFactory-and-StandardTokenizer-tp4131628p4136146.html
> > Sent from the Solr - User mailing list archive at Nabble.com.