Hey Ahmet,

Yeah, I had missed Shawn's response; I'll have to give that a try as well. As for the version, we're using Solr 4.4.

StandardTokenizer does set the type attribute for HANGUL, HIRAGANA, IDEOGRAPHIC, KATAKANA, and SOUTHEAST_ASIAN tokens, and you're right, we're using TypeTokenFilter to remove those.
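For reference, here's roughly what our current chain looks like (a minimal sketch, not our exact schema; the fieldType name and types file name are made up):

    <fieldType name="text_no_cjk" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- blacklist mode: drop any token whose type is listed in the file -->
        <filter class="solr.TypeTokenFilterFactory" types="cjk-types.txt" useWhitelist="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

where cjk-types.txt just lists the StandardTokenizer type names, angle brackets included:

    <HANGUL>
    <HIRAGANA>
    <IDEOGRAPHIC>
    <KATAKANA>
    <SOUTHEAST_ASIAN>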
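And here's the MappingCharFilter setup I'm planning to test, going by your earlier suggestion (again just a sketch; the mapping file name is made up):

    <analyzer>
      <!-- runs before the tokenizer, so StandardTokenizer never sees the raw - or / -->
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-delims.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- WDFF then splits on the underscore and glues the parts back together -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    </analyzer>

with mapping-delims.txt rewriting our problem characters to underscore, which StandardTokenizer doesn't break on:

    "-" => "_"
    "/" => "_"

The adverse scenarios I'm worried about come from the mapping being applied to the whole input before tokenization, so it also rewrites text where - and / are meaningful (dates written like 2014/05/20, for instance).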
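As I read the wdftypes.txt you linked, the types attribute would also let us fine-tune that WDFF if the defaults don't work out:

    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"
            types="wdftypes.txt"/>

where a per-character override like

    \u002D => ALPHA

would make WDFF treat the hyphen as a letter instead of a delimiter (the allowed types being LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM).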
Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics

----- Original Message -----
> Hi Diego,
>
> Did you miss Shawn's response? His ICUTokenizerFactory solution is better
> than mine.
>
> By the way, what Solr version are you using? Does StandardTokenizer set the
> type attribute for CJK words?
>
> To filter out given types, you do not need a custom filter. TypeTokenFilter
> serves exactly that purpose.
> https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-TypeTokenFilter
>
>
> On Tuesday, May 20, 2014 5:50 PM, Diego Fernandez <difer...@redhat.com>
> wrote:
> Great, thanks for the information! Right now we're using the
> StandardTokenizer types to filter out CJK characters with a custom filter.
> I'll test using MappingCharFilters, although I'm a little concerned with
> possible adverse scenarios.
>
> Diego Fernandez - 爱国
> Software Engineer
> US GSS Supportability - Diagnostics
>
>
> ----- Original Message -----
> > Hi Aiguofer,
> >
> > You mean ClassicTokenizer? Because StandardTokenizer does not set token
> > types (e-mail, url, etc).
> >
> > I wouldn't go with the JFlex edit, mainly because of maintenance costs.
> > It will be a burden to maintain a custom tokenizer.
> >
> > MappingCharFilters could be used to manipulate tokenizer behavior.
> >
> > Just an example: if you don't want your tokenizer to break on hyphens,
> > replace the hyphen with something your tokenizer does not break on, for
> > example an underscore.
> >
> > "-" => "_"
> >
> > Plus, WDF can be customized too. Please see the types attribute:
> > http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/collection1/conf/wdftypes.txt
> >
> > Ahmet
> >
> >
> > On Friday, May 16, 2014 6:24 PM, aiguofer <difer...@redhat.com> wrote:
> > Jack Krupansky-2 wrote
> > > Typically the white space tokenizer is the best choice when the word
> > > delimiter filter will be used.
> > >
> > > -- Jack Krupansky
> >
> > If we wanted to keep the StandardTokenizer (because we make use of the
> > token types) but wanted to use the WDFF to get combinations of words
> > that are split with certain characters (mainly - and /, but possibly
> > others as well), what is the suggested way of accomplishing this? Would
> > we just have to extend the JFlex file for the tokenizer and re-compile
> > it?
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/WordDelimiterFilterFactory-and-StandardTokenizer-tp4131628p4136146.html
> > Sent from the Solr - User mailing list archive at Nabble.com.