Hi Torsten,

The Lucene StandardTokenizer is written in JFlex (http://jflex.de) - you can see the version 3.x specification at:
<http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>

You can make changes to this file, then run "ant jflex-StandardAnalyzer" from the checked-out branch_3x sources or from a source release (in the lucene/core/ directory in branch_3x, and in the lucene/ directory in a pre-3.6 source release) to regenerate the corresponding Java source at:

    lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java

However, I recommend a simpler strategy: put a MappingCharFilter [1] in front of your tokenizer to map the tokens you want left intact to strings that the tokenizer will not break up. For example, Lucene-Core could be mapped to Lucene_Core, because UAX#29 [2], on which StandardTokenizer is based, treats the underscore as a "word" character and so leaves Lucene_Core as a single token. You would need to apply this strategy at both index time and query time.

(I was going to add that if you wanted your indexed tokens to keep their original form, you could add a MappingTokenFilter after your tokenizer to do the reverse mapping - but such a thing does not yet exist :( There is, however, a JIRA issue for the idea: <https://issues.apache.org/jira/browse/SOLR-1978>.)

Steve

[1] <http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/MappingCharFilter.html>
[2] <http://unicode.org/reports/tr29/>

> -----Original Message-----
> From: Torsten Krah [mailto:tk...@fachschaft.imn.htwk-leipzig.de]
> Sent: Friday, February 17, 2012 9:15 AM
> To: solr-user@lucene.apache.org
> Subject: customizing standard tokenizer
>
> Hi,
>
> is it possible to extend the standard tokenizer or use a custom one
> (possible via extending the standard one) to add some "custom" tokens
> like Lucene-Core to be "one" token.
>
> regards
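P.S. Since you're on Solr, here is a sketch of how the MappingCharFilter strategy could be wired into a fieldType in schema.xml. The fieldType name and the mapping file name ("mapping-preserve.txt") are placeholders of my own choosing, not anything shipped with Solr; because the analyzer has no type attribute, the same chain is used at both index time and query time, which is what this approach requires.

```xml
<!-- Sketch only: "text_preserve" and "mapping-preserve.txt" are made-up names.
     The char filter rewrites the raw character stream before the tokenizer
     sees it, so "Lucene-Core" reaches StandardTokenizer as "Lucene_Core". -->
<fieldType name="text_preserve" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-preserve.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The mapping file goes in your conf/ directory, one rule per line:

```
# mapping-preserve.txt
"Lucene-Core" => "Lucene_Core"
```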