Hi Torsten,

The Lucene StandardTokenizer is written in JFlex (http://jflex.de) - you can see the version 3.x specification at:
<http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex?view=markup>

You can make changes to this file, then run "ant jflex-StandardAnalyzer" from the checked-out branch_3x sources or from a source release (in the lucene/core/ directory in branch_3x, and in the lucene/ directory in a pre-3.6 source release) to regenerate the corresponding Java source at:

    lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.java

However, I recommend a simpler strategy: put a MappingCharFilter [1] in front of your tokenizer to map the tokens you want left intact to strings that the tokenizer will not break up. For example, Lucene-Core could be mapped to Lucene_Core, because UAX#29 [2], on which StandardTokenizer is based, treats the underscore as a "word" character and so leaves Lucene_Core as a single token. You would need to apply this strategy at both index time and query time.

(I was going to add that if you wanted your indexed tokens to keep their original form, you could add a MappingTokenFilter after your tokenizer to do the reverse mapping - but such a thing does not yet exist :( There is, however, a JIRA issue for the idea: <https://issues.apache.org/jira/browse/SOLR-1978>.)

Steve

[1] <http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/MappingCharFilter.html>
[2] <http://unicode.org/reports/tr29/>

> -----Original Message-----
> From: Torsten Krah [mailto:tk...@fachschaft.imn.htwk-leipzig.de]
> Sent: Friday, February 17, 2012 9:15 AM
> To: solr-user@lucene.apache.org
> Subject: customizing standard tokenizer
>
> Hi,
>
> is it possible to extend the standard tokenizer or use a custom one
> (possible via extending the standard one) to add some "custom" tokens
> like Lucene-Core to be "one" token.
>
> regards
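P.S. Since you're on Solr, here is a sketch of how the MappingCharFilter strategy could be wired into a fieldType in schema.xml. The fieldType name and the mapping file name ("mapping-preserve.txt") are placeholders of my own choosing, not anything shipped with Solr; because the analyzer has no type attribute, the same chain is used at both index time and query time, which is what this approach requires.

```xml
<!-- Sketch only: "text_preserve" and "mapping-preserve.txt" are made-up names.
     The char filter rewrites the raw character stream before the tokenizer
     sees it, so "Lucene-Core" reaches StandardTokenizer as "Lucene_Core". -->
<fieldType name="text_preserve" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-preserve.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The mapping file goes in your conf/ directory, one rule per line:

```
# mapping-preserve.txt
"Lucene-Core" => "Lucene_Core"
```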