Preserving "C++" and other weird tokens

Michael _ Thu, 06 Aug 2009 08:39:12 -0700

Hi everyone,
I'm indexing several documents that contain terms the StandardTokenizer
does not keep intact as single tokens.  These are terms like
  C#
  .NET
  C++
which are important for users to be able to search for, but they get reduced
to "C", "NET", and "C" respectively.


How can I create a list of words that should be understood to be indivisible
tokens?  Is my only option somehow stringing together a lot of
PatternTokenizers?  I'd love to do something like <tokenizer
class="StandardTokenizer" tokenwhitelist=".NET C++ C#" />.

Thanks in advance!
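
To make the requested behaviour concrete, here is a minimal sketch of a whitelist-aware tokenizer in plain Python. It is not Solr or Lucene code, and it does not use any StandardTokenizer option. The names make_tokenizer and protected are hypothetical, chosen only for illustration. It assumes that a protected term should match only when it is not glued to other word characters, and that matching should ignore case.

```python
import re

def make_tokenizer(protected):
    """Build a tokenizer that keeps the given terms as single tokens.

    `protected` is a hypothetical whitelist such as [".NET", "C++", "C#"].
    """
    # Try longer terms first so a longer entry wins over a shorter
    # overlapping one, then escape regex metacharacters like "+" and ".".
    ordered = sorted(protected, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(
        # Protected terms, only when not attached to other word characters.
        r"(?<!\w)(?:" + alternatives + r")(?!\w)"
        # Fallback: ordinary runs of letters, digits and underscores.
        r"|\w+",
        re.IGNORECASE,
    )

    def tokenize(text):
        return [match.group(0) for match in pattern.finditer(text)]

    return tokenize

tokenize = make_tokenizer([".NET", "C++", "C#"])
print(tokenize("We use C++, C# and .NET daily."))
```

The same whitelist would have to be applied to both the indexed text and the query text, otherwise a search for "C++" would still be reduced to "C" at query time.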
