On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir <rcm...@gmail.com> wrote:
> On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder <jel...@locamoda.com> wrote:
> > Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> > past, we were using a patched version of StandardTokenizer which treated
> > @twitteruser and #hashtag better, but this became a release engineering
> > nightmare so we switched to Whitespace.
>
> in this case, have you considered using a CharFilter (e.g.
> MappingCharFilter) before the tokenizer?
>
> This way you could map your special things such as @ and # to some
> other string that the tokenizer doesn't split on,
> e.g. # => "HASH_".
>
> then your #foobar goes to HASH_foobar.
> If you want searches of "#foobar" to only match "#foobar" and not also
> "foobar" itself, and vice versa, you are done.
> Maybe you want searches of #foobar to only match #foobar, but searches
> of "foobar" to match both "#foobar" and "foobar".
> In this case, you would probably use a WordDelimiterFilter w/
> preserveOriginal at index-time only, followed by a StopFilter
> containing HASH, so you index HASH_foobar and foobar.
>
> anyway i think you have a lot of flexibility to reuse
> StandardTokenizer but customize things like this without maintaining
> your own tokenizer, this is the purpose of CharFilters.
>

That worked brilliantly. Thank you very much, Robert.

--
Jacob Elder
@jelder
(646) 535-3379
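
For reference, the index-time and query-time behaviour Robert describes can be imitated in a few lines of plain Python. This is only a toy sketch, not Lucene or Solr code: `char_filter` stands in for MappingCharFilter, the `\w+` regex stands in for a tokenizer that does not split on underscores, and the split-plus-drop step stands in for a WordDelimiterFilter with preserveOriginal followed by a StopFilter. The `@` => "AT_" mapping and all function names are assumptions made for illustration; only `#` => "HASH_" appears in the thread.

```python
import re

# Hypothetical mapping table; only "#" => "HASH_" comes from the thread,
# "@" => "AT_" is an assumed analogue.
CHAR_MAP = {"#": "HASH_", "@": "AT_"}

# Marker words to drop at index time (plays the role of the StopFilter).
MARKERS = {"HASH", "AT"}


def char_filter(text):
    """Rewrite special characters before tokenizing."""
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)
    return text


def tokenize(text):
    """Toy tokenizer: runs of letters, digits and underscores are one token."""
    return re.findall(r"\w+", text)


def query_tokens(text):
    """Query-time chain: char filter, then tokenizer, nothing else."""
    return tokenize(char_filter(text))


def index_tokens(text):
    """Index-time chain: also emit the sub-words of each token, minus markers."""
    out = []
    for tok in tokenize(char_filter(text)):
        out.append(tok)  # keep the original token (preserveOriginal)
        parts = tok.split("_")
        if len(parts) > 1:
            out.extend(p for p in parts if p and p not in MARKERS)
    return out


if __name__ == "__main__":
    print(index_tokens("love #foobar"))
    print(query_tokens("#foobar"))
    print(query_tokens("foobar"))
```

With this arrangement a query for "#foobar" produces only HASH_foobar, so it cannot match a plain "foobar" in the index, while a query for "foobar" matches both forms because the index holds HASH_foobar and foobar for the hashtag.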