On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir <rcm...@gmail.com> wrote:

> On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder <jel...@locamoda.com> wrote:
> > Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> > past, we were using a patched version of StandardTokenizer which treated
> > @twitteruser and #hashtag better, but this became a release engineering
> > nightmare so we switched to Whitespace.
>
> In this case, have you considered using a CharFilter (e.g.
> MappingCharFilter) before the tokenizer?
>
> This way you could map your special characters such as @ and # to some
> other string that the tokenizer doesn't split on,
> e.g. # => "HASH_".
>
> Then your #foobar becomes HASH_foobar.
> If you want searches of "#foobar" to only match "#foobar" and not also
> "foobar" itself, and vice versa, you are done.
> Maybe you want searches of #foobar to only match #foobar, but searches
> of "foobar" to match both "#foobar" and "foobar".
> In this case, you would probably use a WordDelimiterFilter with
> preserveOriginal at index time only, followed by a StopFilter
> containing HASH, so you index both HASH_foobar and foobar.
>
> Anyway, I think you have a lot of flexibility to reuse
> StandardTokenizer and customize things like this without maintaining
> your own tokenizer; this is the purpose of CharFilters.
>

That worked brilliantly. Thank you very much, Robert.
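
For the archives, here is a minimal plain-Java sketch of the idea Robert
describes. It does not use Lucene or Solr at all: the class and method names
(CharMapSketch, mapChars, tokenize, expandForIndex, analyze) are hypothetical,
the toy tokenizer only stands in for a real one, and the "@" => "AT_" mapping
is an assumption added alongside the "#" => "HASH_" example from the thread.
It only illustrates the order of the steps: map characters first, tokenize
second, and at index time also emit the bare word while dropping the marker.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; none of these names are Lucene or Solr APIs.
class CharMapSketch {

    // "#" => "HASH_" comes from the thread; "@" => "AT_" is an assumed analogue.
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("#", "HASH_");
        MAPPING.put("@", "AT_");
    }

    // Marker words to drop again after splitting (the role of the stop filter).
    static final Set<String> STOP = Set.of("HASH", "AT");

    // Step 1: rewrite characters before tokenizing (the role a char filter plays).
    static String mapChars(String text) {
        String out = text;
        for (Map.Entry<String, String> e : MAPPING.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Step 2: toy tokenizer that keeps letters, digits and '_' together and
    // splits on everything else, so an unmapped '#' or '@' would be lost.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}_]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Step 3 (index time only): keep the original token, also emit its parts
    // split on '_', and skip the parts that are marker words.
    static List<String> expandForIndex(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t);
            if (t.indexOf('_') >= 0) {
                for (String part : t.split("_")) {
                    if (!part.isEmpty() && !STOP.contains(part)) out.add(part);
                }
            }
        }
        return out;
    }

    static List<String> analyze(String text, boolean indexTime) {
        List<String> tokens = tokenize(mapChars(text));
        return indexTime ? expandForIndex(tokens) : tokens;
    }

    public static void main(String[] args) {
        // Index side: prints [AT_jelder, jelder, says, HASH_foobar, foobar]
        System.out.println(analyze("@jelder says #foobar", true));
        // Query side: prints [HASH_foobar], which matches only the hashtag form
        System.out.println(analyze("#foobar", false));
        // Query side: prints [foobar], which matches both indexed forms
        System.out.println(analyze("foobar", false));
    }
}
```

With the index-time expansion switched off on both sides, "#foobar" and
"foobar" would never match each other, which is the first case Robert
mentions.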

-- 
Jacob Elder
@jelder
(646) 535-3379
