Re: Sub-Sequence token filter

Ahmet Arslan Fri, 16 May 2014 11:56:33 -0700

Hi Nitzan,

Can't you do what you described with PathHierarchyTokenizerFactory?


http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizerFactory.html

Ahmet

On Friday, May 16, 2014 5:13 PM, Nitzan Shaked <nitzan.sha...@gmail.com> wrote:
Hi list

I created a small token filter which I'd gladly "contribute", but want to
know if there's any interest in it before I go and make it pretty, add
documentation, etc... ;)

I originally created it to index domain names: I wanted to be able to
search for "google.com" and find "www.google.com", "ads.google.com",
"mail.google.com", etc.

It splits a token (in my case on ".") and then outputs all contiguous
sub-sequences of the parts. So "a.b.c.d" will output "a", "b", "c", "d",
"a.b", "b.c", "c.d", "a.b.c", "b.c.d", and "a.b.c.d". I use it only in the
"index" analyzer, so a query for any of the generated tokens finds the
original token.

It has the following arguments:

sepRegexp: regular expression on which the original token is split
(I use "[.]" for domains)
glue: string used to join sub-sequences back together (I use "." for
domains)
minLen: minimum generated sub-sequence length
maxLen: maximum generated sub-sequence length (0 for unlimited, negative
numbers for token length minus the specified amount)
anchor: "start" to output only prefixes, "end" to output only suffixes, or
"none" to output any sub-sequence
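
To make the arguments concrete, here is a minimal Python sketch of the behaviour described above. It is not the filter itself (that is a Lucene token filter), only the enumeration logic. The function name and the snake_case argument names are hypothetical, and it assumes that minLen and maxLen count parts rather than characters and that minLen defaults to 1.

```python
import re


def sub_sequences(token, sep_regexp="[.]", glue=".",
                  min_len=1, max_len=0, anchor="none"):
    """Return contiguous sub-sequences of the parts of `token`.

    Assumption: lengths are measured in parts (not characters).
    """
    parts = re.split(sep_regexp, token)
    n = len(parts)

    # Resolve max_len: 0 means unlimited, negative means "n minus that amount".
    if max_len == 0:
        upper = n
    elif max_len < 0:
        upper = n + max_len
    else:
        upper = min(max_len, n)
    lower = max(min_len, 1)

    out = []
    # Shortest sub-sequences first, then left to right within each length.
    for length in range(lower, upper + 1):
        if anchor == "start":
            starts = [0]                    # prefixes only
        elif anchor == "end":
            starts = [n - length]           # suffixes only
        elif anchor == "none":
            starts = range(n - length + 1)  # every position
        else:
            raise ValueError("anchor must be 'start', 'end' or 'none'")
        for start in starts:
            out.append(glue.join(parts[start:start + length]))
    return out


print(sub_sequences("a.b.c.d"))
print(sub_sequences("www.google.com", anchor="end"))
```

With anchor set to "end", "www.google.com" yields only "com", "google.com" and "www.google.com", which is enough for the domain-name search case.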

So... is this useful to anyone?
