Doesn't look like it. If I understand it correctly, PathHierarchyTokenizerFactory will only output prefixes. My filter supports suffixes as well, plus the ever-so-useful "unanchored" sub-sequences. Using domains again as an example, with my filter I can query "market.ebay" and find "www.market.ebay.com" (domains completely made up for the sake of this example).
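
To make the difference concrete, here is a rough sketch of the sub-sequence generation in Python. It is not the actual filter code (that is a Lucene token filter); the function name `sub_sequences` is made up for illustration, the arguments mirror the ones described in my original message quoted below, and I am assuming here that minLen/maxLen count parts rather than characters.

```python
import re


def sub_sequences(token, sep_regexp=r"[.]", glue=".", min_len=1, max_len=0, anchor="none"):
    """Return the glued sub-sequences of `token` (hypothetical helper, illustration only)."""
    # Split the token into parts, dropping empty parts (e.g. from a trailing dot).
    parts = [p for p in re.split(sep_regexp, token) if p]
    n = len(parts)

    # Assumption: max_len == 0 means unlimited, a negative value means
    # "number of parts minus that amount", a positive value is a hard cap.
    upper = n + max_len if max_len <= 0 else min(max_len, n)

    out = []
    for length in range(max(min_len, 1), upper + 1):
        for start in range(0, n - length + 1):
            end = start + length
            if anchor == "start" and start != 0:
                continue  # prefixes only
            if anchor == "end" and end != n:
                continue  # suffixes only
            out.append(glue.join(parts[start:end]))
    return out


# Unanchored: "market.ebay" is among the tokens emitted at index time.
print("market.ebay" in sub_sequences("www.market.ebay.com"))

# Anchored at the start: prefixes only, so "market.ebay" is not emitted.
print(sub_sequences("www.market.ebay.com", anchor="start"))
```

With anchor="start" only the prefixes ("www", "www.market", ...) come out, which is why a query for "market.ebay" would not match in that mode.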

On Fri, May 16, 2014 at 7:53 PM, Ahmet Arslan <iori...@yahoo.com> wrote:

> Hi Nitzan,
>
> Can't you do what you described with PathHierarchyTokenizerFactory?
>
> http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizerFactory.html
>
> Ahmet
>
>
> On Friday, May 16, 2014 5:13 PM, Nitzan Shaked <nitzan.sha...@gmail.com> wrote:
> Hi list
>
> I created a small token filter which I'd gladly "contribute", but want to
> know if there's any interest in it before I go and make it pretty, add
> documentation, etc... ;)
>
> I originally created it to index domain names: I wanted to be able to
> search for "google.com" and find "www.google.com" or "ads.google.com",
> "mail.google.com", etc.
>
> What it does is split a token (in my case -- according to "."), and then
> outputs all sub-sequences. So "a.b.c.d" will output "a", "b", "c", "d",
> "a.b", "b.c", "c.d", "a.b.c", "b.c.d", and "a.b.c.d". I use it only in the
> "index" analyzer, and so am able to specify any of the generated tokens to
> find the original token.
>
> It has the following arguments:
>
> sepRegexp: regular expression that the original token will be split
>   according to. (I use "[.]" for domains)
> glue: string that will be used to join sub-sequences back together (I use
>   "." for domains)
> minLen: minimum generated sub-sequence length
> maxLen: maximum generated sub-sequence length (0 for unlimited, negative
>   numbers for token length minus specified amount)
> anchor: "start" to only output prefixes, "end" to only output suffixes, or
>   "none" to output any sub-sequence
>
> So... is this useful to anyone?