URLClassifyProcessorFactory doesn't do anything close to what I need: it analyzes URLs and outputs their broken-down components.
I am going to create a JIRA ticket for my filter, in the hope that someone finds it useful.

On Fri, May 16, 2014 at 10:38 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
> Hi,
>
> I don't have a system that searches on URLs, so I don't fully follow,
> but I remember people using URLClassifyProcessorFactory.
>
> On Friday, May 16, 2014 8:33 PM, Nitzan Shaked <nitzan.sha...@gmail.com> wrote:
> Doesn't look like it. If I understand it correctly,
> PathHierarchyTokenizerFactory will only output prefixes. I support
> suffixes as well, plus the ever-so-useful "unanchored" sub-sequences.
> Using domains again as an example, I can use my suggestion to query
> "market.ebay" and find "www.market.ebay.com" (domains completely made
> up for the sake of this example).
>
> > On Fri, May 16, 2014 at 7:53 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
> > Hi Nitzan,
> >
> > Can't you do what you described with PathHierarchyTokenizerFactory?
> >
> > http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizerFactory.html
> >
> > Ahmet
> >
> > On Friday, May 16, 2014 5:13 PM, Nitzan Shaked <nitzan.sha...@gmail.com> wrote:
> > Hi list,
> >
> > I created a small token filter which I'd gladly "contribute", but
> > want to know if there's any interest in it before I go and make it
> > pretty, add documentation, etc... ;)
> >
> > I originally created it to index domain names: I wanted to be able
> > to search for "google.com" and find "www.google.com",
> > "ads.google.com", "mail.google.com", etc.
> >
> > What it does is split a token (in my case, on "."), and then output
> > all sub-sequences. So "a.b.c.d" will output "a", "b", "c", "d",
> > "a.b", "b.c", "c.d", "a.b.c", "b.c.d", and "a.b.c.d". I use it only
> > in the "index" analyzer, and so am able to specify any of the
> > generated tokens to find the original token.
> >
> > It has the following arguments:
> >
> > sepRegexp: regular expression on which the original token will be
> > split (I use "[.]" for domains)
> > glue: string used to join sub-sequences back together (I use "."
> > for domains)
> > minLen: minimum generated sub-sequence length
> > maxLen: maximum generated sub-sequence length (0 for unlimited;
> > negative numbers mean token length minus the specified amount)
> > anchor: "start" to only output prefixes, "end" to only output
> > suffixes, or "none" to output any sub-sequence
> >
> > So... is this useful to anyone?
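For the curious, here is a rough standalone sketch in Python of what I understand the filter described above to do. This is an illustration only, not the actual Lucene TokenFilter code; I'm reading "length" as the number of split parts, and the semantics of maxLen and anchor are my interpretation of the argument list in the quoted message.

```python
import re

def subsequences(token, sep_regexp=r"[.]", glue=".",
                 min_len=1, max_len=0, anchor="none"):
    """Illustrative model of the proposed filter.

    Splits `token` on `sep_regexp`, then emits every contiguous run of
    parts joined with `glue`. `min_len`/`max_len` bound the run length
    in parts (max_len=0 means unlimited; a negative max_len means
    len(parts) minus that amount). `anchor` restricts output to
    prefixes ("start"), suffixes ("end"), or any run ("none")."""
    parts = re.split(sep_regexp, token)
    n = len(parts)
    hi = n if max_len == 0 else (n + max_len if max_len < 0 else max_len)
    out = []
    for length in range(min_len, min(hi, n) + 1):
        for start in range(n - length + 1):
            if anchor == "start" and start != 0:
                continue
            if anchor == "end" and start + length != n:
                continue
            out.append(glue.join(parts[start:start + length]))
    return out

print(subsequences("a.b.c.d"))
# → ['a', 'b', 'c', 'd', 'a.b', 'b.c', 'c.d', 'a.b.c', 'b.c.d', 'a.b.c.d']
print(subsequences("a.b.c.d", anchor="start"))
# → ['a', 'a.b', 'a.b.c', 'a.b.c.d']
```

With this expansion applied at index time and a plain keyword analyzer at query time, a query for "market.ebay" matches a document indexed with "www.market.ebay.com", since "market.ebay" is among the emitted tokens.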