URLClassifyProcessorFactory doesn't do anything close to what I need: it analyzes URLs and outputs their broken-down components.
I am going to create a JIRA ticket for my filter, in the hope that someone finds it useful.

On Fri, May 16, 2014 at 10:38 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
> Hi,
>
> I don't have a system that searches on URLs, so I don't fully follow,
> but I remember people using URLClassifyProcessorFactory.
>
> On Friday, May 16, 2014 8:33 PM, Nitzan Shaked <nitzan.sha...@gmail.com> wrote:
> Doesn't look like it. If I understand it correctly,
> PathHierarchyTokenizerFactory will only output prefixes. I support
> suffixes as well, plus the ever-so-useful "unanchored" sub-sequences.
> Using domains again as an example, I can use my suggestion to query
> "market.ebay" and find "www.market.ebay.com" (domains completely made
> up for the sake of this example).
>
> > On Fri, May 16, 2014 at 7:53 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
> > Hi Nitzan,
> >
> > Can't you do what you described with PathHierarchyTokenizerFactory?
> >
> > http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizerFactory.html
> >
> > Ahmet
> >
> > On Friday, May 16, 2014 5:13 PM, Nitzan Shaked <nitzan.sha...@gmail.com> wrote:
> > Hi list,
> >
> > I created a small token filter which I'd gladly "contribute", but
> > want to know if there's any interest in it before I go and make it
> > pretty, add documentation, etc... ;)
> >
> > I originally created it to index domain names: I wanted to be able
> > to search for "google.com" and find "www.google.com",
> > "ads.google.com", "mail.google.com", etc.
> >
> > What it does is split a token (in my case, on "."), and then output
> > all sub-sequences. So "a.b.c.d" will output "a", "b", "c", "d",
> > "a.b", "b.c", "c.d", "a.b.c", "b.c.d", and "a.b.c.d". I use it only
> > in the "index" analyzer, and so am able to specify any of the
> > generated tokens to find the original token.
> >
> > It has the following arguments:
> >
> > sepRegexp: regular expression on which the original token will be
> > split (I use "[.]" for domains)
> > glue: string used to join sub-sequences back together (I use "."
> > for domains)
> > minLen: minimum generated sub-sequence length
> > maxLen: maximum generated sub-sequence length (0 for unlimited;
> > negative numbers mean token length minus the specified amount)
> > anchor: "start" to only output prefixes, "end" to only output
> > suffixes, or "none" to output any sub-sequence
> >
> > So... is this useful to anyone?
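For the curious, here is a rough standalone sketch in Python of what I understand the filter described above to do. This is an illustration only, not the actual Lucene TokenFilter code; I'm reading "length" as the number of split parts, and the semantics of maxLen and anchor are my interpretation of the argument list in the quoted message.

```python
import re

def subsequences(token, sep_regexp=r"[.]", glue=".",
                 min_len=1, max_len=0, anchor="none"):
    """Illustrative model of the proposed filter.

    Splits `token` on `sep_regexp`, then emits every contiguous run of
    parts joined with `glue`. `min_len`/`max_len` bound the run length
    in parts (max_len=0 means unlimited; a negative max_len means
    len(parts) minus that amount). `anchor` restricts output to
    prefixes ("start"), suffixes ("end"), or any run ("none")."""
    parts = re.split(sep_regexp, token)
    n = len(parts)
    hi = n if max_len == 0 else (n + max_len if max_len < 0 else max_len)
    out = []
    for length in range(min_len, min(hi, n) + 1):
        for start in range(n - length + 1):
            if anchor == "start" and start != 0:
                continue
            if anchor == "end" and start + length != n:
                continue
            out.append(glue.join(parts[start:start + length]))
    return out

print(subsequences("a.b.c.d"))
# → ['a', 'b', 'c', 'd', 'a.b', 'b.c', 'c.d', 'a.b.c', 'b.c.d', 'a.b.c.d']
print(subsequences("a.b.c.d", anchor="start"))
# → ['a', 'a.b', 'a.b.c', 'a.b.c.d']
```

With this expansion applied at index time and a plain keyword analyzer at query time, a query for "market.ebay" matches a document indexed with "www.market.ebay.com", since "market.ebay" is among the emitted tokens.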