UAX29URLEmailTokenizer recognizes URLs (among other things) - you could start 
with its JFlex grammar and modify it to do what you want.
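
As a rough illustration of the target output only (this is plain Java, not a
Lucene tokenizer and not the JFlex grammar itself), the token set listed in the
quoted message below could be produced with java.net.URI. The class and method
names here are hypothetical, the sketch assumes a hierarchical URL with a host,
and the position of the port token is an assumption, since the message does not
say where it should go.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class UrlTokenSketch {

    /** Returns hierarchical prefixes, path segments, optional port, query terms, and fragment. */
    public static List<String> tokenize(String url) {
        URI uri = URI.create(url); // assumes a well-formed URL with scheme and authority
        List<String> tokens = new ArrayList<>();

        // Hierarchical prefixes: the scheme, then scheme + authority.
        String prefix = uri.getScheme() + "://";
        tokens.add(prefix);
        prefix += uri.getRawAuthority() + "/";
        tokens.add(prefix);

        // Collect the non-empty path segments.
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        List<String> segments = new ArrayList<>();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }

        // One more prefix per segment; the last one gets a trailing slash
        // only if the original path ended with one.
        for (int i = 0; i < segments.size(); i++) {
            boolean last = (i == segments.size() - 1);
            prefix += segments.get(i) + (last && !path.endsWith("/") ? "" : "/");
            tokens.add(prefix);
        }

        // Individual path elements.
        tokens.addAll(segments);

        // Separate token for the port, if one is given (placement is an assumption).
        if (uri.getPort() != -1) {
            tokens.add(String.valueOf(uri.getPort()));
        }

        // Query terms, split on '&'.
        String query = uri.getRawQuery();
        if (query != null) {
            for (String term : query.split("&")) {
                if (!term.isEmpty()) {
                    tokens.add(term);
                }
            }
        }

        // Fragment, if present.
        String fragment = uri.getRawFragment();
        if (fragment != null && !fragment.isEmpty()) {
            tokens.add(fragment);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String url = "http://www.google.com/abcd/efgh/ijkl/mnop.php?a=10&b=20&c=30#xyz";
        for (String token : tokenize(url)) {
            System.out.println(token);
        }
    }
}
```

Inside Lucene/Solr the same splitting logic would have to live in a custom
Tokenizer or TokenFilter that also sets offsets and position increments; the
sketch above leaves that part out.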

Steve
www.lucidworks.com
 
On Aug 21, 2014, at 8:35 AM, Sathyam <sathyam.dorasw...@gmail.com> wrote:

> Hi,
> 
> I needed to generate tokens out of a URL such that I am able to get
> hierarchical units of the URL as well as each individual entity as tokens.
> For example:
> *Given a URL:*
> 
> http://www.google.com/abcd/efgh/ijkl/mnop.php?a=10&b=20&c=30#xyz
> 
> The tokens that I need are:
> 
> *Hierarchical subsets of the URL*
> 
> 1 http://
> 
> 2 http://www.google.com/
> 
> 3 http://www.google.com/abcd/
> 
> 4 http://www.google.com/abcd/efgh/
> 
> 5 http://www.google.com/abcd/efgh/ijkl/
> 
> 6 http://www.google.com/abcd/efgh/ijkl/mnop.php
> 
> *Individual elements in the path to the resource*
> 
> 7 abcd
> 
> 8 efgh
> 
> 9 ijkl
> 
> 10 mnop.php
> 
> *Query Terms*
> 
> 11 a=10
> 
> 12 b=20
> 
> 13 c=30
> 
> *Fragment*
> 14 xyz
> 
> This comes to a total of 14 tokens for the given URL.
> Basically, I need a URL analyzer that creates tokens for each of the
> categories shown in bold, plus a separate token for the port (if present).
> 
> I would like to know how this can be achieved with a single analyzer
> that combines the tokenizers and filters provided by Solr.
> I am also curious why an analyzer is restricted to only *one* tokenizer.
> I look forward to a response describing the best way to get as close as
> possible to what I need.
> 
> Thanks.
> -- 
> Sathyam Doraswamy
