Sorry for the delay... take a look at the URL Classify update processor,
which parses a URL and distributes the components to various fields:
http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessorFactory.html
http://lucene.apache.org/solr/4_9_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html
The official doc is... pitiful, but I have doc and examples in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
-- Jack Krupansky
-----Original Message-----
From: Sathyam
Sent: Thursday, August 28, 2014 6:21 AM
To: solr-user@lucene.apache.org
Subject: Re: Query regarding URL Analysers
Gentle Reminder
On 21 August 2014 18:05, Sathyam <sathyam.dorasw...@gmail.com> wrote:
Hi,
I needed to generate tokens out of a URL such that I am able to get
hierarchical units of the URL as well as each individual entity as tokens.
For example:
*Given a URL:*
http://www.google.com/abcd/efgh/ijkl/mnop.php?a=10&b=20&c=30#xyz
The tokens that I need are:
*Hierarchical subsets of the URL*
1 http://
2 http://www.google.com/
3 http://www.google.com/abcd/
4 http://www.google.com/abcd/efgh/
5 http://www.google.com/abcd/efgh/ijkl/
6 http://www.google.com/abcd/efgh/ijkl/mnop.php
*Individual elements in the path to the resource*
7 abcd
8 efgh
9 ijkl
10 mnop.php
*Query Terms*
11 a=10
12 b=20
13 c=30
*Fragment*
14 xyz
This comes to a total of 14 tokens for the given URL.
Basically, I need a URL analyzer that creates tokens for the categories
mentioned in bold, plus a separate token for the port (if mentioned).
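To make the requirement concrete, here is a rough sketch in plain Python (standard library only, not Solr code) that produces the token list above for the example URL. The function name url_tokens is made up for illustration, and treating the last path segment as a resource rather than a directory is an assumption based on the example.

```python
from urllib.parse import urlsplit


def url_tokens(url):
    """Return hierarchical prefixes, path segments, query terms,
    the fragment, and the port (if present) for a URL."""
    parts = urlsplit(url)
    tokens = []

    # Hierarchical prefixes: scheme first, then scheme + host.
    prefix = parts.scheme + "://"
    tokens.append(prefix)
    prefix += parts.netloc + "/"
    tokens.append(prefix)

    # One prefix per path level. Directories keep a trailing slash;
    # the final segment does not, unless the path itself ends in "/".
    segments = [s for s in parts.path.split("/") if s]
    for i, segment in enumerate(segments):
        is_last = i == len(segments) - 1
        is_resource = is_last and not parts.path.endswith("/")
        prefix += segment if is_resource else segment + "/"
        tokens.append(prefix)

    # Individual path elements.
    tokens.extend(segments)

    # Query terms, kept as raw key=value strings.
    tokens.extend(term for term in parts.query.split("&") if term)

    # Fragment, if any.
    if parts.fragment:
        tokens.append(parts.fragment)

    # Separate token for the port, only when the URL states one.
    if parts.port is not None:
        tokens.append(str(parts.port))

    return tokens


if __name__ == "__main__":
    example = "http://www.google.com/abcd/efgh/ijkl/mnop.php?a=10&b=20&c=30#xyz"
    for number, token in enumerate(url_tokens(example), start=1):
        print(number, token)
```

Note that when a port is given, this sketch also leaves it inside the hierarchical prefixes (as part of the host), in addition to emitting it as its own token.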
I would like to know how this can be achieved by using a single analyzer
that uses a combination of the tokenizers and filters provided by Solr.
I am also curious why an analyzer is restricted to only *one* tokenizer.
Looking forward to a response suggesting the best way to get as close as
possible to what I need.
Thanks.
--
Sathyam Doraswamy