Hi,

I am new to Solr and I guess this is a basic tokenizer question, so please
bear with me.

I am trying to use Solr to index a few (Indian) legal judgments in text
form and search against them. One of the key points with these documents is
that the sections/provisions of law usually have punctuation/special
characters in them. For example, search queries will typically be for
section 12AA, section 80-IA, or section 9(1)(vii), and the text of the
judgments themselves contains this sort of section reference all over the
place.

Now, using the default schema setup with StandardTokenizer, which seems to
delimit on whitespace AND punctuation, I get really bad results: it looks
like 12AA gets split, so documents that merely contain 12 and AA turn up.
It is even worse with 9(1)(vii), where documents containing just 9 or 1 are
returned.
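
For reference, this is roughly the field type I am using now, essentially
the stock text_general definition from the example schema.xml (reproduced
from memory, so the details may not match my copy exactly):

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- StandardTokenizer splits on punctuation as well as whitespace -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>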

What is the best solution here? I really just want to index the documents
as-is and tokenize on whitespace alone, at both index and query time, and
nothing more.

So in other words:
a) I would like the text documents to be indexed as-is, with, say, 12AA and
9(1)(vii) stored exactly as they appear in the document.
b) I would like to be able to search for 12AA and for 9(1)(vii) and get
proper full matches on them without any splitting/munging (a sketch of what
I had in mind follows below).
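
Here is a sketch of the whitespace-only field type I was thinking of
trying, based on my reading of the docs (the text_legal and judgment_text
names are just mine):

  <fieldType name="text_legal" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- split on whitespace only, so tokens like 12AA and 9(1)(vii) stay intact -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- lowercase so Section/section still match; nothing else is altered -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="judgment_text" type="text_legal" indexed="true" stored="true"/>

Would that be enough? One thing I am unsure about is trailing punctuation:
with whitespace-only tokenizing, a reference at the end of a sentence would
be indexed as "12AA." (full stop attached) and might not match a query for
12AA.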

Any suggestions are appreciated.  Thank you for your time.

Thanks
Vulcanoid
