Have you tried a WhitespaceTokenizerFactory followed by the WordDelimiterFilterFactory? The latter gives you fairly fine-grained control over how tokens are split and recombined. Alternatively, you could use a PatternReplaceFilterFactory (or PatternReplaceCharFilterFactory) to strip extraneous punctuation that the whitespace tokenizer leaves attached to tokens.
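
As a rough, untested sketch (the field type name is just a placeholder, and the WordDelimiterFilterFactory flags are a starting point you would need to tune against your own data):

  <fieldType name="text_legal" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- split only on whitespace, so 12AA and 9(1)(vii) stay whole -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- keep the original token alongside any split/joined variants -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateAll="1"
              splitOnNumerics="0"
              preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

The idea being that preserveOriginal="1" should keep 12AA and 9(1)(vii) indexed as single tokens, while the other flags still let looser queries match. The Analysis page in the Solr admin UI is the easiest way to check what each combination actually does to your sample text.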
Upayavira

On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> Hi,
>
> I am new to solr and I guess this is a basic tokenizer question so please
> bear with me.
>
> I am trying to use SOLR to index a few (Indian) legal judgments in text
> form and search against them. One of the key points with these documents
> is that the sections/provisions of law usually have punctuation/special
> characters in them. For example, search queries will TYPICALLY be section
> 12AA, section 80-IA, section 9(1)(vii), and the text of the judgments
> themselves will contain this sort of text with section references all
> over the place.
>
> Now, using a default schema setup with standardtokenizer, which seems to
> delimit on whitespace AND punctuation, I get really bad results because
> it looks like 12AA is split, and results that have 12 and AA in them turn
> up. It becomes worse with 9(1)(vii), with results containing 9 and 1 etc.
> being turned up.
>
> What is the best solution here? I really just want to index the document
> as-is and also to do whitespace tokenizing on the search and nothing
> more.
>
> So in other words:
> a) I would like the text document to be indexed as-is, with say 12AA and
> 9(1)(vii) stored in the document as they are mentioned.
> b) I would like to be able to search for 12AA and for 9(1)(vii) and get
> proper full matches on them without any splitting up/munging etc.
>
> Any suggestions are appreciated. Thank you for your time.
>
> Thanks
> Vulcanoid