Hi, I am new to Solr, and I guess this is a basic tokenizer question, so please bear with me.
I am trying to use Solr to index a few (Indian) legal judgments in text form and to search against them. One of the key points with these documents is that the sections/provisions of law usually have punctuation/special characters in them. For example, search queries will TYPICALLY be things like section 12AA, section 80-IA, or section 9(1)(vii), and the text of the judgments themselves contains section references like these all over the place.

Now, using a default schema setup with StandardTokenizer, which seems to delimit on whitespace AND punctuation, I get really bad results: it looks like 12AA is split, so documents that merely contain 12 and AA turn up. It gets even worse with 9(1)(vii), where documents containing just 9 and 1 are returned.

What is the best solution here? I really just want to index the documents as-is and do nothing more than whitespace tokenizing at search time. In other words:

a) I would like the text to be indexed as-is, with, say, 12AA and 9(1)(vii) stored exactly as they appear in the document.

b) I would like to be able to search for 12AA and for 9(1)(vii) and get proper full matches on them, without any splitting up/munging.

Any suggestions are appreciated. Thank you for your time.

Thanks,
Vulcanoid
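P.S. In case it clarifies what I am after, this is the kind of fieldType I was imagining after reading the analysis wiki page. It is only a sketch of my current understanding, not something I have verified works, and the names "text_verbatim" and "judgment_text" are just names I made up for illustration:

    <fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- split on whitespace only, so tokens like 12AA and 9(1)(vii) stay intact -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- lowercase so that 12aa still matches 12AA; with a single <analyzer>
             element this applies at both index and query time -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="judgment_text" type="text_verbatim" indexed="true" stored="true"/>

Would something like this work, or does whitespace-only tokenizing cause other problems, e.g. trailing punctuation such as the comma in "section 12AA," sticking to the token and breaking the match?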