Have you tried a WhitespaceTokenizerFactory followed by the WordDelimiterFilterFactory? The latter gives you fairly fine-grained control over how tokens are split and recombined. Alternatively, you could use a PatternReplaceFilterFactory (or PatternReplaceCharFilterFactory) to strip extraneous punctuation that the whitespace tokenizer leaves attached to tokens.
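
As a rough, untested sketch (the field type name is just a placeholder, and the WordDelimiterFilterFactory flags are a starting point you would need to tune against your own data):

  <fieldType name="text_legal" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- split only on whitespace, so 12AA and 9(1)(vii) stay whole -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- keep the original token alongside any split/joined variants -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateAll="1"
              splitOnNumerics="0"
              preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

The idea being that preserveOriginal="1" should keep 12AA and 9(1)(vii) indexed as single tokens, while the other flags still let looser queries match. The Analysis page in the Solr admin UI is the easiest way to check what each combination actually does to your sample text.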
Upayavira

On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> Hi,
>
> I am new to solr and I guess this is a basic tokenizer question so please
> bear with me.
>
> I am trying to use SOLR to index a few (Indian) legal judgments in text
> form and search against them. One of the key points with these documents
> is that the sections/provisions of law usually have punctuation/special
> characters in them. For example, search queries will TYPICALLY be section
> 12AA, section 80-IA, section 9(1)(vii), and the text of the judgments
> themselves will contain this sort of text with section references all
> over the place.
>
> Now, using a default schema setup with standardtokenizer, which seems to
> delimit on whitespace AND punctuation, I get really bad results because
> it looks like 12AA is split, and results that have 12 and AA in them turn
> up. It becomes worse with 9(1)(vii), with results containing 9 and 1 etc.
> being turned up.
>
> What is the best solution here? I really just want to index the document
> as-is and also to do whitespace tokenizing on the search and nothing
> more.
>
> So in other words:
> a) I would like the text document to be indexed as-is, with say 12AA and
> 9(1)(vii) stored in the document as they are mentioned.
> b) I would like to be able to search for 12AA and for 9(1)(vii) and get
> proper full matches on them without any splitting up/munging etc.
>
> Any suggestions are appreciated. Thank you for your time.
>
> Thanks
> Vulcanoid