If you want to just split on whitespace, then the WhitespaceTokenizer will do the job.
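For example, a minimal field type along those lines in schema.xml could look something like the following (untested sketch; the type and field names are just examples, not taken from this thread):

<fieldType name="text_ws_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only; punctuation stays attached to the tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- example field using it; "judgment_text" is a made-up name -->
<field name="judgment_text" type="text_ws_only" indexed="true" stored="true"/>

With that, 12AA and 9(1)(vii) are indexed exactly as they appear, as long as they are surrounded by whitespace.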
However, this will mean that these two tokens aren't the same, and won't
match each other: cat and cat. (the second one has picked up the trailing
full stop). A simple regex filter could handle those cases by removing a
comma or dot at the end of a word (a concrete sketch follows the quoted
thread at the end of this message). There are other similar situations
(quotes, colons, etc) that you may want to handle eventually.

Upayavira

On Sun, Dec 8, 2013, at 11:51 AM, Vulcanoid Developer wrote:
> Thanks for your email.
>
> Great, I will look at the WordDelimiterFactory. Just to make clear, I
> DON'T want any other tokenizing on digits, special chars, punctuation
> etc done other than word delimiting on whitespace.
>
> All I want for my first version is NO removal of punctuation/special
> characters at indexing time and during search time, i.e. input as-is and
> search as-is (like a simple SQL db?). I was assuming this would be a
> trivial case with SOLR and am not sure what I am missing here.
>
> thanks
> Vulcanoid
>
>
> On Sun, Dec 8, 2013 at 4:33 AM, Upayavira <u...@odoko.co.uk> wrote:
> >
> > Have you tried a WhitespaceTokenizerFactory followed by the
> > WordDelimiterFilterFactory? The latter is perhaps more configurable in
> > what it does. Alternatively, you could use a RegexFilterFactory to
> > remove extraneous punctuation that wasn't removed by the Whitespace
> > Tokenizer.
> >
> > Upayavira
> >
> > On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> > > Hi,
> > >
> > > I am new to Solr and I guess this is a basic tokenizer question, so
> > > please bear with me.
> > >
> > > I am trying to use SOLR to index a few (Indian) legal judgments in
> > > text form and search against them. One of the key points with these
> > > documents is that the sections/provisions of law usually have
> > > punctuation/special characters in them. For example, search queries
> > > will TYPICALLY be section 12AA, section 80-IA, section 9(1)(vii), and
> > > the text of the judgments themselves will contain this sort of text,
> > > with section references all over the place.
> > >
> > > Now, using a default schema setup with the StandardTokenizer, which
> > > seems to delimit on whitespace AND punctuation, I get really bad
> > > results, because it looks like 12AA is split and results having 12
> > > and AA in them turn up. It becomes worse with 9(1)(vii), with results
> > > containing 9 and 1 etc being turned up.
> > >
> > > What is the best solution here? I really just want to index the
> > > document as-is and also to do whitespace tokenizing on the search and
> > > nothing more.
> > >
> > > So in other words:
> > > a) I would like the text document to be indexed as-is, with say 12AA
> > > and 9(1)(vii) stored in the document as they are mentioned.
> > > b) I would like to be able to search for 12AA and for 9(1)(vii) and
> > > get proper full matches on them without any splitting up/munging etc.
> > >
> > > Any suggestions are appreciated. Thank you for your time.
> > >
> > > Thanks
> > > Vulcanoid
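To make the trailing-punctuation point above concrete: the "simple regex filter" would typically be a PatternReplaceFilterFactory (the class Solr actually ships; the "RegexFilterFactory" in the quoted thread presumably refers to the same idea). An untested sketch, covering only the comma/dot case discussed above; the field type name is again just an example:

<fieldType name="text_ws_clean" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- strip a trailing comma or full stop so "cat." matches "cat";
         tokens such as 12AA and 9(1)(vii) are left untouched -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[.,]+$"
            replacement=""
            replace="all"/>
  </analyzer>
</fieldType>

Extending the character class (for example to quotes and colons) would cover the other situations mentioned, and the same analyzer should be used at both index and query time so the two sides agree.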
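The quoted thread also suggests WordDelimiterFilterFactory as the more configurable option. For references like 12AA or 9(1)(vii), the relevant point is to switch off its splitting behaviour, or at least keep the original token. An untested sketch of such a filter line, with the parameter values chosen purely to illustrate that:

<!-- all splitting/catenation off; the original token is always kept,
     so 12AA and 9(1)(vii) pass through unchanged -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="0"
        generateNumberParts="0"
        splitOnCaseChange="0"
        splitOnNumerics="0"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        preserveOriginal="1"/>

Configured like this it is close to a pass-through, which matches the "index as-is" requirement; turning individual options back on later is how looser matching could be reintroduced.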