Re: simple tokenizer question

2013-12-08 Thread Josh Lincoln
Have you tried adding autoGeneratePhraseQueries="true" to the fieldType, without changing the index analysis behavior? This works at query time only, and will convert 12-34 to "12 34", as if the user had entered the query as a phrase. This gives the expected behavior as long as the tokenization is the same.
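A minimal sketch of what such a field type might look like. The field type name and the analyzer chain below are illustrative assumptions, not taken from the thread; only the autoGeneratePhraseQueries attribute itself is what the message describes:

```xml
<!-- Hypothetical field type for illustration. With
     autoGeneratePhraseQueries="true", a query term that the analyzer
     splits into multiple tokens (e.g. 12-34 -> "12", "34") is searched
     as the phrase "12 34" instead of as two independent terms. -->
<fieldType name="text_legal" class="solr.TextField"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- the splitting filter is an assumption; any filter that splits
         12-34 into multiple tokens triggers the phrase behavior -->
    <filter class="solr.WordDelimiterFilterFactory"/>
  </analyzer>
</fieldType>
```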

Re: simple tokenizer question

2013-12-08 Thread Upayavira
If you want to just split on whitespace, then the WhitespaceTokenizer will do the job. However, this will mean that these two tokens aren't the same, and won't match each other: "cat." and "cat". A simple regex filter could handle those cases, removing a comma or dot at the end of a word. Although the
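One way to sketch this suggestion, assuming Solr's PatternReplaceFilterFactory as the "simple regex filter" (the field type name is made up for illustration):

```xml
<!-- Hypothetical field type: whitespace tokenization, then strip a
     trailing comma or period from each token, so "cat." and "cat"
     produce the same term. -->
<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[.,]+$" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

Applied at both index and query time, this keeps the rest of each token intact while normalizing only trailing punctuation.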

Re: simple tokenizer question

2013-12-08 Thread Vulcanoid Developer
Thanks for your email. Great, I will look at the WordDelimiterFilterFactory. Just to be clear, I DON'T want any tokenizing on digits, special characters, punctuation etc. done other than word delimiting on whitespace. All I want for my first version is NO removal of punctuation/special characters at

Re: simple tokenizer question

2013-12-07 Thread Upayavira
Have you tried a WhitespaceTokenizerFactory followed by a WordDelimiterFilterFactory? The latter is perhaps more configurable in what it does. Alternatively, you could use a PatternReplaceFilterFactory to remove extraneous punctuation that wasn't removed by the whitespace tokenizer. Upayavira On Sat, De
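A sketch of the suggested chain. The field type name and the specific WordDelimiterFilterFactory parameters shown are assumptions chosen to fit the questioner's goal of keeping tokens like section references searchable; they are not from the message itself:

```xml
<!-- Hypothetical analyzer chain: split only on whitespace, then let
     WordDelimiterFilterFactory optionally split on punctuation/digit
     boundaries while keeping the original token. -->
<fieldType name="text_custom" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

With preserveOriginal="1", a token such as 12(34) would be indexed both as written and as its split parts, so exact section citations and loose matches can both work.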

simple tokenizer question

2013-12-07 Thread Vulcanoid Developer
Hi, I am new to Solr and I guess this is a basic tokenizer question, so please bear with me. I am trying to use Solr to index a few (Indian) legal judgments in text form and search against them. One of the key points with these documents is that the sections/provisions of law usually have punctuat