Have you tried adding autoGeneratePhraseQueries=true to the fieldType,
without changing the index analysis behavior?
This works at query time only, and will convert 12-34 to "12 34", as if the
user had entered the query as a phrase. This gives the expected behavior as
long as the tokenization is the same.
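As a sketch, the attribute goes on the fieldType element itself (the field type name and the rest of the definition here are illustrative, not from the original thread):

```xml
<!-- autoGeneratePhraseQueries affects query parsing only, not indexing.
     Everything except the attribute itself is an assumed example. -->
<fieldType name="text_legal" class="solr.TextField"
           autoGeneratePhraseQueries="true"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```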
If you want to just split on whitespace, then the WhitespaceTokenizer
will do the job.
However, this will mean that these two tokens aren't the same, and won't
match each other:
cat
cat.
A simple regex filter could handle those cases, removing a comma or dot
at the end of a word.
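One way to express that in a schema, as a sketch: Solr's stock regex replacement filter is solr.PatternReplaceFilterFactory, and a pattern anchored to the end of the token can strip trailing dots and commas (the exact pattern here is an assumption; widen it if you need to cover more punctuation):

```xml
<!-- Strips trailing . and , so that "cat." and "cat" index to the same token.
     The pattern is an illustrative assumption, not from the original thread. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="[.,]+$"
        replacement=""/>
```

Placed after the tokenizer in both the index and query analyzer chains, this keeps the two forms matching without changing how the text is split.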
Thanks for your email.
Great, I will look at the WordDelimiterFilterFactory. Just to be clear, I
DON'T want any other tokenizing done on digits, special chars, punctuation,
etc. other than word delimiting on whitespace.
All I want for my first version is NO removal of punctuation/special
characters.
Have you tried a WhitespaceTokenizerFactory followed by the
WordDelimiterFilterFactory? The latter is perhaps more configurable in
what it does. Alternatively, you could use a RegexFilterFactory to
remove extraneous punctuation that wasn't removed by the
WhitespaceTokenizer.
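A minimal sketch of that chain, assuming you want whitespace splitting with the original tokens preserved (all attribute values below are assumptions to tune against your data, not a recommended configuration):

```xml
<!-- Whitespace tokenizer + WordDelimiterFilterFactory sketch.
     With all generate/catenate options off and preserveOriginal on,
     tokens such as 12-34 pass through unsplit. -->
<fieldType name="text_ws_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="0"
            splitOnCaseChange="0"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

Flipping generateWordParts/generateNumberParts to 1 would additionally index the split parts (12 and 34), which is the knob to reach for if partial matches are wanted later.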
Upayavira
Hi,
I am new to Solr, and I guess this is a basic tokenizer question, so please
bear with me.
I am trying to use Solr to index a few (Indian) legal judgments in text
form and search against them. One of the key points with these documents is
that the sections/provisions of law usually have punctuation