If you want to just split on whitespace, then the WhitespaceTokenizer
will do the job.

However, this will mean that these two tokens aren't the same, and won't
match each other:

cat
cat.

A simple regex filter could handle those cases by removing a comma or dot
at the end of a word, although there are other similar situations
(quotes, colons, etc.) that you may want to handle eventually.
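
Something along these lines in schema.xml might be a starting point (the
field type name and the regex are just illustrative and untested; it only
strips trailing dots/commas/colons/semicolons, so extend it as needed):

  <fieldType name="text_ws_punct" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- split on whitespace only, so 12AA and 9(1)(vii) stay intact -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- strip trailing punctuation from each token -->
      <filter class="solr.PatternReplaceFilterFactory"
              pattern="[.,;:]+$" replacement="" replace="all"/>
    </analyzer>
  </fieldType>

Since the same analyzer is applied at index and at query time, a search
for 9(1)(vii) should match exactly what was indexed.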

Upayavira

On Sun, Dec 8, 2013, at 11:51 AM, Vulcanoid Developer wrote:
> Thanks for your email.
> 
> Great, I will look at the WordDelimiterFilterFactory. Just to make clear, I
> DON'T want any other tokenizing on digits, special chars, punctuation, etc.
> done other than word delimiting on whitespace.
> 
> All I want for my first version is NO removal of punctuation/special
> characters at indexing time or during search time, i.e., input as-is and
> search as-is (like a simple SQL db?). I was assuming this would be a
> trivial case with SOLR and am not sure what I am missing here.
> 
> thanks
> Vulcanoid
> 
> 
> 
> On Sun, Dec 8, 2013 at 4:33 AM, Upayavira <u...@odoko.co.uk> wrote:
> 
> > Have you tried a WhitespaceTokenizerFactory followed by the
> > WordDelimiterFilterFactory? The latter is perhaps more configurable in
> > what it does. Alternatively, you could use a regex filter to remove
> > extraneous punctuation that wasn't removed by the WhitespaceTokenizer.
> >
> > Upayavira
> >
> > On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> > > Hi,
> > >
> > > I am new to solr and I guess this is a basic tokenizer question so please
> > > bear with me.
> > >
> > > I am trying to use SOLR to index a few (Indian) legal judgments in text
> > > form and search against them. One of the key points with these documents
> > > is that the sections/provisions of law usually have punctuation/special
> > > characters in them. For example, search queries will TYPICALLY be section
> > > 12AA, section 80-IA, section 9(1)(vii), and the text of the judgments
> > > themselves will contain this sort of text with section references all
> > > over the place.
> > >
> > > Now, using a default schema setup with the StandardTokenizer, which seems
> > > to delimit on whitespace AND punctuation, I get really bad results because
> > > it looks like 12AA is split and results that have 12 and AA in them turn
> > > up. It becomes worse with 9(1)(vii), with results containing 9 and 1 etc.
> > > being turned up.
> > >
> > > What is the best solution here? I really just want to index the document
> > > as-is and also to do whitespace tokenizing on the search and nothing
> > > more.
> > >
> > > So in other words:
> > > a) I would like the text document to be indexed as-is with say 12AA and
> > > 9(1)(vii) in the document stored as it is mentioned.
> > > b) I would like to be able to search for 12AA and for 9(1)(vii) and get
> > > proper full matches on them without any splitting up/munging etc.
> > >
> > > Any suggestions are appreciated.  Thank you for your time.
> > >
> > > Thanks
> > > Vulcanoid
> >
