This pattern splits tokens *only* where whitespace adjoins a parenthesis,
and keeps the parentheses with the tokens:

    (?<=\))\s+|\s+(?=\()

So you'll get this kind of behavior:

   Tottenham Hotspur (London)
   F.C. Internationale (milan)
   FC Midtjylland (Herning) (Ikast)

to

   Tottenham Hotspur
   (London)
   F.C. Internationale
   (milan)
   FC Midtjylland
   (Herning)
   (Ikast)
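
If you want to sanity-check the pattern outside Solr, here is a minimal
sketch. It uses Python's re module rather than the Java regex engine that
PatternTokenizerFactory uses, so it assumes the two behave the same for this
pattern (they should, since it only uses \s, a fixed-width lookbehind and a
lookahead). The helper name split_on_parens is made up for illustration.

```python
import re

# Same pattern as above: split on whitespace that follows ")" or precedes "(".
PATTERN = re.compile(r"(?<=\))\s+|\s+(?=\()")


def split_on_parens(text):
    """Hypothetical helper: split text the way the pattern would, dropping empty pieces."""
    return [token for token in PATTERN.split(text) if token]


if __name__ == "__main__":
    samples = [
        "Tottenham Hotspur (London)",
        "F.C. Internationale (milan)",
        "FC Midtjylland (Herning) (Ikast)",
        "blah ( stuff )",
    ]
    for sample in samples:
        print(split_on_parens(sample))
```

Note that "blah ( stuff )" comes out as "blah" and "( stuff )": whitespace
inside the parentheses is left alone, because it neither follows ")" nor
precedes "(".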

Steve
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, April 15, 2011 1:51 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Split token
> 
> What you've shown would be handled with WhitespaceTokenizer, but you'd have to
> prevent filters from stripping the parens. If you have to handle things like
> blah ( stuff )
> WhitespaceTokenizer wouldn't work.
> 
> PatternTokenizerFactory might work for you, see:
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternTokenizerFactory.html
> 
> Best
> Erick
> 
> On Tue, Apr 12, 2011 at 6:02 AM, roySolr <royrutten1...@gmail.com> wrote:
> 
> > Hello,
> >
> > I want to split my string when it contains "(". Example:
> >
> > spurs (London)
> > Internationale (milan)
> >
> > to
> >
> > spurs
> > (london)
> > Internationale
> > (milan)
> >
> > What tokenizer can i use to fix this problem?
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Split-token-tp2810772p2810772.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
