Hmmm, I tried this in straight Java, no Solr/Lucene involved, and the behavior I'm seeing is that no example works if it has more than one whitespace character after the hyphen, including your failure example.
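To show what I mean by straight Java, here's a minimal sketch using String.split with your pattern (just the regex behavior, not your actual tokenizer test):

import java.util.Arrays;

public class SplitCheck {
    public static void main(String[] args) {
        String pattern = "(?<!-)\\s+";
        // Single whitespace char after the hyphen: the look-behind blocks the split.
        System.out.println(Arrays.toString("foo- bar".split(pattern)));    // [foo- bar]
        // More than one whitespace char after the hyphen: the engine just starts
        // the match one character later, where the look-behind no longer sees
        // the '-', so the split happens anyway.
        System.out.println(Arrays.toString("foo- \n bar".split(pattern))); // [foo- , bar]
    }
}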
I haven't lived inside regexes long enough to know what the right regex should be, but it doesn't appear to be a Solr problem.

Sorry I can't be more helpful.

Erick

On Mon, Nov 28, 2011 at 12:01 PM, Jay Luker <lb...@reallywow.com> wrote:
> Hi all,
>
> I'm trying to use PatternTokenizer and not getting the expected results.
> Not sure where the failure lies. What I'm trying to do is split my
> input on whitespace except in cases where the whitespace is preceded
> by a hyphen character. So to do this I'm using a negative look-behind
> assertion in the pattern, e.g. "(?<!-)\s+".
>
> Expected behavior:
> "foo bar"     -> ["foo","bar"]     - OK
> "foo \n bar"  -> ["foo","bar"]     - OK
> "foo- bar"    -> ["foo- bar"]      - OK
> "foo-\nbar"   -> ["foo-\nbar"]     - OK
> "foo- \n bar" -> ["foo- \n bar"]   - FAILS
>
> Here's a test case that demonstrates the failure:
>
> public void testPattern() throws Exception {
>   Map<String,String> args = new HashMap<String, String>();
>   args.put( PatternTokenizerFactory.GROUP, "-1" );
>   args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
>   Reader reader = new StringReader("blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar");
>   PatternTokenizerFactory tokFactory = new PatternTokenizerFactory();
>   tokFactory.init( args );
>   TokenStream stream = tokFactory.create( reader );
>   assertTokenStreamContents(stream, new String[] { "blah", "foo",
>       "bar- baz", "foo-\nbar- baz", "foo- \n bar" });
> }
>
> This fails with the following output:
> "org.junit.ComparisonFailure: term 4 expected:<foo- [\n bar]> but was:<foo- []>"
>
> Am I doing something wrong? Incorrect expectations? Or could this be a bug?
>
> Thanks,
> --jay