Hi all, I'm trying to use PatternTokenizer and I'm not getting the results I expect. I'm not sure where the failure lies. What I'm trying to do is split my input on whitespace, except where the whitespace is preceded by a hyphen character. To do this I'm using a negative lookbehind assertion in the pattern, e.g. "(?<!-)\s+".

Expected behavior:

  "foo bar" -> ["foo","bar"] - OK
  "foo \n bar" -> ["foo","bar"] - OK
  "foo- bar" -> ["foo- bar"] - OK
  "foo-\nbar" -> ["foo-\nbar"] - OK
  "foo- \n bar" -> ["foo- \n bar"] - FAILS

Here's a test case that demonstrates the failure:

  public void testPattern() throws Exception {
    Map<String,String> args = new HashMap<String, String>();
    args.put( PatternTokenizerFactory.GROUP, "-1" );
    args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
    Reader reader = new StringReader("blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar");
    PatternTokenizerFactory tokFactory = new PatternTokenizerFactory();
    tokFactory.init( args );
    TokenStream stream = tokFactory.create( reader );
    assertTokenStreamContents(stream, new String[] { "blah", "foo", "bar- baz", "foo-\nbar- baz", "foo- \n bar" });
  }

This fails with the following output:

  "org.junit.ComparisonFailure: term 4 expected:<foo- [\n bar]> but was:<foo- []>"

Am I doing something wrong? Incorrect expectations? Or could this be a bug?

Thanks,
--jay
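
P.S. In case it helps narrow this down, here is a minimal sketch that runs the same pattern through plain java.util.regex, with no PatternTokenizer involved. The class name LookbehindSplitDemo, the split helper, and the second pattern "(?<![-\s])\s+" are my own assumptions for illustration and are not part of the test above. The second pattern is only one possible variant, and I have not checked it against PatternTokenizer itself.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class: exercises the regex with java.util.regex only,
// without PatternTokenizer or PatternTokenizerFactory.
public class LookbehindSplitDemo {

    // Split the input on every match of the given regex.
    static String[] split(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    public static void main(String[] args) {
        // Pattern from the failing test.
        String original = "(?<!-)\\s+";
        // Assumed variant: a match may not start right after a hyphen
        // OR right after another whitespace character.
        String variant = "(?<![-\\s])\\s+";

        String input = "foo- \n bar";

        // With the original pattern, the space right after '-' is rejected by
        // the lookbehind, but the '\n' that follows is preceded by a space
        // (not a hyphen), so a match can start there.
        System.out.println(Arrays.toString(split(original, input)));

        // With the variant, no position inside the whitespace run after '-'
        // satisfies the lookbehind, so the input is left unsplit.
        System.out.println(Arrays.toString(split(variant, input)));
    }
}
```

If the plain regex already splits "foo- \n bar" into "foo- " and "bar", that would point at my expectations about the lookbehind rather than at the tokenizer.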