Hmmm, I tried this in straight Java, no Solr/Lucene involved and the
behavior I'm seeing is that no example works if it has more than
one whitespace character after the hyphen, including your failure
example.
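
For reference, here is a minimal sketch of that kind of plain-Java check. It assumes that Pattern.split() is a fair stand-in for what the tokenizer does with group -1, and the class and method names are made up for illustration. It applies the pattern from the original message to the five examples.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is involved, no Solr/Lucene.
public class LookbehindSplitDemo {
    // The pattern from the original message: whitespace not preceded by a hyphen.
    static final Pattern ORIGINAL = Pattern.compile("(?<!-)\\s+");

    static String[] split(String input) {
        return ORIGINAL.split(input);
    }

    public static void main(String[] args) {
        String[] inputs = { "foo bar", "foo \n bar", "foo- bar", "foo-\nbar", "foo- \n bar" };
        for (String in : inputs) {
            // Arrays.toString prints embedded newlines literally.
            System.out.println(Arrays.toString(split(in)));
        }
    }
}
```

The last input comes back as two pieces, "foo- " and "bar". The lookbehind only protects the first whitespace character after the hyphen. The regex engine is free to start the match one character later, where the preceding character is whitespace rather than a hyphen.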

I haven't lived inside regexes for long enough to know what the right
regex should be, but it doesn't appear to be a Solr problem.
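
One candidate that might be worth trying, offered only as a sketch: also exclude whitespace in the lookbehind, so a match cannot start in the middle of a whitespace run that follows a hyphen. The pattern "(?<![-\s])\s+" and the class name below are my assumptions, checked here against java.util.regex only, not against PatternTokenizer itself.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the candidate pattern is an assumption, not a confirmed fix.
public class CandidateSplitDemo {
    // Whitespace run whose preceding character is neither a hyphen nor whitespace.
    static final Pattern CANDIDATE = Pattern.compile("(?<![-\\s])\\s+");

    static String[] split(String input) {
        return CANDIDATE.split(input);
    }

    public static void main(String[] args) {
        // The input string from the test case in the original message.
        String input = "blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar";
        System.out.println(Arrays.toString(split(input)));
    }
}
```

With this pattern, a whitespace run directly after a hyphen is never a split point, however long it is, which seems to be the behavior the test case expects.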

Sorry I can't be more helpful.
Erick

On Mon, Nov 28, 2011 at 12:01 PM, Jay Luker <lb...@reallywow.com> wrote:
> Hi all,
>
> I'm trying to use PatternTokenizer and not getting expected results.
> Not sure where the failure lies. What I'm trying to do is split my
> input on whitespace except in cases where the whitespace is preceded
> by a hyphen character. So to do this I'm using a negative look behind
> assertion in the pattern, e.g. "(?<!-)\s+".
>
> Expected behavior:
> "foo bar" -> ["foo","bar"] - OK
> "foo \n bar" -> ["foo","bar"] - OK
> "foo- bar" -> ["foo- bar"] - OK
> "foo-\nbar" -> ["foo-\nbar"] - OK
> "foo- \n bar" -> ["foo- \n bar"] - FAILS
>
> Here's a test case that demonstrates the failure:
>
> public void testPattern() throws Exception {
>     Map<String,String> args = new HashMap<String, String>();
>     args.put( PatternTokenizerFactory.GROUP, "-1" );
>     args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
>     Reader reader = new StringReader("blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar");
>     PatternTokenizerFactory tokFactory = new PatternTokenizerFactory();
>     tokFactory.init( args );
>     TokenStream stream = tokFactory.create( reader );
>     assertTokenStreamContents(stream, new String[] {
>         "blah", "foo", "bar- baz", "foo-\nbar- baz", "foo- \n bar" });
> }
>
> This fails with the following output:
> "org.junit.ComparisonFailure: term 4 expected:<foo- [\n bar]> but was:<foo- []>"
>
> Am I doing something wrong? Incorrect expectations? Or could this be a bug?
>
> Thanks,
> --jay
