According to https://issues.apache.org/jira/browse/LUCENE-2605, the Lucene QueryParser tokenizes on whitespace before giving any text to the Analyzer. This makes it impossible to use multi-term synonyms, because the SynonymFilter receives only one word at a time.
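To make the problem concrete, here is a toy Python model of the behaviour as I understand it from the issue. It is not Lucene code: `SYNONYMS`, `apply_synonyms`, `parse_split_first` and `parse_whole` are hypothetical names made up for illustration, and the single rule mirrors a mapping like "dress shoes => dress_shoes".

```python
# Toy model of the behaviour described in LUCENE-2605; this is NOT Lucene code.
# All names here are hypothetical and exist only for illustration.

# One multi-term synonym rule: the token sequence on the left maps to one token.
SYNONYMS = {("dress", "shoes"): "dress_shoes"}


def apply_synonyms(tokens):
    """Stand-in for a synonym filter: replace token runs that match a rule."""
    out = []
    i = 0
    while i < len(tokens):
        for phrase, replacement in SYNONYMS.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(replacement)
                i += len(phrase)
                break
        else:
            # No rule matched at this position; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


def parse_split_first(query):
    """Mimic the reported behaviour: split on whitespace, analyze each chunk alone."""
    result = []
    for chunk in query.split():
        result.extend(apply_synonyms([chunk]))
    return result


def parse_whole(query):
    """The desired behaviour: the synonym step sees all tokens at once."""
    return apply_synonyms(query.split())


print(parse_split_first("dress shoes"))  # ['dress', 'shoes']
print(parse_whole("dress shoes"))        # ['dress_shoes']
```

In the split-first path the synonym step never sees "dress" and "shoes" together, so the two-word rule can never fire.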
A resolution to this would really help with my current project. My client sells clothing and accessories online. They have plenty of examples of compound words, e.g. "rain coat", and some of these compound words are really tripping them up. A prime example is that a search for "dress shoes" returns a list of dresses and random shoes (not necessarily dress shoes). I wish I could map compound words to single tokens with a synonym rule (e.g. "dress shoes => dress_shoes"), but with this whitespace tokenization issue, it's impossible.

Has anything happened with this bug recently? For a short time I've got a client that would be willing to pay for this issue to be fixed, if it's not too much of a rabbit hole. Anyone care to catch me up on what this might entail?

LinkedIn <http://www.linkedin.com/pub/john-berryman/13/b17/864>
Twitter <http://twitter.com/#!/jnbrymn>