Issues with whitespace tokenization in QueryParser

John Berryman Sun, 10 Jun 2012 20:03:24 -0700

According to https://issues.apache.org/jira/browse/LUCENE-2605, the Lucene
QueryParser tokenizes on whitespace before giving any text to the
Analyzer. This makes it impossible to use multi-term synonyms because the
SynonymFilter only receives one word at a time.


A resolution to this would really help with my current project. My client
sells clothing and accessories online. They have plenty of examples of
compound words, e.g. "rain coat", and some of these compound words are
really tripping them up. A prime example is that a search for "dress shoes"
returns a list of dresses and random shoes (not necessarily dress shoes). I
would like to use synonyms to map compound words to single tokens (e.g.
"dress shoes => dress_shoes"), but with this whitespace tokenization issue,
it's impossible.
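
To make the problem concrete, here is a toy model in Python. It is only a
sketch of the behavior as I understand it from the JIRA issue, not Lucene
code: the names analyze, parse_split_first, parse_whole and the SYNONYMS
table are all made up for illustration, and the "analyzer" is just a
lowercasing whitespace splitter with a two-word phrase lookup.

```python
# Toy model of the problem; none of these names are Lucene APIs.

# Hypothetical synonym rule: "dress shoes => dress_shoes"
SYNONYMS = {("dress", "shoes"): "dress_shoes"}


def analyze(text):
    """Stand-in for an Analyzer with a synonym filter.

    Lowercases, splits on whitespace, then merges any known
    two-word phrase into its single-token synonym.
    """
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in SYNONYMS:
            out.append(SYNONYMS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def parse_split_first(query):
    """Mimics the reported behavior: split on whitespace first,
    then analyze each chunk on its own."""
    terms = []
    for chunk in query.split():
        terms.extend(analyze(chunk))
    return terms


def parse_whole(query):
    """The behavior I am hoping for: the analyzer sees the whole query."""
    return analyze(query)


print(parse_split_first("dress shoes"))  # ['dress', 'shoes']
print(parse_whole("dress shoes"))        # ['dress_shoes']
```

In the split-first case the synonym step only ever sees "dress" or "shoes"
alone, so the two-word rule can never match.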

Has anything happened with this bug recently? For a short time I've got a
client that would be willing to pay for this issue to be fixed if it's not
too much of a rabbit hole. Anyone care to catch me up on what this might
entail?

LinkedIn <http://www.linkedin.com/pub/john-berryman/13/b17/864>
Twitter <http://twitter.com/#!/jnbrymn>
