Scenario:

You're submitting a block of text as a query.

You're content to let Solr / Lucene handle query parsing, tokenization,
etc.

But you'd like ALL of the leaf nodes eventually produced in the parse tree
to have:
* Boolean Occur.MUST (effectively a + prefix)
* Fuzzy match of ~1 or ~2

In a simple application, and if there were no punctuation, you could
preprocess the query, effectively:
* split on whitespace
* for t in tokens: t = "+" + t + "~2"
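
A minimal sketch of that naive preprocessing, in Python. The function name
and the default distance are made up for illustration; it does nothing
beyond the two steps listed above:

```python
def naive_must_fuzzy(query: str, distance: int = 2) -> str:
    """Prefix each whitespace-separated token with + and append ~distance.

    Hypothetical helper: it knows nothing about the field's analyzer,
    so stop words and punctuation pass straight through.
    """
    tokens = query.split()  # split on runs of whitespace
    return " ".join(f"+{t}~{distance}" for t in tokens)


print(naive_must_fuzzy("the chair"))  # +the~2 +chair~2
```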

But this is ugly, and even then I think things like stop words would be
messed up:
* OK in Solr:  the chair  (it can properly remove "the")
* But if this:  +the~2 +chair~2  (I'm not sure this would work)

Sure, at the application level you could also remove the stop words in the
"for t in tokens" loop, but then some other weird case would come up.
Maybe one of the field's analyzers has some other token filter you forgot
about, so you'd have to bring that logic forward as well.
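
For illustration, a sketch of that application-level stop word removal,
again in Python. The stop list here is a made-up stand-in; the real one
lives in the field's analyzer configuration, which is exactly the
duplication being complained about:

```python
# Hypothetical stop list; in practice it would have to mirror whatever
# the field's analyzer is configured with.
STOP_WORDS = {"the", "a", "an", "of"}


def must_fuzzy_without_stop_words(query, distance=2, stop_words=STOP_WORDS):
    """Drop stop words, then add + and ~distance to each remaining token."""
    kept = [t for t in query.split() if t.lower() not in stop_words]
    return " ".join(f"+{t}~{distance}" for t in kept)


print(must_fuzzy_without_stop_words("The chair"))          # +chair~2
print(must_fuzzy_without_stop_words("the chair, please"))  # +chair,~2 +please~2
```

The second call shows the next weird case already: the comma stays glued
to "chair", because this loop knows nothing about the tokenizer either.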

(It's a long story why I'd want to do all this. I know people think
adding ~2 to every token will give bad results anyway; I'm trying to fix
inherited code that can't be scrapped, etc.)

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
