Check my thinking on this, wildcard matching in phrases.

Erick Erickson Thu, 13 Mar 2014 18:39:12 -0700

or "why haven't I thought of this before"?

I'm once again being faced with the recurring problem of phrase
searches with wildcards. The approach below will lead to index bloat,
but that's acceptable in this situation, at least until proved otherwise.


The surround query parser can deal with wildcards and proximity, but
it doesn't accept anything less than three leading characters, which
is another problem in this case.

I know the complex phrase query parser is out there, but it's not part
of the code base.

So I'm thinking of modifying the EdgeNGramFilter, and I've coded up a
prototype that seems to work. Basically, it just appends $ to all the
grams _except_ the last one. I set maxGramSize to 1000, so we'll
assume the final gram is the original term.
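
To make the marking rule concrete, here is a minimal sketch in plain
Python. It is not the prototype and not a Lucene TokenFilter; the
function name marked_edge_grams and its parameters are made up for
illustration, and it simply assumes minGramSize is 1.

```python
def marked_edge_grams(term, marker="$", min_gram=1, max_gram=1000):
    """Return the leading-edge n-grams of term, with marker appended
    to every gram except the last (longest) one."""
    longest = min(len(term), max_gram)
    grams = [term[:n] for n in range(min_gram, longest + 1)]
    # All grams but the final one are prefixes, so they get the marker.
    # The final gram is assumed to be the whole term (max_gram is large).
    return [g + marker for g in grams[:-1]] + grams[-1:]


if __name__ == "__main__":
    for word in "my dog has fleas".split():
        print(word, marked_edge_grams(word))
```

If a term were ever longer than max_gram, the last gram would be a
truncated prefix left unmarked, which is why the large maxGramSize
matters.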

So, indexing "my dog has fleas" I get:

pos 1   pos 2   pos 3   pos 4
m$      d$      h$      f$
my      do$     ha$     fl$
        dog     has     fle$
                        flea$
                        fleas


Now, when users want to search for "m* fleas" within 5 words, they can
search for:
"m$ fleas"~5
or
"m$ fle$"~5
or even
"m$ do$ fle$"~3


and they won't get false matches on something like
"do ha"
because those prefixes are only in the index as do$ and ha$.
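
A rough way to check these claims without an index is to simulate them.
The sketch below is an assumption-laden toy: build_index and
sloppy_match are hypothetical helpers, positions are simple word
offsets, and the slop test (spread of "position minus query offset")
only approximates how Lucene scores sloppy phrases.

```python
from itertools import product


def marked_edge_grams(term, marker="$"):
    """Edge n-grams of term; every gram but the full term gets the marker."""
    grams = [term[:n] for n in range(1, len(term) + 1)]
    return [g + marker for g in grams[:-1]] + grams[-1:]


def build_index(text):
    """Map each indexed gram to the set of 1-based word positions it occurs at."""
    index = {}
    for pos, word in enumerate(text.split(), start=1):
        for gram in marked_edge_grams(word):
            index.setdefault(gram, set()).add(pos)
    return index


def sloppy_match(index, phrase, slop=0):
    """Simplified sloppy phrase test: some choice of one position per query
    term must have a spread of (position - query offset) within slop."""
    terms = phrase.split()
    postings = [index.get(t, ()) for t in terms]
    if not all(postings):
        return False  # at least one query term is not in the index at all
    for combo in product(*postings):
        if len(set(combo)) < len(combo):
            continue  # each query term needs its own position
        shifted = [p - i for i, p in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


if __name__ == "__main__":
    idx = build_index("my dog has fleas")
    for phrase, slop in [("m$ fleas", 5), ("m$ fle$", 5),
                         ("m$ do$ fle$", 3), ("do ha", 0)]:
        print(phrase, slop, sloppy_match(idx, phrase, slop))
```

In this toy model the three marked queries match and "do ha" does not,
since neither bare prefix is ever emitted as a token.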

You have to accept some simplifications here, of course. This doesn't
handle things like "fle*s".
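
On the query side, users would presumably still type "m* fleas", so
something has to turn the trailing * into the marker. A minimal sketch
of that rewrite follows; rewrite_wildcard_phrase is a hypothetical
helper, not part of any query parser, and it just rejects the patterns
this scheme can't express.

```python
def rewrite_wildcard_phrase(phrase, slop, marker="$"):
    """Rewrite a phrase with trailing-* terms into the marked-gram form,
    e.g. 'm* fleas' with slop 5 becomes '"m$ fleas"~5'."""
    rewritten = []
    for term in phrase.split():
        is_prefix = term.endswith("*")
        body = term[:-1] if is_prefix else term
        # Only a single trailing wildcard is supported; 'fle*s' or a
        # bare '*' cannot be expressed with edge grams.
        if "*" in body or not body:
            raise ValueError("unsupported wildcard pattern: %r" % term)
        rewritten.append(body + marker if is_prefix else body)
    return '"%s"~%d' % (" ".join(rewritten), slop)


if __name__ == "__main__":
    print(rewrite_wildcard_phrase("m* fleas", 5))
    print(rewrite_wildcard_phrase("m* do* fle*", 3))
```

One thing this makes visible: with the marking rule as described, m$
would not match a word that is exactly "m", because the full term is
indexed without the marker.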

I'm also not sure this is general-purpose enough to make an option for
EdgeNGramFilterFactory, since the use-case is somewhat restricted. But
that's a relatively natural fit: a new param like
'subGramAppendChar="$"'

Thoughts?
