Hi Erick,

I think it's a very good idea.

What happens when you search "my$ dog$"? I think it does not retrieve your 
example document, because the full terms are indexed without the trailing $. 
Since * means zero or more chars, I wonder whether that would be the expected 
behaviour. 

If you inject the last token both with and without $, would that harm 
anything? For "dog" that would give: d$ do$ dog$ dog 
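
To make the question concrete, here is a minimal plain-Python sketch of the gram generation, not the actual EdgeNGramFilter code. The function name edge_grams_with_marker and the flag also_marked_full are made up for illustration; the flag switches on the variant where the full token is emitted both with and without $.

```python
def edge_grams_with_marker(token, marker="$", also_marked_full=False):
    """Return the edge n-grams of token, shortest first.

    Every proper prefix gets the marker appended. The full token is always
    emitted unmarked; with also_marked_full=True it is additionally emitted
    with the marker (hypothetical variant, not part of the prototype).
    """
    grams = [token[:i] + marker for i in range(1, len(token))]
    if also_marked_full:
        grams.append(token + marker)
    grams.append(token)
    return grams


print(edge_grams_with_marker("dog"))                         # ['d$', 'do$', 'dog']
print(edge_grams_with_marker("dog", also_marked_full=True))  # ['d$', 'do$', 'dog$', 'dog']
```

With the flag on, "my" yields m$ my$ my, so a query such as "my$ dog$" would find the terms it needs, at the cost of one extra term per token.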

Erick, what do you think about LUCENE-5205? It is a replacement candidate for 
Surround and ComplexPhrase. It has none of their weaknesses. And its author, Tim 
Allison, responds very fast to any comments/questions/improvements/bugs etc. By 
the way, SOLR-5410 is the wrapper for LUCENE-5205.

Ahmet



On Friday, March 14, 2014 3:38 AM, Erick Erickson <erickerick...@gmail.com> 
wrote:
or "why haven't I thought of this before"?

I'm once again being faced with the recurring problem of phrase
searches with wildcards. It'll lead to index bloat, but that's
acceptable in this situation, at least until proved not so.

The surround query parser can deal with wildcards and proximity, but
it doesn't accept anything less than three leading characters, which
is another problem in this case.

I know the complex phrase query parser is out there, but it's not part
of the code base.

So I'm thinking of modifying the EdgeNGramFilter. I've coded up a
prototype that seems to work. Basically, it just appends $ to all the
grams _except_ the last one. I set maxGramSize to 1000, so we'll
assume the final gram is the original term.

So, indexing "my dog has fleas" I get
pos 1   pos 2   pos 3   pos 4
m$      d$      h$      f$
my      do$     ha$     fl$
        dog     has     fle$
                        flea$
                        fleas


Now, when users want to search for "m* fleas" within 5 words, they can
search for:
"m$ fleas"~5
or
"m$ fle$"~5
or even
"m$ do$ fle$"~3


and they won't get false matches on something like
"do ha"
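
A rough way to check these claims outside Lucene is the plain-Python sketch below. It is only an illustration under stated assumptions: whitespace tokenization, 1-based positions, and a simplified in-order phrase check in which slop is the number of skipped positions between the first and last query term. Lucene's real sloppy phrase matching is more permissive (it can also match reordered terms), and all function names here are made up.

```python
from itertools import product


def marked_edge_grams(token, marker="$"):
    # Proper prefixes get the marker; the full token is left unmarked.
    return [token[:i] + marker for i in range(1, len(token))] + [token]


def build_postings(text, marker="$"):
    # term -> list of 1-based positions (whitespace tokenization assumed).
    postings = {}
    for pos, tok in enumerate(text.split(), start=1):
        for gram in marked_edge_grams(tok, marker):
            postings.setdefault(gram, []).append(pos)
    return postings


def in_order_phrase_match(postings, terms, slop):
    # Simplified check: terms must occur in query order, and the number of
    # positions skipped between the first and last term must be <= slop.
    if not terms:
        return False
    position_lists = [postings.get(t, []) for t in terms]
    for combo in product(*position_lists):
        in_order = all(a < b for a, b in zip(combo, combo[1:]))
        skipped = combo[-1] - combo[0] - (len(terms) - 1)
        if in_order and skipped <= slop:
            return True
    return False


postings = build_postings("my dog has fleas")
print(in_order_phrase_match(postings, ["m$", "fleas"], 5))       # True
print(in_order_phrase_match(postings, ["m$", "fle$"], 5))        # True
print(in_order_phrase_match(postings, ["m$", "do$", "fle$"], 3)) # True
print(in_order_phrase_match(postings, ["do", "ha"], 0))          # False
```

The last query fails because the partial grams are only indexed as do$ and ha$, so the bare terms "do" and "ha" never appear for this document.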

You have to accept some simplifications here, of course. This doesn't
handle things like "fle*s" and the like.

I'm also not sure this is general-purpose enough to make an option for
EdgeNGramFilterFactory, since the use case is somewhat restricted. But
it would be a relatively natural fit, with a new param like
'subGramAppendChar="$"'

Thoughts?
