Re: Check my thinking on this, wildcard matching in phrases.

Erick Erickson Fri, 14 Mar 2014 06:36:44 -0700

Ahmet:

I saw your patch updating to 4.7. I have a long plane ride this
afternoon that I hope to use to look at it more closely. Thanks for
updating it!


And thanks for your comment on putting the $ on the full token. I
hadn't thought of that, but I think you're absolutely right.

Thanks....

On Fri, Mar 14, 2014 at 4:50 AM, Ahmet Arslan <iori...@yahoo.com> wrote:
> Hi Erick,
>
> I think it's a very good idea.
>
> What happens when you search "my$ dog$"? I think it does not retrieve your
> example document.
> Since * means zero or more chars, I wonder whether that would be the expected
> behaviour.
>
> If you inject the last token both with and without $, would that harm
> anything? For example: d$ do$ dog$ dog
>
> Erick, what do you think about LUCENE-5205? It is a replacement candidate for
> Surround and ComplexPhrase. It has none of their weaknesses. And its author,
> Tim Allison, responds very fast to any comments/questions/improvements/bugs
> etc. By the way, SOLR-5410 is the wrapper for LUCENE-5205.
>
> Ahmet
>
>
>
> On Friday, March 14, 2014 3:38 AM, Erick Erickson <erickerick...@gmail.com> 
> wrote:
> or "why haven't I thought of this before"?
>
> I'm once again being faced with the recurring problem of phrase
> searches with wildcards. The approach below will lead to index bloat, but
> that's acceptable in this situation, at least until proved otherwise.
>
> The surround query parser can deal with wildcards and proximity, but
> it doesn't accept anything less than three leading characters, which
> is another problem in this case.
>
> I know the complex phrase query parser is out there, but it's not part
> of the code base.
>
> So I'm thinking of modifying the EdgeNGramFilter; I've coded up a
> prototype that seems to work. Basically, it just appends $ to all the
> grams _except_ the last one. I set maxGramSize to 1000, so we'll
> assume the final gram is the original term.
>
> So, indexing "my dog has fleas" I get
> pos 1   pos 2   pos 3   pos 4
> m$      d$      h$      f$
> my      do$     ha$     fl$
>         dog     has     fle$
>                         flea$
>                         fleas
>
>
> Now, when users want to search for "m* fleas" within 5 words, they can
> search for:
> "m$ fleas"~5
> or
> "m$ fle$"~5
> or even
> "m$ do$ fle$"~3
>
>
> and they won't get false matches on something like
> "do ha"
>
> You have to accept some simplifications here, of course. This doesn't
> handle things like "fle*s" and the like.
>
> I'm also not sure this is general-purpose enough to make an option for
> EdgeNGramFilterFactory, since the use-case is somewhat restricted. But
> that's a relatively natural fit: a new param like
> 'subGramAppendChar="$"'
>
> Thoughts?
>
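A minimal sketch of the gram scheme described in the quoted message, written
in plain Python rather than as a Lucene TokenFilter. The function names and
the mark_full_token flag are hypothetical and are not part of the prototype
or of EdgeNGramFilter; mark_full_token=True models Ahmet's suggestion of
emitting the last token both with and without $.

```python
def marked_edge_grams(token, marker="$", mark_full_token=False):
    """Return the edge n-grams of `token`, shortest first.

    Every proper prefix gets `marker` appended; the full token is emitted
    unmarked. With mark_full_token=True the marked full token is emitted
    as well (the variant suggested in the reply).
    """
    # Proper prefixes only: lengths 1 .. len(token) - 1.
    grams = [token[:n] + marker for n in range(1, len(token))]
    if token and mark_full_token:
        grams.append(token + marker)
    if token:
        grams.append(token)
    return grams


def index_positions(text, **kwargs):
    """Map 1-based word positions to the grams stacked at that position."""
    return {
        pos: marked_edge_grams(word, **kwargs)
        for pos, word in enumerate(text.split(), start=1)
    }
```

Calling index_positions("my dog has fleas") gives the same stacks as the
table in the quoted message, one list of grams per position.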

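On the query side, the user-facing wildcard has to be turned into the marker
before the phrase is searched. A rough sketch of that step, assuming only
trailing wildcards are supported (the quoted message says "fle*s" is not
handled). The function name rewrite_wildcard_phrase is hypothetical.

```python
def rewrite_wildcard_phrase(phrase, slop, marker="$"):
    """Turn a phrase such as 'm* fleas' into the string '"m$ fleas"~5'.

    Only a single trailing '*' per term is accepted; anything else
    (for example 'fle*s' or a bare '*') raises ValueError.
    """
    terms = []
    for term in phrase.split():
        if term == "*" or "*" in term[:-1]:
            raise ValueError("only trailing wildcards are supported: " + term)
        if term.endswith("*"):
            term = term[:-1] + marker
        terms.append(term)
    return '"{}"~{}'.format(" ".join(terms), slop)
```

Note that with the scheme as originally posted, a rewritten term such as
"dog$" would not match a document containing exactly "dog", which is the
case Ahmet raises with "my$ dog$".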