Ahmet: I saw your patch updating to 4.7. I have a long plane ride this afternoon that I hope to use to look at it more closely. Thanks for updating it!
And thanks for your comment on putting the $ in the full token. I hadn't thought of that, but I think you're absolutely right.

On Fri, Mar 14, 2014 at 4:50 AM, Ahmet Arslan <iori...@yahoo.com> wrote:
> Hi Erick,
>
> I think it's a very good idea.
>
> What happens when you search "my$ dog$"? I think it does not retrieve your
> example document.
> Since * means zero or more chars, I wonder whether that would be the
> expected behaviour.
>
> If you inject the last token both with and without $, would that harm
> anything?
> d$ do$ dog$ dog
>
> Erick, what do you think about LUCENE-5205? It is a replacement candidate
> for Surround and ComplexPhrase. It has none of their weaknesses. And its
> author Tim Allison responds very fast to any
> comments/questions/improvements/bugs etc. By the way, SOLR-5410 is the
> wrapper for LUCENE-5205.
>
> Ahmet
>
>
> On Friday, March 14, 2014 3:38 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
> or "why haven't I thought of this before"?
>
> I'm once again being faced with the recurring problem of phrase
> searches with wildcards. It'll lead to index bloat, but that's
> acceptable in this situation, at least until proved not so.
>
> The surround query parser can deal with wildcards and proximity, but
> it doesn't accept anything less than three leading characters, which
> is another problem in this case.
>
> I know the complex phrase query parser is out there, but it's not part
> of the code base.
>
> So I'm thinking of modifying the EdgeNGramFilter. I've coded up a
> prototype that seems to work. Basically, it just appends $ to all the
> grams _except_ the last one. I set maxGramSize to 1000, so we'll
> assume the final gram is the original term.
>
> So, indexing "my dog has fleas" I get
>
> pos 1   pos 2   pos 3   pos 4
> m$      d$      h$      f$
> my      do$     ha$     fl$
>         dog     has     fle$
>                         flea$
>                         fleas
>
>
> Now, when users want to search for "m* fleas" within 5 words, they can
> search for:
> "m$ fleas"~5
> or
> "m$ fle$"~5
> or even
> "m$ do$ fle$"~3
>
>
> and they won't get false matches on something like
> "do ha"
>
> You have to accept some simplifications here, of course. This doesn't
> handle things like "fle*s" and the like.
>
> I'm also not sure this is general-purpose enough to make an option for
> EdgeNGramFilterFactory, the use-case is somewhat restricted. But
> that's a relatively natural fit, a new param like
> 'subGramAppendChar="$" '
>
> Thoughts?
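
For anyone following the thread, the gram scheme described in the quoted message can be sketched in plain Python. This is only an illustration under stated assumptions, not the prototype patch and not Lucene's EdgeNGramFilter: the names edge_grams and index_positions and the mark_full_term flag are hypothetical, and whitespace splitting stands in for a real tokenizer. Setting mark_full_term=True also emits the full term with $ appended, as Ahmet suggested.

```python
def edge_grams(term, marker="$", mark_full_term=False):
    """Return the leading-edge grams of `term`.

    Every proper prefix gets `marker` appended; the full term is emitted
    unmarked last. With mark_full_term=True (hypothetical flag), the full
    term is also emitted with the marker, just before the unmarked form.
    """
    # Proper prefixes only: lengths 1 .. len(term) - 1.
    grams = [term[:n] + marker for n in range(1, len(term))]
    if mark_full_term:
        grams.append(term + marker)
    grams.append(term)
    return grams


def index_positions(text, marker="$", mark_full_term=False):
    """Return one list of grams per token position.

    Assumption: tokens are split on whitespace, standing in for a real
    analyzer chain. All grams of a token share that token's position.
    """
    return [edge_grams(tok, marker, mark_full_term) for tok in text.split()]


if __name__ == "__main__":
    for pos, grams in enumerate(index_positions("my dog has fleas"), start=1):
        print("pos", pos, " ".join(grams))
```

Because all grams of a token sit at the same position, a sloppy phrase such as "m$ fle$"~5 can match across positions, while an unmarked gram like "do" is never indexed for "dog", which is what rules out the false match on "do ha".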