This might explain why our dev database of 400,000 records doesn't seem to suffer from this. When we started seeing this in our test environment of 300,000,000 records, we thought we just weren't finding records in dev that were having the problem.
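To convince myself that a cap on expansions could produce this, here is a toy model in Python. It is only a sketch of the behavior Jack describes (stop after the first N matching terms in sorted term order), not Lucene's actual code. The function names, the synthetic "crowd" terms, and the sample index are all made up for illustration, and the edit check ignores transpositions.

```python
from string import ascii_lowercase


def within_one_edit(a: str, b: str) -> bool:
    """True if b is reachable from a by at most one insertion, deletion,
    or substitution (transpositions are not handled in this sketch)."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i


def expand_fuzzy(query, index_terms, max_expansions=50):
    """Hypothetical model of a capped fuzzy expansion: walk the terms in
    sorted order and stop once max_expansions matches have been collected."""
    matches = []
    for term in sorted(index_terms):
        if within_one_edit(query, term):
            matches.append(term)
            if len(matches) == max_expansions:
                break
    return matches


# Small "dev" index: few terms, so the cap is never reached.
small_index = {"tim", "julio", "julie", "jim", "julia", "julian"}

# Synthetic terms that are all one insertion away from "aulia" and that
# sort before "julia", standing in for a much larger index.
crowd = {"aulia"[:i] + c + "aulia"[i:] for i in range(1, 6) for c in ascii_lowercase}
large_index = small_index | crowd

print(expand_fuzzy("aulia", small_index))             # julia is found
print("julia" in expand_fuzzy("aulia", large_index))  # cap reached before julia
print(expand_fuzzy("bulia", large_index))             # few candidates, julia is found
```

In this model aulia misses julia in the large index only because more than 50 other candidates sort ahead of it, while bulia has few candidates and still finds it. That would fit the dev-versus-test difference, assuming the real cap works roughly this way.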
One thing that this does not explain is that we have located a few terms that find nothing but the original term, despite there being possible matches one edit away. For example, albert will not find anything but albert, despite there being alberta, albart, etc. I am reading up on the maxExpansions variable and how it functions as I write this, so I might be missing the connection.

You say this is hard-coded behavior. Would I be safe in assuming that I will need to build a custom solr.war to change this setting? I want to see whether sliding this number up or down will let me confirm that maxExpansions is indeed the problem.

Finally, if maxExpansions is the problem, is there any solution beyond the aforementioned custom war?

-Ryan Wilson

On Thu, May 16, 2013 at 8:40 AM, Jack Krupansky <j...@basetechnology.com> wrote:

> Maybe you are running into the same problem I posted on another message
> thread about the hard-coded maxExpansions limit of 50. In other words, once
> Lucene finds 50 terms that do match, it won't find the additional matches.
> And that is not necessarily the top 50, but the first 50 in the index.
>
> See if you can reproduce the problem with a small data set of no more than
> a couple dozen documents.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Ryan Wilson
> Sent: Thursday, May 16, 2013 9:28 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Strange fuzzy behavior in 4.2.1
>
> In answering your first questions, any changes we’ve been making have been
> followed by a reindex.
>
> The data that is being indexed generally looks something like this (<space>
> indicating an actual space):
>
> TIM <space> , <space> JULIO
> JULIE <space> , <space> JIM
>
> So based off what we see from looking at top terms in the field and the
> analysis tool, at index time these records are being broken up such that
> TIM , JULIO can be found with tim or Julio.
>
> Just to make sure I’m not misunderstanding something about Solr/Lucene,
> when a record is indexed the index analysis chain result (<tim> <,>
> <julio>) is what is written to disk correct? So far as I understand it it’s
> the query analysis chain that has the issue with most filters not being
> applied during wildcard and fuzzy queries.
>
> Finally, some clarification as I’ve realized my original email might not
> have made this point well. I can have a particular record with a primary
> key of X and a name value of LEWIS , JULIA and be able to find that exact
> record with bulia~1 but not aulia~1, or GUERRERO , JULIAN , JULIAN can be
> found with julan~1 but not julia~1. It’s not that records go missing when
> searched for with fuzzy, but rather the fuzzy terms that will find them
> seem, to my eyes, inconsistent.
>
> Regards,
>
> Ryan Wilson
> rpwils...@gmail.com