Go ahead and file a Jira and hopefully that will attract some committer
attention that might shed some more light.
Beyond that, sure, you can build Solr yourself and change the query parser
code to put a larger number in for maxExpansions.
You might also try developing a test case, say 100 small test documents with
similar values and see if the 50 limit seems to account for behavior that
you see with that test dataset.
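
As a rough sketch of how such a test set could be generated: the Python below builds 100 small documents whose values are all within one edit of a base term. The field names "id" and "name" and the base term "albert" are assumptions chosen for illustration, not taken from anyone's actual schema.

```python
import json
import string

def one_edit_variants(term):
    """All distinct strings one insertion, substitution, or deletion from term."""
    variants = set()
    for i in range(len(term) + 1):
        for ch in string.ascii_lowercase:
            variants.add(term[:i] + ch + term[i:])          # insertion at i
            if i < len(term):
                variants.add(term[:i] + ch + term[i + 1:])  # substitution at i
        if i < len(term):
            variants.add(term[:i] + term[i + 1:])           # deletion at i
    variants.discard(term)  # substituting a letter with itself yields the term
    return sorted(variants)

base = "albert"  # hypothetical base term
names = [base] + one_edit_variants(base)[:99]

# "id" and "name" are placeholder field names; adjust to the real schema.
docs = [{"id": str(n), "name": name} for n, name in enumerate(names, start=1)]
payload = json.dumps(docs, indent=2)
print(len(docs), "documents generated")
```

The resulting JSON could then be indexed into a scratch core and queried with albert~1 to see how many of the 100 documents come back.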
-- Jack Krupansky
-----Original Message-----
From: Ryan Wilson
Sent: Thursday, May 16, 2013 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Strange fuzzy behavior in 4.2.1
This might explain why our dev database of 400,000 records doesn't seem to
suffer from this. When we started seeing this in our test environment of
300,000,000 records, we thought we just weren't finding records in dev that
were having the problem.
One thing that this does not explain is that we have located a few terms
that find nothing but the original term, despite having possible matches
one edit away. For example, albert will not find anything but albert,
despite there being alberta, albart, etc. I am reading into the
maxExpansions variable and how it functions as I am writing this, so I might
be missing the connection.
I note that you say this is a hardcoded behavior. Would I be safe in
assuming that I will need to build a custom solr.war to make changes to
this setting? I want to see if sliding this number up/down will let me
confirm that it is indeed maxExpansions that is the problem.
Finally, if it is maxExpansions that is the problem, is there any solution
beyond the aforementioned custom war?
-Ryan Wilson
On Thu, May 16, 2013 at 8:40 AM, Jack Krupansky
<j...@basetechnology.com> wrote:
Maybe you are running into the same problem I posted on another message
thread about the hard-coded maxExpansions limit of 50. In other words, once
Lucene finds 50 terms that do match, it won't find the additional matches.
And that is not necessarily the top 50, but the first 50 in the index.
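
To illustrate the behavior described above, here is a toy model in Python. It is only a sketch of the "first N matches in term order" idea, not Lucene's actual code; the function names, the term list, and the limit of 3 (standing in for 50) are all made up for the example.

```python
def within_one_edit(a, b):
    """True if a and b differ by at most one insert, delete, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution
    return a[i:] == b[i + 1:]          # exactly one extra character in b

def fuzzy_expand(sorted_terms, query, max_expansions):
    """Toy model: keep the first max_expansions matching terms in term order."""
    matches = []
    for term in sorted_terms:
        if within_one_edit(query, term):
            matches.append(term)
            if len(matches) == max_expansions:
                break  # later terms are never considered
    return matches

terms = sorted(["albart", "albert", "alberta", "alberto", "albery", "elbert"])
print(fuzzy_expand(terms, "albert", max_expansions=3))
```

In this model every term in the list is one edit or less from albert, yet with a limit of 3 only albart, albert, and alberta are kept, and elbert is never reached because it sorts last.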
See if you can reproduce the problem with a small data set of no more than
a couple dozen documents.
-- Jack Krupansky
-----Original Message-----
From: Ryan Wilson
Sent: Thursday, May 16, 2013 9:28 AM
To: solr-user@lucene.apache.org
Subject: RE: Strange fuzzy behavior in 4.2.1
In answering your first questions, any changes we’ve been making have been
followed by a reindex.
The data that is being indexed generally looks something like this
(<space> indicating an actual space):
TIM <space> , <space> JULIO
JULIE <space> , <space> JIM
So based off what we see from looking at top terms in the field and the
analysis tool, at index time these records are being broken up such that
TIM , JULIO can be found with tim or Julio.
Just to make sure I’m not misunderstanding something about Solr/Lucene,
when a record is indexed, the index analysis chain result (<tim> <,>
<julio>) is what is written to disk, correct? So far as I understand it, it’s
the query analysis chain that has the issue with most filters not being
applied during wildcard and fuzzy queries.
Finally, some clarification as I’ve realized my original email might not
have made this point well. I can have a particular record with a primary
key of X and a name value of LEWIS , JULIA and be able to find that exact
record with bulia~1 but not aulia~1, or GUERRERO , JULIAN , JULIAN can be
found with julan~1 but not julia~1. It’s not that records go missing when
searched for with fuzzy, but rather the fuzzy terms that will find them
seem, to my eyes, inconsistent.
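
As a sanity check on the examples above, the following Python sketch computes the plain Levenshtein distance for each query/term pair. It assumes the indexed terms are the lowercased tokens julia and julian, and it uses a textbook edit-distance routine rather than anything from Lucene.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# (query term, assumed indexed term) pairs from the examples above
pairs = [("bulia", "julia"), ("aulia", "julia"),
         ("julan", "julian"), ("julia", "julian")]
distances = {pair: levenshtein(*pair) for pair in pairs}
for (query, indexed), d in distances.items():
    print(f"{query}~1 vs {indexed}: distance {d}")
```

All four pairs come out at distance 1, so on edit distance alone each of these queries would be expected to match.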
Regards,
Ryan Wilson
rpwils...@gmail.com