Hi Aaron,

You can override the default Lucene Similarity and disable the tf and lengthNorm factors in the scoring formula (see http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html and http://lucene.apache.org/java/2_4_1/api/index.html).
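To illustrate why flattening tf removes the inflation you describe, here is a small standalone sketch. It does not use Lucene at all; the class and method names are made up for the example, and the frequencies are hypothetical. As far as I know, DefaultSimilarity computes tf as the square root of the term frequency, so a term contributed by several providers gets a larger factor than the same term contributed once. Returning 1 for any match makes the factor independent of the number of contributing providers.

```java
public class TfFlatteningDemo {

    // Mirrors what DefaultSimilarity.tf does: square root of the raw frequency.
    public static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Flattened variant: any match counts exactly once, no match counts as zero.
    public static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        // Hypothetical case: the same term supplied by 1, 2, and 4 providers.
        float[] freqs = {1f, 2f, 4f};
        for (float freq : freqs) {
            System.out.println("freq=" + freq
                    + " defaultTf=" + defaultTf(freq)
                    + " flatTf=" + flatTf(freq));
        }
    }
}
```

With the default formula the factor grows with every extra copy of the name; with the flattened one it stays at 1. Note that this also flattens tf within a single field value, which you said you would not mind keeping.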
You need to:

1. Compile the following class and put it into Solr's WEB-INF/classes:

-------------------------------------------------------------------------------------------------------------------
// Placeholder package name. ("my.package" does not compile: "package" is a reserved word.)
package my.pkg;

import org.apache.lucene.search.DefaultSimilarity;

public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

    // Ignore field length: every non-empty field gets the same norm.
    public float lengthNorm(String fieldName, int numTerms) {
        return numTerms > 0 ? 1.0f : 0.0f;
    }

    // Ignore term frequency: a term either matches (1) or does not (0).
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}
-------------------------------------------------------------------------------------------------------------------

2. Add <similarity class="my.pkg.NoLengthNormAndTfSimilarity"/> to your schema.xml:
http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca

Note that lengthNorm is written into the index at indexing time, so that part only takes effect after you reindex (and it is irrelevant for fields with omitNorms="true"). The tf change applies at query time.

HIH,
Alex

On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee <ucbmc...@gmail.com> wrote:
> Hello,
>
> Let me preface this by admitting that I'm still fairly new to Lucene and
> Solr, so I apologize if any of this sounds naive and I'm open to thinking
> about my problem differently.
>
> I'm currently responsible for a rather large dataset of business records
> that I'm trying to build a Lucene/Solr infrastructure around, to replace an
> in-house solution that we've been using for a few years. These records are
> sourced from multiple providers and there's often a fair bit of overlap in
> the business coverage. I have a set of fuzzy correlation libraries that I
> use to identify these documents and I ultimately create a super-record that
> includes metadata from each of the providers. Given the nature of things,
> these providers often have slight variations in wording or spelling in the
> overlapping fields (it's amazing how many ways people find to refer to the
> same business or address). I'd like to capture these variations, as they
> facilitate searching, but TF considerations are currently borking field
> scoring here.
>
> For example, taking business names into consideration, I have a Solr schema
> similar to:
>
> <field name="name_provider1" type="string" indexed="false" stored="false"
> multiValued="true">
> ...
> <field name="name_providerN" type="string" indexed="false" stored="false"
> multiValued="true">
> <field name="nameNorm" type="text" indexed="true" stored="false"
> multiValued="true" omitNorms="true">
>
> <copyField source="name_provider1" dest="nameNorm">
> ...
> <copyField source="name_providerN" dest="nameNorm">
>
> For any given business record, there may be 1..N business names present in
> the nameNorm field (some with naming variations, some identical). With TF
> enabled, however, I'm getting different match scores on this field simply
> based on how many providers contributed to the record, which is not
> meaningful to me. For example, a record containing <nameNorm>foo
> bar<positionIncrementGap>foo bar</nameNorm> is necessarily scoring higher
> than a record just containing <nameNorm>foo bar</nameNorm>. Although I
> wouldn't mind TF data being considered within each discrete field value, I
> need to find a way to prevent score inflation based simply on the number of
> contributing providers.
>
> Looking at the mailing list archive and searching around, it sounds like the
> omitTf boolean in Lucene used to function somewhat in this manner, but has
> since taken on a broader interpretation (and name) that now also disables
> positional and payload data. Unfortunately, phrase support for fields like
> this is absolutely essential. So what's the best way to address a need like
> this? I guess I don't mind whether this is handled at index time or search
> time, but I'm not sure what I may need to override or if there's some
> existing provision I should take advantage of.
>
> Thank you for any help you may have.
>
> Best regards,
> Aaron
>