Re: Disabling tf (term frequency) during indexing and/or scoring

Alexey Serba Wed, 16 Sep 2009 07:16:14 -0700

Hi Aaron,

You can override the default Lucene Similarity and disable the tf and
lengthNorm factors in the scoring formula ( see
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
and http://lucene.apache.org/java/2_4_1/api/index.html )


You need to

1) Compile the following class and put it into Solr's WEB-INF/classes
-------------------------------------------------------------------------------------------------------------------
package my.similarity;

import org.apache.lucene.search.DefaultSimilarity;

// Similarity that ignores field length and term frequency when scoring.
public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

        // Constant norm for any non-empty field. Norms are written at
        // index time, so reindex after switching to this class.
        @Override
        public float lengthNorm(String fieldName, int numTerms) {
                return numTerms > 0 ? 1.0f : 0.0f;
        }

        // A term scores the same whether it occurs once or many times.
        @Override
        public float tf(float freq) {
                return freq > 0 ? 1.0f : 0.0f;
        }
}
-------------------------------------------------------------------------------------------------------------------

2) Add "<similarity class="my.similarity.NoLengthNormAndTfSimilarity"/>"
to your schema.xml
http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca
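
To see what the tf() override in step 1 changes, here is a rough
standalone sketch that does not depend on Lucene. It assumes the stock
DefaultSimilarity.tf() is the square root of the term frequency; the
class name TfFlatteningDemo and the sample frequencies are made up for
illustration only.

```java
// Standalone illustration; not part of Lucene or Solr.
final class TfFlatteningDemo {

    // Mirrors the stock DefaultSimilarity.tf(): sqrt of the frequency.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Mirrors the tf() override from step 1: 1 if the term occurs at all.
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        // Hypothetical case: the same term contributed by 1, 2 and 3 providers.
        for (int freq = 1; freq <= 3; freq++) {
            System.out.printf("freq=%d default=%.3f flat=%.1f%n",
                    freq, defaultTf(freq), flatTf(freq));
        }
    }
}
```

With the default factor the tf contribution grows with every extra copy
of the name, while the flattened factor stays the same once the term is
present at all, so the number of contributing providers should no longer
inflate the score for that field.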

HIH,
Alex

On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee <ucbmc...@gmail.com> wrote:
> Hello,
>
> Let me preface this by admitting that I'm still fairly new to Lucene and
> Solr, so I apologize if any of this sounds naive and I'm open to thinking
> about my problem differently.
>
> I'm currently responsible for a rather large dataset of business records
> that I'm trying to build a Lucene/Solr infrastructure around, to replace an
> in-house solution that we've been using for a few years. These records are
> sourced from multiple providers and there's often a fair bit of overlap in
> the business coverage. I have a set of fuzzy correlation libraries that I
> use to identify these documents and I ultimately create a super-record that
> includes metadata from each of the providers. Given the nature of things,
> these providers often have slight variations in wording or spelling in the
> overlapping fields (it's amazing how many ways people find to refer to the
> same business or address). I'd like to capture these variations, as they
> facilitate searching, but TF considerations are currently borking field
> scoring here.
>
> For example, taking business names into consideration, I have a Solr schema
> similar to:
>
> <field name="name_provider1" type="string" indexed="false" stored="false"
> multiValued="true"/>
> ...
> <field name="name_providerN" type="string" indexed="false" stored="false"
> multiValued="true"/>
> <field name="nameNorm" type="text" indexed="true" stored="false"
> multiValued="true" omitNorms="true"/>
>
> <copyField source="name_provider1" dest="nameNorm"/>
> ...
> <copyField source="name_providerN" dest="nameNorm"/>
>
> For any given business record, there may be 1..N business names present in
> the nameNorm field (some with naming variations, some identical). With TF
> enabled, however, I'm getting different match scores on this field simply
> based on how many providers contributed to the record, which is not
> meaningful to me. For example, a record containing <nameNorm>foo
> bar<positionIncrementGap>foo bar</nameNorm> is necessarily scoring higher
> than a record just containing <nameNorm>foo bar</nameNorm>.  Although I
> wouldn't mind TF data being considered within each discrete field value, I
> need to find a way to prevent score inflation based simply on the number of
> contributing providers.
>
> Looking at the mailing list archive and searching around, it sounds like the
> omitTf boolean in Lucene used to function somewhat in this manner, but has
> since taken on a broader interpretation (and name) that now also disables
> positional and payload data. Unfortunately, phrase support for fields like
> this is absolutely essential. So what's the best way to address a need like
> this? I guess I don't mind whether this is handled at index time or search
> time, but I'm not sure what I may need to override or if there's some
> existing provision I should take advantage of.
>
> Thank you for any help you may have.
>
> Best regards,
> Aaron
>
