Re: Disabling tf (term frequency) during indexing and/or scoring

Aaron McKee Fri, 18 Sep 2009 06:39:12 -0700


Hi Alexey,

Thank you for your suggestion! My understanding of Similarity, though, is that this would affect the entire index, whereas I need something that is field-configurable. Looking at Similarity.tf(), it seems to be independent of the field (and unaware of it). I don't necessarily want to disable tf entirely, as it'll likely be useful for other fulltext fields. Looking at more of the code, I'm guessing I'll need to get under the hood a fair bit more and possibly write a custom TermScorer and TermQuery.
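
To illustrate what "field-configurable" would mean here, a minimal sketch in plain Java follows. It is not Lucene API: PerFieldTfPolicy, its tf(fieldName, freq) method, and the "description" field are hypothetical names made up for this example, and the sqrt(freq) fallback only mirrors the default tf factor documented for DefaultSimilarity. Wiring something like this into actual scoring would still require the kind of custom scorer work mentioned above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, NOT a Lucene class: shows a tf factor that depends on the field.
final class PerFieldTfPolicy {

    // Fields for which term frequency should be flattened to 0 or 1.
    private final Set<String> flatTfFields;

    PerFieldTfPolicy(String... flatTfFields) {
        this.flatTfFields = new HashSet<String>(Arrays.asList(flatTfFields));
    }

    // Returns the tf factor for a term occurring `freq` times in `fieldName`.
    float tf(String fieldName, float freq) {
        if (freq <= 0) {
            return 0.0f;
        }
        if (flatTfFields.contains(fieldName)) {
            return 1.0f; // repeated occurrences do not raise the score
        }
        return (float) Math.sqrt(freq); // assumption: same shape as the default tf factor
    }

    public static void main(String[] args) {
        PerFieldTfPolicy policy = new PerFieldTfPolicy("nameNorm");
        System.out.println("nameNorm:    " + policy.tf("nameNorm", 4f));
        System.out.println("description: " + policy.tf("description", 4f));
    }
}
```

With this policy, four occurrences in nameNorm contribute the same factor as one, while other fields keep frequency-sensitive scoring.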

I suppose I'm curious why the omitTfAndPositions option conflates two apparently independent features. It seems like it would have been entirely reasonable to treat these as separate options, as their use cases don't necessarily overlap. I suppose it was just the path of least resistance or the assumed common-case scenario.


Anyways, thanks again for your time.

Best regards,
Aaron

Alexey Serba wrote:

Hi Aaron,

You can override the default Lucene Similarity and disable the tf and
lengthNorm factors in the scoring formula (see
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
and http://lucene.apache.org/java/2_4_1/api/index.html )

You need to

1) Compile the following class and put it into Solr's WEB-INF/classes
-------------------------------------------------------------------------------------------------------------------
package my.pkg;

import org.apache.lucene.search.DefaultSimilarity;

public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

        // Ignore field length: every non-empty field gets the same norm.
        @Override
        public float lengthNorm(String fieldName, int numTerms) {
                return numTerms > 0 ? 1.0f : 0.0f;
        }

        // Ignore term frequency: a term either occurs or it does not.
        @Override
        public float tf(float freq) {
                return freq > 0 ? 1.0f : 0.0f;
        }
}
-------------------------------------------------------------------------------------------------------------------

2) Add "<similarity class="my.pkg.NoLengthNormAndTfSimilarity"/>"
into your schema.xml
http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca

HIH,
Alex

On Mon, Sep 14, 2009 at 9:50 PM, Aaron McKee <ucbmc...@gmail.com> wrote:

Hello,

Let me preface this by admitting that I'm still fairly new to Lucene and
Solr, so I apologize if any of this sounds naive and I'm open to thinking
about my problem differently.

I'm currently responsible for a rather large dataset of business records
that I'm trying to build a Lucene/Solr infrastructure around, to replace an
in-house solution that we've been using for a few years. These records are
sourced from multiple providers and there's often a fair bit of overlap in
the business coverage. I have a set of fuzzy correlation libraries that I
use to identify these documents and I ultimately create a super-record that
includes metadata from each of the providers. Given the nature of things,
these providers often have slight variations in wording or spelling in the
overlapping fields (it's amazing how many ways people find to refer to the
same business or address). I'd like to capture these variations, as they
facilitate searching, but TF considerations are currently borking field
scoring here.

For example, taking business names into consideration, I have a Solr schema
similar to:

<field name="name_provider1" type="string" indexed="false" stored="false"
multiValued="true"/>
...
<field name="name_providerN" type="string" indexed="false" stored="false"
multiValued="true"/>
<field name="nameNorm" type="text" indexed="true" stored="false"
multiValued="true" omitNorms="true"/>

<copyField source="name_provider1" dest="nameNorm"/>
...
<copyField source="name_providerN" dest="nameNorm"/>

For any given business record, there may be 1..N business names present in
the nameNorm field (some with naming variations, some identical). With TF
enabled, however, I'm getting different match scores on this field simply
based on how many providers contributed to the record, which is not
meaningful to me. For example, a record containing <nameNorm>foo
bar<positionIncrementGap>foo bar</nameNorm> is necessarily scoring higher
than a record just containing <nameNorm>foo bar</nameNorm>.  Although I
wouldn't mind TF data being considered within each discrete field value, I
need to find a way to prevent score inflation based simply on the number of
contributing providers.
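
To make the inflation concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the default tf factor of sqrt(freq), as documented for DefaultSimilarity, and compares it with a flattened factor. The class TfInflationDemo and its methods are illustrative names only, and the whitespace tokenization is a stand-in for whatever analyzer the "text" field type actually uses.

```java
// Illustrative only: plain Java, not part of Lucene or Solr.
final class TfInflationDemo {

    // Assumption: mirrors the default tf factor, sqrt(freq).
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Flattened tf: any positive frequency counts exactly once.
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    // Counts occurrences of `term` across all values of a multi-valued field,
    // using naive lowercase + whitespace tokenization.
    static int termFreq(String term, String... fieldValues) {
        int count = 0;
        for (String value : fieldValues) {
            for (String token : value.toLowerCase().split("\\s+")) {
                if (token.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int oneProvider = termFreq("foo", "foo bar");
        int twoProviders = termFreq("foo", "foo bar", "foo bar");

        System.out.println("default tf: " + defaultTf(oneProvider)
                + " vs " + defaultTf(twoProviders));
        System.out.println("flat tf:    " + flatTf(oneProvider)
                + " vs " + flatTf(twoProviders));
    }
}
```

Under the default factor the two-provider record gets a larger tf contribution for "foo" purely because the value was copied in twice; with the flattened factor both records contribute the same amount.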

Looking at the mailing list archive and searching around, it sounds like the
omitTf boolean in Lucene used to function somewhat in this manner, but has
since taken on a broader interpretation (and name) that now also disables
positional and payload data. Unfortunately, phrase support for fields like
this is absolutely essential. So what's the best way to address a need like
this? I guess I don't mind whether this is handled at index time or search
time, but I'm not sure what I may need to override or if there's some
existing provision I should take advantage of.

Thank you for any help you may have.

Best regards,
Aaron
