Re: field normalization and omitNorms

Christian Vogler Wed, 28 May 2008 09:39:01 -0700

On Wednesday 28 May 2008 01:37:57 Otis Gospodnetic wrote:
> If you have tokenized fields of variable size and you want the field length
> to affect the relevance score, then you do not want to omit norms. 
> Omitting norms is good for fields where length is of no importance (e.g.
> gender="Male" vs. gender="Female").  Omitting norms saves you heap/RAM, one
> byte per doc per field without norms, I believe.


I am also toying with the hypothesis that omitting the field norm may be a 
good idea for title fields in languages with compound words, since such 
titles typically consist of only a few words. 

On our server we use a German language stemmer in conjunction with a compound 
word tokenizer, which inserts extra tokens into the stream. With typical 
short titles, such as:

Elterntagung mit Rekordbeteiligung,

which is tokenized as (before stemming):

elterntagung eltern tagung mit rekordbeteiligung rekord beteiligung, 

the title ends up having 7 tokens instead of 3 (or 5, if the compounds were 
simply replaced by their parts), which significantly affects the field norms. 
The reason for retaining the original compound token is that it forces 
compound word queries to return only hits on compound words.
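
To put rough numbers on this, here is a minimal sketch in Python. It assumes 
the classic Lucene length norm of 1/sqrt(number of tokens), ignores index-time 
boosts and the lossy one-byte norm encoding, and assumes every inserted token 
counts toward the field length. The tiny decompounding dictionary and the 
function names are hypothetical stand-ins, not our actual tokenizer.

```python
import math

# Hypothetical mini-dictionary; a real compound tokenizer uses a full word list.
COMPOUND_PARTS = {
    "elterntagung": ["eltern", "tagung"],
    "rekordbeteiligung": ["rekord", "beteiligung"],
}


def tokenize(title, decompound):
    """Lowercase and split on whitespace; optionally add compound parts
    after the original token, keeping the original token in the stream."""
    tokens = []
    for word in title.lower().split():
        tokens.append(word)
        if decompound:
            tokens.extend(COMPOUND_PARTS.get(word, []))
    return tokens


def length_norm(num_tokens):
    """Assumed length norm: 1 / sqrt(number of tokens in the field)."""
    return 1.0 / math.sqrt(num_tokens)


title = "Elterntagung mit Rekordbeteiligung"
plain = tokenize(title, decompound=False)
compound = tokenize(title, decompound=True)

for name, tokens in (("plain", plain), ("compound", compound)):
    print(name, len(tokens), round(length_norm(len(tokens)), 3))
```

Under these assumptions the 3-token field gets a norm of about 0.58 and the 
7-token field about 0.38, so the same term match is weighted roughly 1.5 times 
higher in the non-compound field before any other factor comes into play.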

We also have a copied field with just the 3 tokens that skips the compound 
tokenizer, in order to boost queries that match whole words. As a 
consequence, according to the "explain" parameter, the match score for the 
non-compound title field is *way* out of proportion.

I will have to experiment a bit - one thing that I want to try is moving the 
non-compound field from the qf parameter to the bq parameter, but omitting 
the title field norms is also on my list of things to try.
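
For reference, the two request variants I have in mind look roughly like the 
sketch below. It only builds the dismax request parameters; the field names 
"title" and "title_exact" and the boost value of 2 are hypothetical 
placeholders, and the user query is assumed to need no escaping.

```python
from urllib.parse import urlencode

# Hypothetical field names: the compound-tokenized title and its plain copy.
COMPOUND_FIELD = "title"
EXACT_FIELD = "title_exact"


def qf_variant(user_query):
    """Current setup: both fields are searched through qf."""
    return {"q": user_query, "qf": f"{COMPOUND_FIELD} {EXACT_FIELD}^2"}


def bq_variant(user_query):
    """Alternative: search only the compound field and add the plain
    field as an optional boost query via bq."""
    return {
        "q": user_query,
        "qf": COMPOUND_FIELD,
        "bq": f"{EXACT_FIELD}:({user_query})^2",
    }


print(urlencode(qf_variant("elterntagung")))
print(urlencode(bq_variant("elterntagung")))
```

Whether the bq variant actually brings the scores back into proportion is 
exactly what the experiment has to show.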

Best regards
- Christian
