On Wednesday 28 May 2008 01:37:57 Otis Gospodnetic wrote:
> If you have tokenized fields of variable size and you want the field length
> to affect the relevance score, then you do not want to omit norms.
> Omitting norms is good for fields where length is of no importance (e.g.
> gender="Male" vs. gender="Female"). Omitting norms saves you heap/RAM, one
> byte per doc per field without norms, I believe.

I am also toying with the hypothesis that omitting the field norm may be a
good idea for title fields, which typically consist of only a few words, in
languages with compound words.

On our server we use a German language stemmer in conjunction with a
compound word tokenizer, which inserts extra tokens into the stream. Take a
typical short title such as:

  Elterntagung mit Rekordbeteiligung

which is tokenized as (before stemming):

  elterntagung eltern tagung mit rekordbeteiligung rekord beteiligung

The title ends up having 7 tokens instead of 3 or even 5, which
significantly affects the field norms. The reason for retaining the original
compound token is that it forces compound word queries to return only hits
on compound words.

In addition, we also have a copied field with just the 3 tokens that skips
the compound tokenizer, in order to boost queries that match whole words.
As a consequence, according to the "explain" parameter, the match score for
the non-compound title fields is *way* out of proportion.

I will have to experiment a bit. One thing that I want to try is moving the
non-compound field from the qf parameter to the bq parameter, but omitting
the title field norms is also on my list of things to try.

Best regards - Christian
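
To give a rough sense of how much the extra tokens can shift the field norm, here is a minimal Python sketch. It rests on an assumption that is not stated in this thread: that the length norm is computed as 1/sqrt(number of tokens), as Lucene's default similarity is generally documented to do. It ignores the lossy one-byte encoding of stored norms as well as any index-time boosts, and the helper name length_norm is made up for this example.

```python
import math


def length_norm(num_tokens: int) -> float:
    """Length norm under the ASSUMED formula 1 / sqrt(number of tokens)."""
    return 1.0 / math.sqrt(num_tokens)


# Token streams for the example title, before stemming.
plain_tokens = ["elterntagung", "mit", "rekordbeteiligung"]
compound_tokens = [
    "elterntagung", "eltern", "tagung", "mit",
    "rekordbeteiligung", "rekord", "beteiligung",
]

plain_norm = length_norm(len(plain_tokens))        # copied field, 3 tokens
compound_norm = length_norm(len(compound_tokens))  # compound field, 7 tokens

# How much larger the norm of the 3-token field is than the 7-token field.
ratio = plain_norm / compound_norm

print(f"plain field norm:    {plain_norm:.3f}")
print(f"compound field norm: {compound_norm:.3f}")
print(f"ratio:               {ratio:.2f}")
```

Under that assumption the 3-token field gets a norm roughly 1.5 times that of the 7-token field for the same title, which would be one contribution to the imbalance that "explain" shows. It is only a partial picture, since the real score also depends on tf, idf and the query-side boosts.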