: Max - field collapsing may be your friend - 
https://issues.apache.org/jira/browse/SOLR-236

that doesn't really seem related ... i don't believe Max wants to see all 
results from a store "collapsed" into one result, i think he wants to see 
results from different stores treated "more fairly" and to eliminate the 
clustering effect he's seeing where different products from the same store 
tend to have similar scores because of the way the store provides the data 
(and not because of any inherent relevancy of the products)

Max: to really diagnose something like this, you have to consider all the 
details about what exactly your queries look like and spend a lot of time 
looking at score explanations to really get a sense for the "trend" of why 
certain stores score higher than others.

off the cuff, the only thing i can comment on is this specific example you 
made...

: > Shop 'foo' describes its products with 250 words and uses the searched
: > word once. Shop 'bar' describes its products with only 25 words and also
: > uses the searched word once. The score for shop 'foo' will be much worst
: > than for shop 'bar'. In a search in which are many products of shop
: > 'foo' and 'bar' the products of shop 'bar' are shown before the products
: > of shop 'foo'.

depending on how you look at it, 'foo' is spamming you with excess 
keywords and 'bar' deserves to get higher scores.  eliminating "tf" 
probably isn't wise, but you might want to consider omitting norms, so the 
length of the field doesn't factor in ... or you might want to try 
customizing your lengthNorm function (requires writing a Similarity class) 
to make it flatter for 25-250 terms, but drop off sharply if they go 
above 250 (if you consider 250 the threshold for a product description 
before you decide it's "spam").  You could also consider adding a 
numeric "shop_fudge_factor" field that you populate with a number 
indicating the average number of terms in product descriptions from that 
shop (you'd have to compute this yourself and add it to every document) 
and then use that as part of a FunctionQuery to nudge the scores a little 
higher for stores that are long winded.

I would never do that personally though (it encourages keyword spamming in 
product descriptions) but it's something you can try.
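
To make the lengthNorm idea above a bit more concrete, here is a rough 
sketch of the shape of the curve i have in mind. It is plain Java with no 
Lucene dependency: the class name, the method names and the quadratic 
fall-off are all made up for illustration, and the 250 term threshold is 
just the number from the example. In a real setup the same arithmetic would 
go in the lengthNorm method of your own Similarity subclass, and since norms 
are computed at index time you should expect to reindex after changing it.

```java
/**
 * Standalone sketch of a "flat, then steep" length norm.
 * Hypothetical helper for illustration only; this is not a Lucene class.
 */
final class FlatLengthNorm {
    // Assumed cutoff, taken from the 250 term example in this thread.
    static final int SPAM_THRESHOLD = 250;

    /** Classic-style norm for comparison: 1 / sqrt(numTerms). */
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(numTerms, 1)));
    }

    /** Constant up to SPAM_THRESHOLD terms, then falls off quadratically. */
    static float flatNorm(int numTerms) {
        if (numTerms <= SPAM_THRESHOLD) {
            // 25 terms and 250 terms get the same norm
            return 1.0f;
        }
        // Assumed penalty shape: (threshold / length)^2
        float ratio = (float) SPAM_THRESHOLD / numTerms;
        return ratio * ratio;
    }
}
```

With defaultNorm a 25 term description gets a noticeably bigger norm than 
a 250 term one; with flatNorm they are equal, and only descriptions past 
the threshold are penalized.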

A suggestion of *least* resort: if you customize your Similarity class 
such that all the methods round the score components to very coarse 
granularity (ie: 1.2 instead of 1.234567) you should wind up with more 
tight groupings of products with the *exact* same score ... you could then 
do a secondary sort on something else (random perhaps?) to try and make 
the ordering more fair.  (i really have no idea how well that might work)
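
A toy illustration of the rounding plus secondary sort idea, again in plain 
Java with hypothetical names (Hit, coarse, rank). Note that it simplifies 
things: it rounds the *final* score after the fact instead of rounding each 
component inside a Similarity, and it fakes the random secondary sort with 
a seeded shuffle followed by a stable sort. Treat it as a sketch of the 
effect, not of how you would wire it into Solr.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: coarse score buckets with a random tie-break. */
final class CoarseScoreSort {
    /** Minimal stand-in for a search hit. */
    static final class Hit {
        final String id;
        final float score;
        Hit(String id, float score) { this.id = id; this.score = score; }
    }

    /** Rounds to one decimal place, so 1.234567 becomes 1.2. */
    static float coarse(float score) {
        return Math.round(score * 10f) / 10f;
    }

    /** Orders by coarse score descending; ties end up in shuffled order. */
    static List<Hit> rank(List<Hit> hits, long seed) {
        List<Hit> out = new ArrayList<>(hits);
        // Randomize first ...
        Collections.shuffle(out, new Random(seed));
        // ... then rely on List.sort being stable, so hits that share a
        // coarse score keep the shuffled order relative to each other.
        out.sort(Comparator.comparingDouble((Hit h) -> coarse(h.score)).reversed());
        return out;
    }
}
```

Hits scoring 1.234567 and 1.21 both land in the 1.2 bucket, so which of 
them comes first depends only on the seed, while a hit scoring 0.9 always 
sorts after both.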


-Hoss
