Tanguy,

Your idea is perfect for cases where there are many documents and 80-90% of them have the same value for a particular field. For example, it would be ideal if we had 10 documents in total like this:
doc1 : <merchantName> Kellog's </merchantName>
doc2 : <merchantName> Kellog's </merchantName>
doc3 : <merchantName> Kellog's </merchantName>
doc4 : <merchantName> Kellog's </merchantName>
doc5 : <merchantName> Kellog's </merchantName>
doc6 : <merchantName> Kellog's </merchantName>
doc7 : <merchantName> Kellog's </merchantName>
doc8 : <merchantName> Nestle </merchantName>
doc9 : <merchantName> Kellog's </merchantName>
doc10 : <merchantName> Kellog's </merchantName>

But I have

doc1 : <merchantName> Maggi </merchantName>
doc2 : <merchantName> Maggi </merchantName>
doc3 : <merchantName> M&M's </merchantName>
doc4 : <merchantName> M&M's </merchantName>
doc5 : <merchantName> Hershey's </merchantName>
doc6 : <merchantName> Hershey's </merchantName>
doc7 : <merchantName> Nestle </merchantName>
doc8 : <merchantName> Nestle </merchantName>
doc9 : <merchantName> Kellog's </merchantName>
doc10 : <merchantName> Kellog's </merchantName>

Thanks,
Karthick

On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal <tanguy.m...@gmail.com> wrote:

> Hello,
>
> I don't know if that could help, but if I understood your issue, you have
> a lot of documents with the same or very close scores. Moreover I think you
> get your matches in Merchant order (more or less) because they must be
> indexed in that very same order, so solr returns documents of same scores
> in insertion order (although there is no contract specifying this)
>
> You could work around that issue by :
> 1/ Turning off tf/idf because you're searching in documents with little
> text where only the match counts, but frequencies obviously aren't helping.
> 2/ Add a random number to each document at index time, and boost on that
> random value at query time, this will shuffle your results, that's probably
> the simplest thing to do.
>
> Hope this helps,
>
> Tanguy
>
> 2012/8/20 Karthick Duraisamy Soundararaj <d.s.karth...@gmail.com>
>
>> Hello Mikhail,
>> Thank you for the reply.
>> In terms of user experience, I want to spread out the products from the
>> same brand farther from each other, *at least* in the first 50-100 results
>> we display. I am thinking about two different approaches as a solution.
>>
>> 1. For the first few results, display one top-scoring product per
>> manufacturer (for a given field, display the top-scoring results of the
>> unique field values for the first N matches). This N could be either a
>> percentage relative to total matches or a configurable absolute value.
>> 2. Enforce a penalty on the score for the results that have duplicate
>> field values. The penalty can be enforced in such a way that the results
>> with higher scores are not affected, unlike the ones with lower scores.
>>
>> Both of the solutions can be implemented while sorting the documents with
>> TopFieldCollector / TopScoreDocCollector.
>>
>> Does this answer your question? Please let me know if you have any more
>> questions.
>>
>> Thanks,
>> Karthick
>>
>> On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
>> mkhlud...@griddynamics.com> wrote:
>>
>>> Hello,
>>>
>>> I've got the problem description below. Can you explain the expected
>>> user experience, and/or solution approach before diving into the algorithm
>>> design?
>>>
>>> Thanks
>>>
>>>
>>> On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
>>> karthick.soundara...@gmail.com> wrote:
>>>
>>>> My problem is that when there are a lot of documents representing
>>>> products, products from the same manufacturer seem to appear in close
>>>> proximity in the results, and therefore it doesn't provide brand
>>>> diversity. When you search for sofas, you get sofas from manufacturer A
>>>> dominating the first page while the sofas from manufacturer B dominate
>>>> the second page, etc.
>>>> The issue here is that a manufacturer tends to describe the different
>>>> sofas he produces the same way, and therefore there is very little
>>>> difference between the documents representing two sofas.
>>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> Tech Lead
>>> Grid Dynamics
>>>
>>> <http://www.griddynamics.com>
>>> <mkhlud...@griddynamics.com>
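
To make the second approach quoted above (a score penalty for results with duplicate field values) more concrete, here is a minimal sketch in plain Python. It is not Solr or Lucene code. The function name `diversify`, the `decay` factor, and the dict-based document shape are all assumptions made up for illustration; a real implementation would presumably sit in a custom collector or a re-ranking step.

```python
from collections import defaultdict


def diversify(results, field="merchantName", decay=0.5):
    """Re-rank results so that repeated field values are pushed down.

    results: list of dicts, each with a "score" key and the given field
             (hypothetical document shape, not a Solr response format).
    decay:   hypothetical penalty factor, applied once for every
             higher-scoring result that shares the same field value.
    """
    seen = defaultdict(int)  # field value -> number of results seen so far
    rescored = []

    # Walk the results from the best to the worst original score.
    for doc in sorted(results, key=lambda d: d["score"], reverse=True):
        # The first result for a field value keeps its score (decay ** 0 == 1);
        # each later one with the same value is penalized more heavily.
        penalty = decay ** seen[doc[field]]
        rescored.append((doc["score"] * penalty, doc))
        seen[doc[field]] += 1

    # Python's sort is stable, so ties keep their original relative order.
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in rescored]
```

With this sketch, the best product of each merchant keeps its original score, so only the lower-scoring duplicates move down. How aggressive `decay` should be would need tuning against real scores.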