Tanguy,
Your idea is perfect for cases where there are many documents and
80-90% of them have the same value for a particular field.
For example, it would be ideal if we had 10 documents in total
like this:
doc1 : <merchantName> Kellog's </merchantName>
doc2 : <merchantName> Kellog's </merchantName>
doc3 : <merchantName> Kellog's </merchantName>
doc4 : <merchantName> Kellog's </merchantName>
doc5 : <merchantName> Kellog's </merchantName>
doc6 : <merchantName> Kellog's </merchantName>
doc7 : <merchantName> Kellog's </merchantName>
doc8 : <merchantName> Nestle </merchantName>
doc9 : <merchantName> Kellog's </merchantName>
doc10 : <merchantName> Kellog's </merchantName>
But what I have looks like this:
doc1 : <merchantName> Maggi </merchantName>
doc2 : <merchantName> Maggi </merchantName>
doc3 : <merchantName> M&M's </merchantName>
doc4 : <merchantName> M&M's </merchantName>
doc5 : <merchantName> Hershey's </merchantName>
doc6 : <merchantName> Hershey's </merchantName>
doc7 : <merchantName> Nestle </merchantName>
doc8 : <merchantName> Nestle </merchantName>
doc9 : <merchantName> Kellog's </merchantName>
doc10 : <merchantName> Kellog's </merchantName>
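
To make the second case concrete, here is a minimal sketch (plain Python, not Lucene or Solr code) of the score-penalty idea from option 2 in the quoted message below. The scores, the function name diversify and the penalty factor of 0.5 are all made up for illustration; a real implementation would live in a custom collector or a post-processing step.

```python
from collections import defaultdict

# Hypothetical hits: (doc id, score, merchantName). The scores are invented
# and deliberately very close, mimicking the near-identical documents.
hits = [
    ("doc1", 1.00, "Maggi"), ("doc2", 0.99, "Maggi"),
    ("doc3", 0.98, "M&M's"), ("doc4", 0.97, "M&M's"),
    ("doc5", 0.96, "Hershey's"), ("doc6", 0.95, "Hershey's"),
    ("doc7", 0.94, "Nestle"), ("doc8", 0.93, "Nestle"),
    ("doc9", 0.92, "Kellog's"), ("doc10", 0.91, "Kellog's"),
]


def diversify(results, penalty=0.5):
    """Re-rank hits so repeated merchants are pushed down.

    Walks the hits in descending score order and multiplies each score by
    penalty ** k, where k is the number of higher-scored hits already seen
    for the same merchant. The first hit of every merchant keeps its score.
    """
    seen = defaultdict(int)  # merchant -> number of hits seen so far
    rescored = []
    for doc_id, score, merchant in sorted(results, key=lambda r: r[1], reverse=True):
        adjusted = score * (penalty ** seen[merchant])
        seen[merchant] += 1
        rescored.append((doc_id, adjusted, merchant))
    # Final ordering by the penalized score.
    rescored.sort(key=lambda r: r[1], reverse=True)
    return rescored


if __name__ == "__main__":
    for doc_id, score, merchant in diversify(hits):
        print(f"{doc_id:6s} {score:.3f} {merchant}")
```

With these made-up scores, the top five results each come from a different merchant, and the second product of each merchant follows afterwards. The penalty factor controls how strongly duplicates are pushed down; a value closer to 1 lets a clearly better-scoring duplicate stay near the top.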
Thanks,
Karthick
On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal <[email protected]> wrote:
> Hello,
>
> I don't know if that could help, but if I understood your issue, you have
> a lot of documents with the same or very close scores. Moreover I think you
> get your matches in Merchant order (more or less) because they must be
> indexed in that very same order, so Solr returns documents with the same
> score in insertion order (although there is no contract specifying this).
>
> You could work around that issue by:
> 1/ Turning off tf/idf, because you're searching in documents with little
> text where only the match counts and frequencies obviously aren't helping.
> 2/ Adding a random number to each document at index time and boosting on
> that random value at query time. This will shuffle your results, and it's
> probably the simplest thing to do.
>
> Hope this helps,
>
> Tanguy
>
> 2012/8/20 Karthick Duraisamy Soundararaj <[email protected]>
>
>> Hello Mikhail,
>> Thank you for the reply. In terms of user experience, I want to spread
>> products from the same brand farther apart from each other, *at least*
>> in the first 50-100 results we display. I am considering two different
>> approaches as a solution.
>>
>> 1. For the first few results, display only the top-scoring product of
>> each manufacturer (that is, for a given field, display the top-scoring
>> result for each unique field value within the first N matches). This N
>> could be either a percentage of the total matches or a configurable
>> absolute value.
>> 2. Enforce a penalty on the score of results that have duplicate field
>> values. The penalty can be enforced in such a way that results with
>> higher scores are not affected, as opposed to the ones with lower scores.
>>
>> Both of the solutions can be implemented while sorting the documents with
>> TopFieldCollector / TopScoreDocCollector.
>>
>> Does this answer your question? Please let me know if you have any more
>> questions.
>>
>> Thanks,
>> Karthick
>>
>> On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> I've got the problem description below. Can you explain the expected
>>> user experience, and/or solution approach before diving into the algorithm
>>> design?
>>>
>>> Thanks
>>>
>>>
>>> On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
>>> [email protected]> wrote:
>>>
>>>> My problem is that when there are a lot of documents representing
>>>> products, products from the same manufacturer tend to appear close
>>>> together in the results, so the results lack brand diversity. When you
>>>> search for sofas, sofas from manufacturer A dominate the first page
>>>> while sofas from manufacturer B dominate the second page, etc. The
>>>> issue here is that a manufacturer tends to describe the different
>>>> sofas he produces the same way, and therefore there is very little
>>>> difference between the documents representing two sofas.
>>>>
>>>
>>>
>>>
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>>> Tech Lead
>>> Grid Dynamics
>>>
>>> <http://www.griddynamics.com>
>>> <[email protected]>
>>>
>>>
>>
>>
>