Hi,

This is regarding an issue that we are facing with SOLR distributed search.
In our application, we maintain multiple shards on the SOLR server to
spread the load. But there is a problem with the order of the results that
we return to the client during a search.

For example: currently there are two shards across which the data is
randomly distributed.
When I search for something, I observe that the results from one shard
appear first, followed by the results from the other shard.

Moreover, we order the results by applying two levels of sorting
(also configurable per user):
1. Score
2. Modified Time
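
A minimal sketch of what such a request could look like, assuming the
standard `q`, `shards` and `sort` request parameters. The host names and
the `modified_time` field name are placeholders for illustration only:

```python
from urllib.parse import urlencode


def build_query_url(base_url, query, shards, sort_fields):
    """Build a distributed search URL with a multi-level sort.

    base_url, shards and the field names in sort_fields are hypothetical
    placeholders; substitute the values from the real deployment.
    """
    params = {
        "q": query,
        # Comma-separated list of shards to fan the query out to.
        "shards": ",".join(shards),
        # First sort key is the relevance score, second breaks ties.
        "sort": ",".join(f"{field} {order}" for field, order in sort_fields),
    }
    return f"{base_url}/select?{urlencode(params)}"


url = build_query_url(
    "http://host1:8983/solr",
    "title:report",
    ["host1:8983/solr", "host2:8983/solr"],
    [("score", "desc"), ("modified_time", "desc")],
)
print(url)
```

Because score is the primary key, any per-shard difference in scoring
affects the merged order before the modified time is ever compared.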

I investigated the above scenario and found that documents coming from one
shard do not necessarily have the same score as documents coming from the
other shard, even if the documents are identical.
I also went through the SOLR documentation and related links, and found
that distributed search in SOLR currently has a limitation:
inverse document frequency (IDF) calculations cannot be distributed, so
TF/IDF computations are done per shard.

This issue is particularly visible when there is a significant difference
between the number of documents indexed in each shard (for example, the
first shard has 15000 docs and the second has 5000).
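
To illustrate the effect, here is a rough sketch using the classic Lucene
IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)). The shard sizes are the
ones mentioned above; the document frequencies for the search term are
made-up values chosen only to show the skew, not measurements from our
index:

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# (number of docs in shard, docs containing the term); the second value
# in each pair is a hypothetical figure for illustration.
shards = {
    "shard1": (15000, 30),
    "shard2": (5000, 100),
}

# IDF as each shard computes it locally, using only its own statistics.
per_shard_idf = {
    name: classic_idf(doc_freq, num_docs)
    for name, (num_docs, doc_freq) in shards.items()
}

# IDF that a global computation over both shards would produce.
total_docs = sum(num_docs for num_docs, _ in shards.values())
total_doc_freq = sum(doc_freq for _, doc_freq in shards.values())
global_idf = classic_idf(total_doc_freq, total_docs)

for name, idf in per_shard_idf.items():
    print(f"{name}: local idf = {idf:.3f}")
print(f"global idf = {global_idf:.3f}")
```

With these numbers the same term gets a noticeably higher IDF on the first
shard than on the second, so an identical document would score differently
depending on which shard holds it.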

Please review and let me know whether our findings for the above scenario
are correct.

Also, as per our investigation, there is work currently ongoing in the SOLR
community to support distributed/global IDF. However, I would like to know
whether there is any solution available right now to manage or control the
score of the documents during distributed search, so that the results are
ordered by relevance across shards.

Thanks
Rashi
