Re: distributed search mechanism

Eason . Lee Thu, 04 Sep 2008 20:12:07 -0700

2008/8/31 Grégoire Neuville <[EMAIL PROTECTED]>

> Hi all,
>
> I've recently been working with the distributed search capabilities of Solr
> to build a web portal; all is working fine, but it is now time for me to
> describe my work from a "theoretical" point of view.
>
> I've been trying to get a rough picture of the distributed search mechanism,
> first by browsing the code, but it's too complex for me; then by
> reading the JIRA comments accompanying the commits, where I found this:
>
> ***************
> The search request processing on the set of shards is performed as follows:
>
> STEP 1: The query is built, terms are extracted. Global numDocs and docFreqs
> are calculated by requesting all the shards and adding up numDocs and
> docFreqs from each shard.
>
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and
> docFreqs are passed as request parameters. All document fields are NOT
> requested, only document uniqFields and sort fields are requested.
> MoreLikeThis and Highlighting information are NOT requested.
>
> Etc...
> ***************
>
> This is typically the kind of description I need, but I wonder if the one
> cited above is still valid (since it was apparently written quite some time
> before the final commit).



The main steps remain the same, but the details have changed a lot.
Global TF/IDF is not supported yet.
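
For reference, the global statistics described in STEP 1 of the quoted text
would just be a sum over the shards. A minimal sketch of that idea, not Solr
code: the function name, the dict layout and the numbers are made up for
illustration.

```python
from collections import Counter

def global_stats(shard_stats):
    """Add up numDocs and per-term docFreqs reported by each shard."""
    num_docs = sum(stats["numDocs"] for stats in shard_stats)
    doc_freqs = Counter()
    for stats in shard_stats:
        # Counter.update adds the counts term by term
        doc_freqs.update(stats["docFreqs"])
    return num_docs, dict(doc_freqs)

# Hypothetical per-shard statistics
stats = [
    {"numDocs": 100, "docFreqs": {"blah": 10}},
    {"numDocs": 50, "docFreqs": {"blah": 1, "foo": 3}},
]
total_docs, total_freqs = global_stats(stats)
```

Without this step each shard scores with its own local numDocs and docFreqs.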


>
> Assuming it is, what is then the difference between the STEPS mentioned and
> the STAGES later introduced (STAGE_START, STAGE_PARSE_QUERY, etc.)?
>
> How is the ranking of the documents in the merged set of responses
> calculated (especially when sorting on a field)?


Generally speaking:
in the first step only the documents' unique key fields (uniqFields) and sort
fields are requested, so the documents can be merged according to the sort
fields and then refetched by unique key (this time getting all the fields
needed).
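
The two phases above can be sketched roughly like this. It is only an
illustration of the idea, not the Solr implementation: the function names,
the "price" sort field, the shard names and the in-memory dicts standing in
for shards are all assumptions.

```python
def merge_top_ids(shard_hits, sort_field, rows):
    """Phase 1: each shard returned only ids and sort values; merge them."""
    merged = []
    for shard, hits in shard_hits.items():
        for hit in hits:
            # remember which shard owns the document so it can be refetched
            merged.append((hit[sort_field], hit["id"], shard))
    merged.sort()  # ascending on the sort value, ties broken by id
    return merged[:rows]

def fetch_documents(top, shard_stores):
    """Phase 2: fetch the full documents, by id, from the owning shards."""
    return [shard_stores[shard][doc_id] for _, doc_id, shard in top]

# Hypothetical phase 1 responses from two shards
hits = {
    "solr1": [{"id": "a", "price": 5}, {"id": "c", "price": 9}],
    "solr2": [{"id": "b", "price": 7}],
}
# Hypothetical full documents held by each shard
stores = {
    "solr1": {"a": {"id": "a", "price": 5, "title": "blah one"},
              "c": {"id": "c", "price": 9, "title": "blah three"}},
    "solr2": {"b": {"id": "b", "price": 7, "title": "blah two"}},
}
top = merge_top_ids(hits, "price", rows=2)
docs = fetch_documents(top, stores)
```

Only the documents that make it into the merged top rows are fetched in full,
which is why the first phase can get away with ids and sort values.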


>
> Finally, is the order of the parameters in the query significant in a
> distributed search case? (i.e., is there a difference between:
>   - http://server1:port1/solr1/?q=title:blah&shards=server1:port1/solr1,server1:port1/solr2
> and
>   - http://server1:port1/solr1/?shards=server1:port1/solr1,server1:port1/solr2&q=title:blah
> ?
> (this last question is more related to the distributed deadlock topic on
> the wiki: my understanding is that in my first example the "title:blah"
> query is sent as a top-level query to solr1 and as a "shard query" to both
> solr1 and solr2 (deadlock risk), while in the second example "title:blah"
> is not sent to solr1 as a top-level query. Am I right?))


There is no difference between the two queries above, since all parameters are
put into a map.
The search is not executed at the top level; it is only run on the shards in
the shards list.
The query sent to each shard has an isShard option added, so the shards just
do the search themselves without forwarding the query to other shards.
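
A toy model of the two points above, parameters landing in a map and the
isShard flag stopping further fan-out. It is a sketch under assumptions, not
Solr code: parse() and plan() are made-up names, and dropping the shards
parameter from the sub-requests is my assumption about how the flag is used.

```python
from urllib.parse import parse_qs

def parse(query_string):
    # parse_qs returns a dict, so the position of a parameter in the URL is lost
    return parse_qs(query_string)

def plan(params):
    """Return the sub-requests a node would send out.

    An empty list means "just search the local index".
    """
    if params.get("isShard") == ["true"] or "shards" not in params:
        return []
    shards = params["shards"][0].split(",")
    # assumption: sub-requests carry isShard and no shards list
    sub = {k: v for k, v in params.items() if k != "shards"}
    sub["isShard"] = ["true"]
    return [(shard, sub) for shard in shards]

first = parse("q=title:blah&shards=server1:port1/solr1,server1:port1/solr2")
second = parse("shards=server1:port1/solr1,server1:port1/solr2&q=title:blah")
same = first == second          # parameter order made no difference
sub_requests = plan(first)      # one sub-request per entry in shards
```

Calling plan() on the parameters of any sub-request gives an empty list, so a
shard never fans the query out again.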


>
> That's a lot of questions and maybe too long a post: sorry.
>
> Thanks a lot if you feel the courage to answer,
>

The answer above is just my understanding, not official :)


>
> --
> Grégoire Neuville
>
