I don't know for sure. First of all, this will be expensive, so it's
probably a good idea to ask whether you really need this kind of paging.
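To put some numbers on that expense, here's a minimal sketch of the cost
arithmetic for the scenario discussed below (3 shards, page 400, rows=20).
The function and its names are illustrative, not actual Solr code:

```python
# Hypothetical sketch of deep-paging cost in a sharded query.
# To serve page `start/rows`, each shard must return its top
# (start + rows) candidates, and the collecting shard merges all
# of them just to keep `rows` docs.

def deep_paging_cost(num_shards, start, rows):
    per_shard = start + rows           # candidates each shard must collect
    merged = num_shards * per_shard    # ID/score pairs merged by the collector
    kept = (start, start + rows)       # slice of the merged list actually returned
    return per_shard, merged, kept

# Page 400 with rows=20 -> start = 400 * 20 = 8000
print(deep_paging_cost(3, 8000, 20))   # (8020, 24060, (8000, 8020))
```

So the collector handles 24060 pairs to return 20 docs, which is why deep
paging over shards gets expensive quickly.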

But extrapolating from the single-node case, where ties are broken by
internal document ID, I'd expect that ties are broken by some
combination of internal Lucene doc ID and shard ID, but that's
purely a guess.

You'd probably have to look at the source code to be sure.
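For illustration, here's what a merge with that guessed tie-break order
(score descending, then shard ID, then internal doc ID) could look like.
This is a sketch of the assumption above, not Solr's actual merge logic:

```python
# Hypothetical merge of per-shard results, breaking score ties by
# shard ID and then internal Lucene doc ID. Purely illustrative --
# the real ordering would have to be confirmed in the Solr source.

def merge_shard_hits(shard_results, rows):
    """shard_results: {shard_id: [(internal_doc_id, score), ...]}"""
    merged = []
    for shard_id, hits in shard_results.items():
        for internal_doc_id, score in hits:
            merged.append((score, shard_id, internal_doc_id))
    # Sort by score descending; ties fall back to shard ID, then doc ID.
    merged.sort(key=lambda t: (-t[0], t[1], t[2]))
    return merged[:rows]

hits = {
    "shard1": [(3, 2.5), (7, 1.0)],
    "shard2": [(1, 2.5), (2, 0.5)],
}
print(merge_shard_hits(hits, 3))
# The two 2.5-scored docs tie; shard1's doc 3 sorts before shard2's doc 1.
```

With a stable, fully specified sort key like this, the same query always
returns the same page, which is exactly the property the tie-break exists
to provide.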

Best
Erick

On Tue, Jan 22, 2013 at 10:20 AM, SuoNayi <suonayi2...@163.com> wrote:
> Hi Erick, thanks for your detailed explanation.
>
> When the collecting shard combines the 24060 ID/score pairs into a
> master list, how does it choose the right 20 docs? What does that
> depend on?
> I assume the collecting shard sorts these docs by score and the 20
> docs with the highest scores are chosen. If some docs have the same
> score, how is their order decided?
> If it depends on the order of collection, then the same query might
> see different docs on the same page each time, mightn't it?
> Furthermore, if I want my query to sort by another field rather than
> the default score, what will the other nodes send to the collecting
> shard?
>
>
>
>
> Thanks,
> SuoNayi
>
>
>
>
>
> At 2013-01-22 19:57:32,"Erick Erickson" <erickerick...@gmail.com> wrote:
>>bq: does Solr need to load all the docs into RAM to calculate score and order
>>
>>You're very close. The query (and this is just like 3.x) is sent to
>>each shard. Let's say your page size is 20 (the &rows=20)
>>
>>Each node will need to keep a list of 8020 documents (400 * 20 + 20),
>>really just the ID and score, and send those ID/score pairs back to
>>the collecting shard. At that point, the collecting shard combines
>>the 24060 ID/score pairs into a master list, picks the right 20 docs
>>(positions 8000-8020 in the combined list), and then asks each shard
>>for the portion of those 20 that reside on it.
>>
>>"Deep paging" over a sharded setup is pretty expensive; Solr is
>>optimized for returning the top N docs where N is usually pretty
>>small...
>>
>>One minor nit: Solr doesn't load docs into RAM to calculate score,
>>it just peruses the indexed="true" data. All that stays in RAM is
>>the doc ID and score _until_ the document contents are assembled,
>>i.e. the raw data is only assembled for &rows docs, and only at the
>>very end...
>>
>>Best
>>Erick
>>
>>On Tue, Jan 22, 2013 at 2:47 AM, SuoNayi <suonayi2...@163.com> wrote:
>>> Dear list,
>>> I want to know the internal mechanism behind SolrCloud's
>>> distributed queries.
>>> AFAIK, distributed queries were supported before SolrCloud existed:
>>> users could specify shard URLs in the query parameters, and in that
>>> case we could distribute data by time interval. Is that what's
>>> called horizontal scalability based on history?
>>> Now SolrCloud does more, because it can discover the other shards
>>> (Solr instances) via ZooKeeper and distributes data based on a hash
>>> and mod of the doc's unique key.
>>> In both cases the requested Solr instance needs to scatter queries
>>> across the shards and gather the results at the end. This process
>>> looks like Map-Reduce.
>>> But what happens during the scattering and gathering? I have read
>>> the wiki, but no more details are available. I really hope someone
>>> can make this clear to me and give some links.
>>>
>>>
>>> Suppose there are 3 shards and 0 replicas in my Solr cloud, and
>>> each shard has 150 million docs. My client queries with q=*:* and
>>> outputs the results page by page. When the page number is very
>>> large, say the 400th page, does Solr need to load all the docs into
>>> RAM to calculate score and order?
>>>
>>>
>>> Sorry for the newbie question, and thanks for your time.
>>>
>>>
>>>
>>>
>>> Thanks
>>> SuoNayi
>>>
>>>
>>>
>>>