If there’s any chance of using Streaming Expressions for this rather than re-querying the data with CursorMark, it would solve a lot of these issues.
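For illustration, a minimal streaming-expression sketch that pulls a whole sorted result set in a single streamed request via the `/export` handler (the collection name `logs` and the field names here are hypothetical placeholders, not from the thread):

```
search(logs,
       q="*:*",
       fl="id,time,rank,uid",
       sort="id asc",
       qt="/export")
```

Note that `/export` requires every field in `fl` and `sort` to have docValues enabled, which fits the small log-style documents described below.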
> On Jun 12, 2019, at 3:26 PM, Mikhail Khludnev <m...@apache.org> wrote:
>
> Every cursorMark request goes through the full result set; previous results
> just bypass the scoring heap. So reducing the number of such requests should
> reasonably reduce the wall-clock time of exporting all results.
>
> On Wed, Jun 12, 2019 at 11:59 PM Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
>> Hello,
>>
>> One of our collections hates CursorMark, it really does. Under very heavy
>> load, the nodes can occasionally consume GBs of additional heap for no
>> clear reason immediately after downloading the entire corpus.
>>
>> Although the additional heap consumption is a separate problem that I hope
>> someone can shed some light on, there is another strange behaviour I would
>> like to see explained.
>>
>> Under light load and with a batch size of just a few hundred, the download
>> speed creeps along at, at most, 150 docs/s. But when I increase the batch
>> size to absurd numbers such as 20k, the speed jumps to 2.5k docs/s,
>> changing the total time from days to just a few hours.
>>
>> We see the heap and speed differences only with one big collection of
>> millions of small documents. They are just query, click and view logs with
>> additional metadata fields such as time, digests, ranks, dates, uids, view
>> time, etc.
>>
>> Can someone here shed some light on these vague subjects?
>>
>> Many thanks,
>> Markus
>>
>
> --
> Sincerely yours,
> Mikhail Khludnev
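Mikhail’s point is that every cursorMark request walks the full sorted result set server-side, so the batch size (`rows`) directly controls how many such walks the export costs. A minimal client-side sketch of the deep-paging loop (the `fetch` callable is a hypothetical stand-in for an HTTP request to `/select` with `cursorMark=<mark>&rows=<batch_size>`):

```python
def export_all(fetch, batch_size=20000):
    """Page through a full result set with cursorMark deep paging.

    `fetch(cursor_mark, rows)` returns a dict shaped like a Solr response:
    {"docs": [...], "nextCursorMark": "..."}. Fewer, larger requests mean
    fewer full-result-set traversals on the server.
    """
    docs = []
    cursor = "*"  # Solr's initial cursorMark token
    while True:
        resp = fetch(cursor, batch_size)
        docs.extend(resp["docs"])
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:  # unchanged mark => result set exhausted
            break
        cursor = next_cursor
    return docs
```

Note that cursorMark requires the `sort` to include the collection’s uniqueKey field as a tiebreaker, so every page boundary is deterministic across requests.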