If there’s any chance of using Streaming Expressions for this rather than re-querying the data with CursorMark, it would solve a lot of these issues.
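For illustration, a minimal streaming-expression sketch that pulls a whole sorted result set in a single streamed request via the `/export` handler (the collection name `logs` and the field names here are hypothetical placeholders, not from the thread):

```
search(logs,
       q="*:*",
       fl="id,time,rank,uid",
       sort="id asc",
       qt="/export")
```

Note that `/export` requires every field in `fl` and `sort` to have docValues enabled, which fits the small log-style documents described below.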
> On Jun 12, 2019, at 3:26 PM, Mikhail Khludnev <m...@apache.org> wrote:
>
> Every cursorMark request goes through the full result set; previous results
> just bypass the scoring heap. So reducing the number of such requests should
> reasonably reduce the wall-clock time of exporting all results.
>
> On Wed, Jun 12, 2019 at 11:59 PM Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
>> Hello,
>>
>> One of our collections hates CursorMark, it really does. Under very heavy
>> load, the nodes can occasionally consume GBs of additional heap for no
>> clear reason immediately after downloading the entire corpus.
>>
>> Although the additional heap consumption is a separate problem that I hope
>> someone can shed some light on, there is another strange behaviour I would
>> like to see explained.
>>
>> Under light load and with a batch size of just a few hundred, the download
>> speed creeps along at, at most, 150 docs/s. But when I increase the batch
>> size to absurd numbers such as 20k, the speed jumps to 2.5k docs/s,
>> changing the total time from days to just a few hours.
>>
>> We see the heap and speed differences only with one big collection of
>> millions of small documents. They are just query, click and view logs with
>> additional metadata fields such as time, digests, ranks, dates, uids, view
>> time, etc.
>>
>> Can someone here shed some light on these vague subjects?
>>
>> Many thanks,
>> Markus
>>
>
> --
> Sincerely yours,
> Mikhail Khludnev
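Mikhail’s point is that every cursorMark request walks the full sorted result set server-side, so the batch size (`rows`) directly controls how many such walks the export costs. A minimal client-side sketch of the deep-paging loop (the `fetch` callable is a hypothetical stand-in for an HTTP request to `/select` with `cursorMark=<mark>&rows=<batch_size>`):

```python
def export_all(fetch, batch_size=20000):
    """Page through a full result set with cursorMark deep paging.

    `fetch(cursor_mark, rows)` returns a dict shaped like a Solr response:
    {"docs": [...], "nextCursorMark": "..."}. Fewer, larger requests mean
    fewer full-result-set traversals on the server.
    """
    docs = []
    cursor = "*"  # Solr's initial cursorMark token
    while True:
        resp = fetch(cursor, batch_size)
        docs.extend(resp["docs"])
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:  # unchanged mark => result set exhausted
            break
        cursor = next_cursor
    return docs
```

Note that cursorMark requires the `sort` to include the collection’s uniqueKey field as a tiebreaker, so every page boundary is deterministic across requests.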