Hi Joel, thanks for your follow-up!

Indeed, that's my experience as well - the export handler streams
data fast enough. Though now that you mention batches, I'm curious
whether that batch size is configurable.

The higher-level issue is that I need to show results to the user as
quickly as possible. For example, imagine a rollup on a relatively
high-cardinality field, but with lots of documents per user as well. I
want to show counters as soon as they come up, instead of waiting
until I have all of them.
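
To make that concrete, the kind of request I have in mind looks like
this (the test collection matches the earlier example in this thread;
the user_s field is made up for illustration):

```
curl --data-urlencode 'expr=rollup(search(test,
                                          q="*:*",
                                          fl="user_s",
                                          sort="user_s asc",
                                          qt="/export"),
                                   over="user_s",
                                   count(*))' \
     http://localhost:8983/solr/test/stream
```

Since documents arrive sorted by user_s, each counter is complete as
soon as the stream moves past that user, so in principle it could be
emitted to the client right away.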

To have results coming in as quickly as possible, I need data to be
streamed quickly (latency-wise) between source and decorators, as well
as from the Solr node receiving the initial request to the client
(UI).

The first part seems to already be happening in my tests (though I've
heard complaints that it doesn't - I'll come back to it if I
misunderstood something), but I can't get partial results to the HTTP
client issuing the original request.
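
For that second part, the "flush every N tuples or M milliseconds"
idea from my earlier message could look roughly like the sketch below.
This is plain Java with made-up names (PeriodicFlusher is not an
actual Solr class), and it assumes chunked encoding is enabled so that
each flush() actually becomes a chunk on the wire:

```java
// Illustrative sketch only: made-up names, not actual Solr classes.
// Policy: flush after N tuples or after M milliseconds, whichever
// comes first. Only useful with chunked transfer encoding, so that
// each flush() pushes a chunk to the client.
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class PeriodicFlusher {
    private final Writer out;
    private final int flushEveryTuples;   // N: flush after this many tuples
    private final long flushEveryMillis;  // M: ...or after this much time
    private int tuplesSinceFlush = 0;
    private long lastFlushTime = System.currentTimeMillis();

    public PeriodicFlusher(Writer out, int flushEveryTuples, long flushEveryMillis) {
        this.out = out;
        this.flushEveryTuples = flushEveryTuples;
        this.flushEveryMillis = flushEveryMillis;
    }

    public void writeTuple(String tupleJson) throws IOException {
        out.write(tupleJson);
        tuplesSinceFlush++;
        long now = System.currentTimeMillis();
        if (tuplesSinceFlush >= flushEveryTuples
                || now - lastFlushTime >= flushEveryMillis) {
            out.flush(); // with chunked encoding, this reaches the client now
            tuplesSinceFlush = 0;
            lastFlushTime = now;
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        PeriodicFlusher flusher = new PeriodicFlusher(sink, 2, 1000);
        flusher.writeTuple("{\"user_s\":\"a\",\"count(*)\":3}");
        flusher.writeTuple("{\"user_s\":\"b\",\"count(*)\":1}");
        System.out.println(sink.toString());
    }
}
```

The two parameters could be computed per request, or at least set
globally on the /stream handler, so regular searches and facets keep
their existing buffering.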

Does this clarify my issue?

Thanks again and best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Jan 17, 2018 at 8:59 PM, Joel Bernstein <joels...@gmail.com> wrote:
> I'm not sure I understand the issue fully. From a streaming standpoint, you
> get real streamed data from the /export handler. When you use the export
> handler the bitset for the search results is materialized in memory, but
> all results are sorted/streamed in batches. This allows the export handler
> to export result sets of any size.
>
> The underlying buffer sizes are really abstracted away and not meant to
> be dealt with.
>
> What's the higher level issue you are concerned with?
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Jan 17, 2018 at 8:54 AM, Radu Gheorghe <radu.gheor...@sematext.com>
> wrote:
>
>> Hello,
>>
>> I have some updates on this, but it's still not very clear to me how
>> to move forward.
>>
>> The good news is that, between sources and decorators, data seems to
>> be really streamed. I hope I tested this the right way, by simply
>> adding a log message to ReducerStream saying "hey, I got this tuple".
>> Now I have two nodes, nodeA with data and nodeB with a dummy
>> collection. If I hit nodeB's /stream endpoint and ask it for, say, a
>> unique() to wrap the previously mentioned expression (with 1s sleeps),
>> I see a log from ReducerStream every second. Good.
>>
>> Now, the final result (to the client, via curl) only gets to me after
>> N seconds (where N is the number of results I get). I did some more
>> digging on this front, too. Let's assume we have chunked encoding
>> re-enabled (that's a must) and no other change (if I flush() the
>> FastWriter, say, after every tuple, then I get every tuple as it's
>> computed, but I'm trying to explore the buffers). I've noticed the
>> following:
>> - the first response comes after ~64K, then I get chunks of 32K each
>> - at this point, if I set response.setBufferSize() in
>> HttpSolrCall.writeResponse() to a small size (say, 128), I get the
>> first reply after 32K and then 8K chunks
>> - I thought that maybe in this context I could lower BUFSIZE in
>> FastWriter, but that didn't seem to make any change :(
>>
>> That said, I'm not sure it's worth looking into these buffers any
>> deeper, because shrinking them might negatively affect other results
>> (e.g. regular searches or facets). It sounds like the way forward
>> would be that manual flushing, with chunked encoding enabled. I could
>> imagine adding some parameters in the lines of "flush every N tuples
>> or M milliseconds", that would be computed per-request, or at least
>> globally to the /stream handler.
>>
>> What do you think? Would such a patch be welcome, to add these
>> parameters? But it still requires chunked encoding - would reverting
>> SOLR-8669 be a problem? Or maybe there's a more elegant way to enable
>> chunked encoding, maybe only for streams?
>>
>> Best regards,
>> Radu
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On Mon, Jan 15, 2018 at 10:58 AM, Radu Gheorghe
>> <radu.gheor...@sematext.com> wrote:
>> > Hello fellow solr-users!
>> >
>> > Currently, if I do an HTTP request to receive some data via streaming
>> > expressions, like:
>> >
>> > curl --data-urlencode 'expr=search(test,
>> >                                    q="foo_s:*",
>> >                                    fl="foo_s",
>> >                                    sort="foo_s asc",
>> >                                    qt="/export")'
>> > http://localhost:8983/solr/test/stream
>> >
>> > I get all results at once. This is more obvious if I simply introduce
>> > a one-second sleep in CloudSolrStream: with three documents, the
>> > request takes about three seconds, and I get all three docs after
>> > three seconds.
>> >
>> > Instead, I would like to get documents in a more "streaming" way. For
>> > example, after X seconds give me what you already have. Or if an
>> > Y-sized buffer fills up, give me all the tuples you have, then resume.
>> >
>> > Any ideas/opinions in terms of how I could achieve this? With or
>> > without changing Solr's code?
>> >
>> > Here's what I have so far:
>> > - this is normal with non-chunked HTTP/1.1. You get all results at
>> > once. If I revert this patch[1] and get Solr to use chunked encoding,
>> > I get partial results every... what seems to be a certain size between
>> > 16KB and 32KB
>> > - I couldn't find a way to manually change this... what I assume is a
>> > buffer size, but failed so far. I've tried changing Jetty's
>> > response.setBufferSize() in HttpSolrCall (maybe the wrong place to do
>> > it?) and also tried changing the default 8KB buffer in FastWriter
>> > - manually flushing the writer (in JSONResponseWriter) gives the
>> > expected results (in combination with chunking)
>> >
>> > The thing is, even if I manage to change the buffer size, I assume
>> > that will apply to all requests (not just streaming expressions). I
>> > assume that ideally it would be configurable per request. As for
>> > manual flushing, that would require changes to the streaming
>> > expressions themselves. Would that be the way to go? What do you
>> > think?
>> >
>> > [1] https://issues.apache.org/jira/secure/attachment/12787283/SOLR-8669.patch
>> >
>> > Best regards,
>> > Radu
>> > --
>> > Performance Monitoring * Log Analytics * Search Analytics
>> > Solr & Elasticsearch Support * http://sematext.com/
>>