[ 
https://issues.apache.org/jira/browse/SOLR-14713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181534#comment-17181534
 ] 

Cao Manh Dat commented on SOLR-14713:
-------------------------------------

I will post a report generated by solr-bench, but our internal run by 
[~sarkaramr...@gmail.com] did not show any performance regression.

> Single thread on streaming updates
> ----------------------------------
>
>                 Key: SOLR-14713
>                 URL: https://issues.apache.org/jira/browse/SOLR-14713
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Or greatly simplify SolrCmdDistributor
> h2. Current way for fan out updates of Solr
> Currently, on receiving an updateRequest, Solr creates new 
> UpdateProcessors for handling that request, then parses the documents from 
> the request one by one and lets the processors handle each of them.
> {code:java}
> onReceiving(UpdateRequest update):
>   processors = createNewProcessors();
>   for (Document doc : update) {
>     processors.handle(doc);
>   }
> {code}
> Let’s say the number of replicas in the current shard is N; the updateProcessor 
> will create N-1 queues and runners, one for each other replica.
>  A runner is basically a thread that dequeues updates from its corresponding 
> queue and sends them to the corresponding replica endpoint.
>  Note 1: All runners share the same client, hence the same connection pool and 
> the same thread pool.
>  Note 2: A runner sends all documents of its UpdateRequest in a single 
> HTTP POST request (to reduce the number of threads handling requests on 
> the other side). Therefore its lifetime equals the total time of handling its 
> UpdateRequest. Below is a typical activity that happens in a runner's life 
> cycle.
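
For readers who prefer code, the queue-and-runner arrangement described above might look roughly like the following Java sketch. It is not Solr's actual implementation and it is not the diagram from the issue. The class QueueRunnerFanOut, the END marker, and the in-memory lists that stand in for replica endpoints are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class QueueRunnerFanOut {
    // Hypothetical end-of-request marker; compared by identity so no real document matches it.
    private static final String END = new String("END");

    // "replicas" are the N-1 other replicas. Returns what each runner forwarded.
    static Map<String, List<String>> fanOut(List<String> replicas, List<String> docs) {
        ExecutorService pool = Executors.newCachedThreadPool(); // shared by all runners
        Map<String, BlockingQueue<String>> queues = new LinkedHashMap<>();
        Map<String, List<String>> received = new LinkedHashMap<>();
        for (String replica : replicas) {
            BlockingQueue<String> queue = new LinkedBlockingQueue<>();
            List<String> sink = Collections.synchronizedList(new ArrayList<>());
            queues.put(replica, queue);
            received.put(replica, sink);
            // One runner per replica; it lives until the whole request is drained.
            pool.submit(() -> {
                try {
                    for (String doc = queue.take(); doc != END; doc = queue.take()) {
                        sink.add(doc); // stands in for writing into one long HTTP POST
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            // The request-handling thread only enqueues; runners do the sending.
            for (String doc : docs) {
                for (BlockingQueue<String> q : queues.values()) {
                    q.put(doc);
                }
            }
            for (BlockingQueue<String> q : queues.values()) {
                q.put(END);
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }
}
```

Even in this reduced form, one thread and one queue exist per other replica for the full duration of the request, which is the cost the next section discusses.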
> h2. Problems of current approach
> The current approach has two problems:
>  - Problem 1: It uses lots of threads for fan-out requests.
>  - Problem 2, which is more important: it is very complex. Solr also uses 
> ConcurrentUpdateSolrClient (CUSC for short) for that. The CUSC implementation 
> allows a single queue to be served by multiple runners (although we use at 
> most one runner), and this raises the complexity of the whole flow all the 
> way up to the top. A single fix for one problem can cause multiple problems 
> later; e.g., in SOLR-13975, while trying to handle the case where the other 
> endpoint hangs for too long, we introduced a bug that lets the runner keep 
> running even after the updateRequest is fully handled in the leader.
> h2. Doing everything in single thread
> Since we already support sending requests asynchronously, why don’t we let 
> the main thread that is handling the update request send updates to all the 
> other replicas, without the need for runners or queues? The code would be 
> something like this:
> {code:java}
>  class UpdateProcessor:
>    Map<String, OutputStream> pendingOutStreams
>    
>    func handleAddDoc(doc):
>      for (replica : replicas):
>        pendingOutStreams.get(replica).send(doc)
>    
>    func onEndUpdateRequest():
>      pendingOutStreams.values().forEach(out -> closeAndHandleResponse(out))
> {code}
>  
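
As a rough, compilable rendering of the pseudocode above, the following Java sketch keeps one open output stream per replica and writes every document to all of them from the calling thread. It rests on assumptions: in-memory ByteArrayOutputStream buffers stand in for the bodies of asynchronous HTTP requests, documents are plain strings, and the class name SingleThreadFanOut and the newline framing are hypothetical. Response handling is reduced to returning what was written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SingleThreadFanOut {
    // One pending stream per other replica (in-memory stand-in for an open request body).
    private final Map<String, ByteArrayOutputStream> pendingOutStreams = new LinkedHashMap<>();

    SingleThreadFanOut(List<String> replicas) {
        for (String replica : replicas) {
            pendingOutStreams.put(replica, new ByteArrayOutputStream());
        }
    }

    // Called once per document by the thread that handles the update request.
    void handleAddDoc(String doc) {
        byte[] bytes = (doc + "\n").getBytes(StandardCharsets.UTF_8);
        for (ByteArrayOutputStream out : pendingOutStreams.values()) {
            out.write(bytes, 0, bytes.length); // no runner, no queue
        }
    }

    // Called once when the request ends: close every stream and report what was sent.
    Map<String, String> onEndUpdateRequest() {
        Map<String, String> sent = new LinkedHashMap<>();
        for (Map.Entry<String, ByteArrayOutputStream> entry : pendingOutStreams.entrySet()) {
            try {
                entry.getValue().close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            sent.put(entry.getKey(), new String(entry.getValue().toByteArray(), StandardCharsets.UTF_8));
        }
        return sent;
    }
}
```

A caller would construct it with the list of other replicas, call handleAddDoc for each parsed document, and call onEndUpdateRequest once at the end. No additional threads are created at any point.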
> By doing this we will use fewer threads, and the code becomes much simpler 
> and cleaner. Of course, there will be some increase in the time for handling 
> an updateRequest, since we send the updates serially instead of concurrently. 
> More formally, it will be like this:
> {code:java}
>  oldTime = timeForIndexing(update) + timeForSendingUpdates(update)
>  newTime = timeForIndexing(update) + (N-1) * timeForSendingUpdates(update)
> {code}
> But I believe that timeForIndexing is much larger than timeForSendingUpdates, 
> so we do not really need to be concerned about this. Even if that turns out 
> to be a problem, users can simply use more threads for indexing.
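
To make the two formulas above concrete, here is a small Java sketch that evaluates them. The class and method names are hypothetical, and the numbers in the usage note are made-up illustrations, not measurements from Solr or solr-bench.

```java
class FanOutTimeModel {
    // oldTime = timeForIndexing + timeForSendingUpdates (runners send concurrently)
    static double oldTime(double indexingMs, double sendingMs) {
        return indexingMs + sendingMs;
    }

    // newTime = timeForIndexing + (N-1) * timeForSendingUpdates (one thread sends serially)
    static double newTime(double indexingMs, double sendingMs, int replicasInShard) {
        return indexingMs + (replicasInShard - 1) * sendingMs;
    }
}
```

With an assumed 100 ms of indexing, 5 ms of sending, and N = 3, oldTime gives 105 ms and newTime gives 110 ms. Under those assumed numbers the serial approach costs about 5% more per request. The gap grows with N and with the ratio of sending time to indexing time.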



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
