[ https://issues.apache.org/jira/browse/SOLR-14713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181534#comment-17181534 ]
Cao Manh Dat commented on SOLR-14713:
-------------------------------------

I will post a report generated by solr-bench, but our internal run by
[~sarkaramr...@gmail.com] showed no performance regression.

> Single thread on streaming updates
> ----------------------------------
>
>                 Key: SOLR-14713
>                 URL: https://issues.apache.org/jira/browse/SOLR-14713
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Or greatly simplify SolrCmdDistributor
> h2. Current way for fan out updates of Solr
> Currently, on receiving an updateRequest, Solr creates a new set of
> UpdateProcessors for handling that request, then parses the documents
> from the request one by one and lets the processors handle each of them.
> {code:java}
> onReceiving(UpdateRequest update):
>   processors = createNewProcessors();
>   for (Document doc : update) {
>     processors.handle(doc);
>   }
> {code}
> Let’s say the number of replicas in the current shard is N; the
> updateProcessor will create N-1 queues and runners, one for each other
> replica.
> A runner is basically a thread that dequeues updates from its queue and
> sends them to the corresponding replica endpoint.
> Note 1: all runners share the same client, and hence the same connection
> pool and thread pool.
> Note 2: a runner sends all documents of its UpdateRequest in a single
> HTTP POST request (to reduce the number of threads handling requests on
> the other side). Therefore its lifetime equals the total time of handling
> its UpdateRequest. Below is a typical activity that happens in a runner's
> life cycle.
> h2. Problems of current approach
> The current approach has two problems:
> - Problem 1: it uses many threads to fan out requests.
> - Problem 2, which is more important: it is very complex.
> Solr also uses ConcurrentUpdateSolrClient (CUSC for short) for the
> fan-out. The CUSC implementation allows a single queue to be served by
> multiple runners (although we only ever use one runner), and this raises
> the complexity of the whole flow all the way up. A single fix for one
> problem can cause multiple problems later, e.g. in SOLR-13975, while
> trying to handle the case where the other endpoint hangs for a long time,
> we introduced a bug that lets the runner keep running even when the
> updateRequest is fully handled in the leader.
> h2. Doing everything in single thread
> Since we already support sending requests in an async manner, why don’t
> we let the main thread that is handling the update request send the
> updates to all other replicas, without the need for runners or queues?
> The code will be something like this:
> {code:java}
> class UpdateProcessor:
>   Map<String, OutputStream> pendingOutStreams
>
>   func handleAddDoc(doc):
>     for (replica : replicas):
>       pendingOutStreams.get(replica).send(doc)
>
>   func onEndUpdateRequest():
>     pendingOutStreams.values().forEach(out ->
>       closeAndHandleResponse(out))
> {code}
>
> By doing this we will use fewer threads, and the code will be much simpler
> and cleaner. Of course, there will be some increase in the time for
> handling an updateRequest, since we send to the replicas serially instead
> of concurrently. More formally:
> {code:java}
> oldTime = timeForIndexing(update) + timeForSendingUpdates(update)
> newTime = timeForIndexing(update) + (N-1) * timeForSendingUpdates(update)
> {code}
> But I believe that timeForIndexing is much larger than
> timeForSendingUpdates, so we do not really need to be concerned about
> this. Even if that is really a problem, users can simply create more
> threads for indexing.
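
The single-thread fan-out and the cost formula quoted above can be sketched in plain Java. This is only an illustration under stated assumptions, not Solr's actual code: the class name SingleThreadFanOut, the use of String documents, and the in-memory ByteArrayOutputStream stand-ins for per-replica HTTP request bodies are all hypothetical. The oldTime and newTime helpers simply restate the formula from the description.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the proposed processor; not Solr's actual API. */
class SingleThreadFanOut {
    // One pending request body per replica, written only by the calling thread.
    private final Map<String, OutputStream> pendingOutStreams = new LinkedHashMap<>();

    SingleThreadFanOut(Map<String, OutputStream> replicaStreams) {
        pendingOutStreams.putAll(replicaStreams);
    }

    /** Writes the document to every replica stream, one after another. */
    void handleAddDoc(String doc) {
        byte[] bytes = (doc + "\n").getBytes(StandardCharsets.UTF_8);
        for (OutputStream out : pendingOutStreams.values()) {
            try {
                out.write(bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Ends the update request: closes every stream and returns how many were closed. */
    int onEndUpdateRequest() {
        int closed = 0;
        for (OutputStream out : pendingOutStreams.values()) {
            try {
                out.close(); // a real client would read the replica's response here
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            closed++;
        }
        pendingOutStreams.clear();
        return closed;
    }

    /** Old model from the description: sends overlap, so they are counted once. */
    static long oldTime(long indexingMs, long sendingMs) {
        return indexingMs + sendingMs;
    }

    /** New model from the description: sends to the N-1 other replicas are serial. */
    static long newTime(long indexingMs, long sendingMs, int replicas) {
        return indexingMs + (long) (replicas - 1) * sendingMs;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream replicaB = new ByteArrayOutputStream();
        ByteArrayOutputStream replicaC = new ByteArrayOutputStream();
        Map<String, OutputStream> streams = new LinkedHashMap<>();
        streams.put("replica-b", replicaB);
        streams.put("replica-c", replicaC);

        SingleThreadFanOut fanOut = new SingleThreadFanOut(streams);
        for (String doc : List.of("doc1", "doc2")) {
            fanOut.handleAddDoc(doc);
        }
        int closed = fanOut.onEndUpdateRequest();
        System.out.println("closed " + closed + " streams, "
                + replicaB.size() + " bytes buffered for replica-b");
        System.out.println("oldTime=" + oldTime(100, 10)
                + " newTime=" + newTime(100, 10, 3));
    }
}
```

The sketch leaves out error handling across replicas: if one stream fails, the remaining ones are not written or closed, which a real implementation would need to address.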