[ https://issues.apache.org/jira/browse/SOLR-14713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181315#comment-17181315 ]
David Smiley commented on SOLR-14713:
-------------------------------------

Okay; I'm just reviewing at a very high level, without going down to the code level. What's nice about CUSC is that its client (the caller) needn't bother with any batching. CUSC streams/batches internally, which further means only a single instantiation of the Solr-side request & chain / related glue. I may be misunderstanding the intent of this PR, but I'm slightly concerned that your simplification goal results in a lack of streaming/batching, which is critical for performant indexing. I hope https://github.com/TheSearchStack/solr-bench can help us ensure this PR doesn't hurt anything.

> Single thread on streaming updates
> ----------------------------------
>
>                 Key: SOLR-14713
>                 URL: https://issues.apache.org/jira/browse/SOLR-14713
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Or greatly simplify SolrCmdDistributor
> h2. Current way of fanning out updates in Solr
> Currently, on receiving an updateRequest, Solr creates new
> UpdateProcessors for handling that request, then parses the documents
> from the request one by one and lets the processors handle each of them.
> {code:java}
> onReceiving(UpdateRequest update):
>   processors = createNewProcessors();
>   for (Document doc : update) {
>     processors.handle(doc)
>   }
> {code}
> Let's say the number of replicas in the current shard is N; the
> updateProcessor will create N-1 queues and runners, one for each other
> replica.
> A runner is basically a thread that dequeues updates from its
> corresponding queue and sends them to the corresponding replica endpoint.
> Note 1: all runners share the same client, hence the same connection pool
> and the same thread pool.
> Note 2: A runner will send all documents of its UpdateRequest in a single
> HTTP POST request (to reduce the number of threads for handling requests on
> the other side).
> Therefore its lifetime equals the total time of handling its
> UpdateRequest. Below is a typical activity that happens in a runner's life
> cycle.
> h2. Problems of the current approach
> The current approach has two problems:
>  - Problem 1: It uses lots of threads for fanning out requests.
>  - Problem 2, which is more important: it is very complex. Solr also uses
> ConcurrentUpdateSolrClient (CUSC for short) for this. The CUSC
> implementation allows a single queue with multiple runners for that queue
> (although we use at most one runner), which raises the complexity of the
> whole flow, all the way to the top. A single fix for one problem can cause
> multiple problems later; e.g., in SOLR-13975, while trying to handle the
> case where the other endpoint hangs for a long time, we introduced a bug
> that lets the runner keep running even after the updateRequest has been
> fully handled on the leader.
> h2. Doing everything in a single thread
> Since we already support sending requests asynchronously, why don't we let
> the main thread that is handling the update request send updates to all
> the other replicas, without the need for runners or queues? The code would
> be something like this:
> {code:java}
> Class UpdateProcessor:
>    Map<String, OutputStream> pendingOutStreams
>
>    func handleAddDoc(doc):
>       for (replica: replicas):
>           pendingOutStreams.get(replica).send(doc)
>
>    func onEndUpdateRequest():
>       pendingOutStreams.values().forEach(out ->
>           closeAndHandleResponse(out))
> {code}
>
> By doing this we will use fewer threads, and the code is much simpler and
> cleaner. Of course, there will be some increase in the time for handling
> an updateRequest, since we are doing it serially instead of concurrently.
> More formally, it will look like this:
> {code:java}
> oldTime = timeForIndexing(update) + timeForSendingUpdates(update)
> newTime = timeForIndexing(update) + (N-1) * timeForSendingUpdates(update)
> {code}
> But I believe that timeForIndexing is much larger than
> timeForSendingUpdates, so we do not really need to be concerned about
> this. Even if it does turn out to be a problem, users can simply create
> more threads for indexing.
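
The single-threaded fan-out proposed in the issue description can be sketched in plain Java as follows. This is only a minimal illustration under stated assumptions, not Solr's implementation: the class SingleThreadFanOut and its demo helper are hypothetical, in-memory ByteArrayOutputStream objects stand in for the open HTTP request bodies to each replica, documents are plain strings, and closeAndHandleResponse from the pseudocode is reduced to closing the stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the processor in the issue; not a Solr class. */
final class SingleThreadFanOut {
    // One open stream per other replica, keyed by replica name (assumed layout).
    private final Map<String, OutputStream> pendingOutStreams = new LinkedHashMap<>();

    SingleThreadFanOut(Map<String, ? extends OutputStream> replicaStreams) {
        pendingOutStreams.putAll(replicaStreams);
    }

    /** Called once per document by the thread that handles the update request. */
    void handleAddDoc(String doc) throws IOException {
        byte[] payload = (doc + "\n").getBytes(StandardCharsets.UTF_8);
        // No queues or runner threads: write to each replica stream in turn.
        for (OutputStream out : pendingOutStreams.values()) {
            out.write(payload);
        }
    }

    /** Called at the end of the request; a real version would also read responses. */
    void onEndUpdateRequest() throws IOException {
        for (OutputStream out : pendingOutStreams.values()) {
            out.close();
        }
        pendingOutStreams.clear();
    }

    /** Demo helper: fans docs out to in-memory sinks and returns what each one received. */
    static Map<String, String> demo(List<String> replicas, List<String> docs) {
        Map<String, ByteArrayOutputStream> sinks = new LinkedHashMap<>();
        for (String replica : replicas) {
            sinks.put(replica, new ByteArrayOutputStream());
        }
        SingleThreadFanOut processor = new SingleThreadFanOut(sinks);
        try {
            for (String doc : docs) {
                processor.handleAddDoc(doc);
            }
            processor.onEndUpdateRequest();
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
        Map<String, String> received = new LinkedHashMap<>();
        for (Map.Entry<String, ByteArrayOutputStream> entry : sinks.entrySet()) {
            received.put(entry.getKey(),
                    new String(entry.getValue().toByteArray(), StandardCharsets.UTF_8));
        }
        return received;
    }
}
```

With two replicas and two documents, SingleThreadFanOut.demo(Arrays.asList("replica1", "replica2"), Arrays.asList("doc1", "doc2")) returns a map in which each replica holds both documents in order. The sketch makes the trade-off from the description visible: every document is written N-1 times by the same thread, so sending cost grows linearly with the number of other replicas.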