large scale indexing issues / single threaded bottleneck
Hi everyone,

I'm looking for some help with Solr indexing issues on a large scale.

We are indexing a few terabytes per month on a sizeable Solr cluster (8 masters serving writes, 16 slaves serving reads). After a certain amount of tuning we got to the point where a single Solr instance can handle an index size of 100GB without much trouble, but beyond that we are starting to observe noticeable delays on index flush, and they keep getting larger. See the attached picture for details; it was captured for a single JVM on a single machine.

We are posting data in 8 threads using the javabin format, committing every 5K documents, with merge factor 20 and a RAM buffer size of about 384MB. From the picture it can be seen that a single-threaded index flushing code path kicks in on every commit and blocks all other indexing threads. The hardware is decent (12 physical / 24 virtual cores per machine) and it is mostly idle while the index is flushing: very little CPU utilization and disk I/O (<5%), with the exception of the single CPU core that actually does the index flush (95% CPU, 5% I/O wait).

My questions are:

1) Will the Solr changes from the real-time branch help to resolve these issues? I was reading http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html and it looks like we have exactly the same problem.

2) What would be the best way to port these (and only these) changes to 3.4.0? I tried to dig into the branching and revisions, but got lost quickly. I tried something like "svn diff […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not sure if it's even possible to merge these into 3.4.0.

3) What would you recommend for production 24/7 use? 3.4.0?

4) Is there a workaround that can be used? I have also listed the stack trace below.

Thank you!
Roman

P.S. This single "index flushing" thread spends 99% of its time in "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then the merge seems to go quickly. I looked it up, and it looks like the intent here is deleting old commit points (we are keeping only 1 non-optimized commit point per config). Not sure why it is taking that long.
pool-2-thread-1 [RUNNABLE] CPU time: 3:31
java.nio.Bits.copyToByteArray(long, Object, long, long)
java.nio.DirectByteBuffer.get(byte[], int, int)
org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, int)
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.TermInfosReader.<init>(Directory, String, FieldInfos, int, int)
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentReader, Directory, SegmentInfo, int, int)
org.apache.lucene.index.SegmentReader.get(boolean, Directory, SegmentInfo, int, boolean, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean, int, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool, List)
org.apache.lucene.index.IndexWriter.doFlush(boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
org.apache.lucene.index.IndexWriter.closeInternal(boolean)
org.apache.lucene.index.IndexWriter.close(boolean)
org.apache.lucene.index.IndexWriter.close()
org.apache.solr.update.SolrIndexWriter.close()
org.apache.solr.update.DirectUpdateHandler2.closeWriter()
org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
java.util.concurrent.Executors$RunnableAdapter.call()
java.util.concurrent.FutureTask$Sync.innerRun()
java.util.concurrent.FutureTask.run()
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()
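For reference, here is a minimal SolrJ sketch of the kind of client-side setup described above (javabin via BinaryRequestWriter, several indexing threads, an explicit commit every 5K documents). The master URL, field names, and the synthetic documents are illustrative assumptions, not the actual indexing code:

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // hypothetical master URL; the real cluster layout is described in the post above
        final CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://master1:8983/solr");
        server.setRequestWriter(new BinaryRequestWriter());   // post updates in javabin format

        final AtomicLong counter = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(8);  // 8 indexing threads
        for (int t = 0; t < 8; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        for (long i = counter.incrementAndGet(); i <= 1000000; i = counter.incrementAndGet()) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", "doc-" + i);                     // assumed unique-key field
                            doc.addField("body", "synthetic content " + i);     // assumed text field
                            server.add(doc);
                            if (i % 5000 == 0) {
                                server.commit();   // explicit commit every 5K documents
                            }
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}

With this kind of client, every commit triggers the flush path shown in the stack trace above, which is where the single-threaded applyDeletes work shows up.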
Re: large scale indexing issues / single threaded bottleneck
I'm wondering if this is relevant: https://issues.apache.org/jira/browse/LUCENE-2680 - Improve how IndexWriter flushes deletes against existing segments

Roman

On Fri, Oct 28, 2011 at 11:38 AM, Roman Alekseenkov wrote:
> [quoted original message trimmed; see the first post above]
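To make the delete-application cost concrete: in Lucene 3.x, an add that carries a unique-key term is buffered as a delete-then-add, and those buffered delete terms are what applyDeletes() has to resolve against every existing segment at flush time. A minimal sketch of the two code paths (directory path, analyzer, and field names are assumptions for illustration):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AddVsUpdate {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig conf =
            new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/tmp/test-index")), conf);

        Document doc = new Document();
        doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.ANALYZED));

        // Plain add: no delete term is buffered, so flushing does not have to
        // scan existing segments for this document.
        writer.addDocument(doc);

        // Update by unique key: buffers a delete-by-term that applyDeletes()
        // must later resolve against every existing segment. This appears to be
        // the path Solr's update handler takes when documents are overwritten by id.
        writer.updateDocument(new Term("id", "doc-1"), doc);

        writer.close();
    }
}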
Re: large scale indexing issues / single threaded bottleneck
Guys, thank you for all the replies. I think I figured out a partial solution to the problem on Friday night.

Adding a whole bunch of debug statements to the info stream showed that every document follows the "update document" path instead of the "add document" path. That means all document IDs end up in the "pending deletes" queue, and Solr has to rescan its index on every commit for potential deletions. This is single-threaded and seems to get progressively slower as the index grows.

Adding overwrite=false to the URL for the /update handler did NOT help, as my debug statements showed that documents still go to the updateDocument() function with a non-null deleteTerm. So, as a temporary solution, I hacked Lucene a little bit and set deleteTerm=null at the beginning of updateDocument(), and it no longer calls applyDeletes(). This gave a 6-8x performance boost, and now we can index about 9 million documents/hour (producing 20GB of index every hour). Right now it's at 1TB index size and growing, without noticeable degradation of the indexing speed.

This is decent, but the 24-core machine is still barely utilized :) Now I think it's hitting a merge bottleneck, where all indexing threads are being paused. And ConcurrentMergeScheduler with 4 threads is not helping much. I guess the changes on trunk would definitely help, but we will likely stay on 3.4.

I will dig more into the issue on Monday. I'm really curious to see why "overwrite=false" didn't help, but the hack did.

Once again, thank you for the answers and recommendations.

Roman
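For context, the Lucene-level equivalent of the writer settings discussed in this thread (RAM buffer of ~384MB, merge factor 20, ConcurrentMergeScheduler with 4 merge threads) looks roughly like this against the 3.4 API. In Solr these would normally be set through solrconfig.xml rather than in code, so this is only an illustrative sketch; the directory path and analyzer are assumptions:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterTuning {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig conf =
            new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));

        conf.setRAMBufferSizeMB(384.0);            // RAM buffer size from the original post

        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMergeFactor(20);            // merge factor from the original post
        conf.setMergePolicy(mergePolicy);

        ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
        scheduler.setMaxThreadCount(4);            // 4 concurrent merge threads, as tried above
        conf.setMergeScheduler(scheduler);

        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/tmp/test-index")), conf);
        writer.close();
    }
}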
RE: large scale indexing issues / single threaded bottleneck
We have a rate of 2K small docs/sec, which translates into 90 GB/day of index space. You should be fine.

Roman

Awasthi, Shishir wrote:
>
> Roman,
> How frequently do you update your index? I have a need to do real-time
> add/delete to SOLR documents at a rate of approximately 20/min.
> The total number of documents is in the range of 4 million. Will there
> be any performance issues?
>
> Thanks,
> Shishir
>
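For a rough sense of scale: 20 documents/minute is about 1,200 documents/hour, while the setup described earlier in this thread sustains roughly 2,000 documents/second, i.e. over 7 million documents/hour, so the load in question is several orders of magnitude below what already works here.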