Thanks, Shawn, for the quick reply. Our indexes are running on SSD, so a maxThreadCount of 3 should be OK. Any recommendation on bumping it up?
I guess we will have to run optimize for the entire solr cloud and see if we can reclaim space.

Thanks,
Rishi.

-----Original Message-----
From: Shawn Heisey <apa...@elyograg.org>
To: solr-user <solr-user@lucene.apache.org>
Sent: Fri, Apr 17, 2015 6:22 pm
Subject: Re: Solr Cloud reclaiming disk space from deleted documents

On 4/17/2015 2:15 PM, Rishi Easwaran wrote:
> Running into an issue and wanted to see if anyone had some suggestions.
> We are seeing this with both solr 4.6 and 4.10.3 code.
> We are running an extremely update heavy application, with millions of writes and deletes happening to our indexes constantly. An issue we are seeing is that solr cloud reclaiming the disk space that can be used for new inserts, by cleanup up deletes.
>
> We used to run optimize periodically with our old multicore set up, not sure if that works for solr cloud.
>
> Num Docs:28762340
> Max Doc:48079586
> Deleted Docs:19317246
>
> Version 1429299216227
> Gen 16525463
> Size 109.92 GB
>
> In our solrconfig.xml we use the following configs.
>
> <indexConfig>
>   <!-- Values here affect all index writers and act as a default unless overridden. -->
>   <useCompoundFile>false</useCompoundFile>
>   <maxBufferedDocs>1000</maxBufferedDocs>
>   <maxMergeDocs>2147483647</maxMergeDocs>
>   <maxFieldLength>10000</maxFieldLength>
>
>   <mergeFactor>10</mergeFactor>
>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"/>
>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>     <int name="maxThreadCount">3</int>
>     <int name="maxMergeCount">15</int>
>   </mergeScheduler>
>   <ramBufferSizeMB>64</ramBufferSizeMB>
>
> </indexConfig>

This part of my response won't help the issue you wrote about, but it can affect performance, so I'm going to mention it.

If your indexes are stored on regular spinning disks, reduce mergeScheduler/maxThreadCount to 1. If they are stored on SSD, then a value of 3 is OK. Spinning disks cannot do seeks (read/write head moves) fast enough to handle multiple merging threads properly. All the seek activity required will really slow down merging, which is a very bad thing when your indexing load is high. SSD disks do not have to seek, so multiple threads are OK there.

An optimize is the only way to reclaim all of the disk space held by deleted documents. Over time, as segments are merged automatically, deleted doc space will be automatically recovered, but it won't be perfect, especially as segments are merged multiple times into very large segments.

If you send an optimize command to a core/collection in SolrCloud, the entire collection will be optimized ... the cloud will do one shard replica (core) at a time until the entire collection has been optimized. There is no way (currently) to ask it to only optimize a single core, or to do multiple cores simultaneously, even if they are on different servers.

Thanks,
Shawn
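
For anyone following the thread, here is a rough sketch of the two things discussed above: how much of the index the deleted documents account for, and what an optimize request to the update handler might look like. The function names, the host `http://localhost:8983/solr`, and the collection name `collection1` are placeholders, not values from this thread. The sketch only builds the URL and does not send anything; `optimize`, `maxSegments`, and `waitSearcher` are standard update-handler parameters, but check them against your Solr version before relying on this.

```python
from urllib.parse import urlencode


def deleted_fraction(num_docs: int, max_doc: int) -> float:
    """Return the share of document slots held by deleted documents.

    max_doc counts live plus deleted documents, so the difference
    max_doc - num_docs is the number of deleted documents.
    """
    if max_doc <= 0:
        return 0.0
    return (max_doc - num_docs) / max_doc


def optimize_url(base_url: str, collection: str, max_segments: int = 1) -> str:
    """Build (but do not send) an optimize request URL for a collection.

    base_url and collection are hypothetical placeholders supplied by the caller.
    """
    params = {
        "optimize": "true",
        "maxSegments": max_segments,
        "waitSearcher": "true",
    }
    return f"{base_url.rstrip('/')}/{collection}/update?{urlencode(params)}"


if __name__ == "__main__":
    # Numbers from the core statistics quoted in the original message.
    num_docs, max_doc = 28762340, 48079586
    print("deleted docs:", max_doc - num_docs)
    print("deleted fraction: {:.1%}".format(deleted_fraction(num_docs, max_doc)))
    # Placeholder host and collection name; adjust for your cluster.
    print(optimize_url("http://localhost:8983/solr", "collection1"))
```

As noted in the reply above, sending this request to any one core should cause the whole collection to be optimized, one shard replica at a time, so it is probably best scheduled for a quiet period.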