I thought I'd summarize a method that solved a problem we had trying to optimize a large shard that was running out of disk space (df=100% of the 400g volume, du=~380g). After we ran out of space, restarting Tomcat caused segment files to disappear from disk, leaving 3 segments.

What worked: we used the <optimize maxSegments=.../> functionality to optimize in stages, setting maxSegments to descending powers of 2: 16, 8, 4, 2, 1. We did not see merged segment files from previous generations left on disk. The staged optimize was as fast as optimizing in one pass to a single segment, which was the approach that ran out of space.

We were not adding documents to the index. We committed before doing the staged optimize. We do not delete documents. We do not use replication/distribution/snapshooter. We do not autocommit.
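Something like the following reproduces the sequence (a sketch, not our exact commands; the URL assumes a default single-core Solr on localhost:8983, so adjust host/port/core for your setup):

   URL=http://localhost:8983/solr/update
   # commit pending changes before starting the staged optimize
   curl "$URL" -H 'Content-Type: text/xml' --data-binary '<commit/>'
   # optimize down in stages: 16, 8, 4, 2, 1 segments
   for n in 16 8 4 2 1; do
     curl "$URL" -H 'Content-Type: text/xml' --data-binary "<optimize maxSegments=\"$n\"/>"
   done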

400g LVM volume; 192g, 30-segment shard; 188g once optimized.
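Rough arithmetic on why the single-pass optimize hit the limit (my reading, not verified): the merge writes the new segment alongside the old ones and only deletes the old files afterwards, so ~192g of existing segments plus an ~188g merged copy comes to ~380g, which lines up with the du=~380g on the full 400g volume.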

solrconfig:

<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxMergeDocs>2147483647</maxMergeDocs>
<maxFieldLength>10000000</maxFieldLength>
<unlockOnStartup>false</unlockOnStartup>
<reopenReaders>true</reopenReaders>
<deletionPolicy class="solr.SolrDeletionPolicy">
   <str name="keepOptimizedOnly">false</str>
   <str name="maxCommitsToKeep">1</str>
</deletionPolicy>

schema:

<field name="id" type="string" indexed="true" stored="true" required="true"/> <field name="ocr" type="CommonGramTest" indexed="true" stored="false" required="true"/> <field name="title" type="string" indexed="true" stored="true" multiValued="true" required="true"/> <field name="rights" type="sint" indexed="true" stored="true" required="true"/> <field name="author" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="date" type="string" indexed="true" stored="true"/>


Phil
hathitrust.org
