That was it. Setting omitNorms="true" on all fields fixed my problem. I left it indexing all weekend, and heap usage still looks great.
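omitNorms is set per field (or field type) in schema.xml. As a rough illustration of what that flag means at the Lucene layer, here is a minimal standalone sketch, using a made-up field name and an in-memory index (not the attached schema): a field type with norms omitted writes no per-document norm data, so there is nothing for the norms producer to hold on the heap or merge later.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class OmitNormsSketch {
        public static void main(String[] args) throws Exception {
            // Roughly what omitNorms="true" does at the Lucene layer: the field type
            // is marked to skip norms, so no per-document norm data is written for it.
            FieldType noNorms = new FieldType(TextField.TYPE_NOT_STORED);
            noNorms.setOmitNorms(true);
            noNorms.freeze();

            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_44,
                    new StandardAnalyzer(Version.LUCENE_44));
            IndexWriter writer = new IndexWriter(new RAMDirectory(), cfg);

            Document doc = new Document();
            doc.add(new Field("dim_example", "some text", noNorms)); // field name is illustrative
            writer.addDocument(doc);
            writer.close();
        }
    }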
I'm still not clear why bouncing the solr instance freed up memory, unless the in-memory structure for this norms data is lazily loaded somehow. Anyway, thank you very much for the suggestion.

-Greg

On Fri, Dec 27, 2013 at 4:25 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> Likely this is for field norms, which use doc values under the hood.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Dec 26, 2013 at 5:03 PM, Greg Preston <gpres...@marinsoftware.com> wrote:
>> Does anybody with knowledge of solr internals know why I'm seeing instances of Lucene42DocValuesProducer when I don't have any fields that are using DocValues? Or am I misunderstanding what this class is for?
>>
>> -Greg
>>
>>
>> On Mon, Dec 23, 2013 at 12:07 PM, Greg Preston <gpres...@marinsoftware.com> wrote:
>>> Hello,
>>>
>>> I'm loading up our solr cloud with data (from a solrj client) and running into a weird memory issue. I can reliably reproduce the problem.
>>>
>>> - Using Solr Cloud 4.4.0 (also replicated with 4.6.0)
>>> - 24 solr nodes (one shard each), spread across 3 physical hosts, each host has 256G of memory
>>> - index and tlogs on ssd
>>> - Xmx=7G, G1GC
>>> - Java 1.7.0_25
>>> - schema and solrconfig.xml attached
>>>
>>> I'm using composite routing to route documents with the same clientId to the same shard. After several hours of indexing, I occasionally see an IndexWriter go OOM. I think that's a symptom. When that happens, indexing continues, and that node's tlog starts to grow. When I notice this, I stop indexing, and bounce the problem node. That's where it gets interesting.
>>>
>>> Upon bouncing, the tlog replays, and then segments merge. Once the merging is complete, the heap is fairly full, and forced full GC only helps a little. But if I then bounce the node again, the heap usage goes way down, and stays low until the next segment merge. I believe segment merges are also what causes the original OOM.
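Composite routing here is SolrCloud's compositeId router: the shard key is sent as a prefix of the document id, separated by '!'. A minimal SolrJ 4.x sketch of that indexing pattern follows; the ZooKeeper addresses, collection name, ids, and field name are illustrative assumptions, not taken from this setup.

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CompositeRoutingSketch {
        public static void main(String[] args) throws Exception {
            // ZooKeeper ensemble and collection name are made up for this sketch.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("mycollection");

            SolrInputDocument doc = new SolrInputDocument();
            // compositeId routing: id is "<clientId>!<docId>", so everything sharing
            // the clientId prefix hashes to the same shard.
            doc.addField("id", "client42!doc-0001");
            doc.addField("dim_example", "some value"); // field name is illustrative
            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }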
>>> More details:
>>>
>>> Index on disk for this node is ~13G, tlog is ~2.5G.
>>> See attached mem1.png. This is a jconsole view of the heap during the following:
>>>
>>> (Solr cloud node started at the left edge of this graph)
>>>
>>> A) One CPU core pegged at 100%. Thread dump shows:
>>> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
>>>    java.lang.Thread.State: RUNNABLE
>>>         at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
>>>         at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
>>>         at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
>>>         at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
>>>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
>>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
>>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>>
>>> B) One CPU core pegged at 100%. Manually triggered GC. Lots of memory freed. Thread dump shows:
>>> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
>>>    java.lang.Thread.State: RUNNABLE
>>>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>>>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
>>>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>>>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>>>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>>
>>> C) One CPU core pegged at 100%. Manually triggered GC. No memory freed. Thread dump shows:
>>> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
>>>    java.lang.Thread.State: RUNNABLE
>>>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>>>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
>>>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>>>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>>>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>>
>>> D) One CPU core pegged at 100%. Thread dump shows:
>>> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
>>>    java.lang.Thread.State: RUNNABLE
>>>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:322)
>>>         at org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:169)
>>>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.merge(CompressingTermVectorsWriter.java:789)
>>>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:312)
>>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:130)
>>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>>>
>>> E) CPU usage drops to nominal levels. Thread dump shows no Lucene Merge Thread.
>>>
>>> F) Manually triggered full GC. Some memory freed, but much remains.
>>>
>>> G) Restarted solr. Very little memory used.
>>>
>>>
>>> Throughout all of this, there was no indexing or querying happening. In order to try to determine what's using up the memory, I took a heap dump at point (F) and analyzed it in Eclipse MAT (see attached screenshot). This shows 311 instances of Lucene42DocValuesProducer$3, each holding a large byte[]. By attaching a remote debugger and re-running, it looks like there is one of these byte[] for each field in the schema (we have several of the "dim_*" dynamic fields). And as far as I know, I'm not using DocValues at all.
>>>
>>> Any hints as to what might be going on here would be greatly appreciated. It takes me about 10 minutes to reproduce this, so I'm willing to try things. I don't know enough about the internals of solr's memory usage to proceed much further on my own.
>>>
>>> Thank you.
>>>
>>> -Greg
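The mergeNorms frames above, which run through DocValuesConsumer and Lucene42DocValuesConsumer, are the norms themselves: since Lucene 4.2, norms are written and merged through the doc values machinery, which is why Lucene42DocValuesProducer instances show up holding a byte[] per field even when no field declares docValues="true". Below is a minimal Lucene 4.x sketch for checking which fields in an index directory actually carry norms; the index path comes from the command line, and nothing else in it is specific to this schema.

    import java.io.File;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.store.FSDirectory;

    public class NormsCheck {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a core's data/index directory (read-only inspection).
            DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
            try {
                for (AtomicReaderContext leaf : reader.leaves()) {
                    for (FieldInfo fi : leaf.reader().getFieldInfos()) {
                        // Norms are exposed as NumericDocValues per indexed field;
                        // fields indexed with norms omitted return null here.
                        NumericDocValues norms = leaf.reader().getNormValues(fi.name);
                        System.out.println("segment=" + leaf.ord + " field=" + fi.name
                                + " hasNorms=" + (norms != null));
                    }
                }
            } finally {
                reader.close();
            }
        }
    }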