To be clear, those exceptions happen during the "main" mapred job that creates the many small indexes. When the errors quoted below occur (they don't fail the job), I'm 99% sure that's when the MTree job later hangs.
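Since those failures are checksum errors on the intermediate shard indexes, one sanity check would be to run Lucene's CheckIndex against one of the shards the main job writes, before the merge ever sees it. Rough sketch only; the path is made up and assumes the shard has been copied out of HDFS to local disk first:

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class VerifyShard {
      public static void main(String[] args) throws Exception {
        // Hypothetical local copy of one intermediate shard index (copied out of HDFS).
        File shardDir = new File(args.length > 0 ? args[0] : "/tmp/shard-copy/data/index");
        try (Directory dir = FSDirectory.open(shardDir)) {
          CheckIndex checker = new CheckIndex(dir);
          checker.setInfoStream(System.out); // per-segment detail, including which file fails its checksum
          CheckIndex.Status status = checker.checkIndex();
          System.out.println(status.clean ? "index is clean" : "index has corrupt segments");
        }
      }
    }

If CheckIndex flags the same .tip file as the trace below, that at least rules out a transient read problem in the reducer.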
On Tue, Sep 23, 2014 at 1:02 PM, Brett Hoerner <br...@bretthoerner.com> wrote:

> I believe these are related (they are new to me), anyone seen anything
> like this in Solr mapred?
>
> Error: java.io.IOException: org.apache.solr.client.solrj.SolrServerException:
> org.apache.solr.client.solrj.SolrServerException:
> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
>     at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:307)
>     at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:558)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> Caused by: org.apache.solr.client.solrj.SolrServerException:
> org.apache.solr.client.solrj.SolrServerException:
> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
>     at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:223)
>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>     at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
>     at org.apache.solr.hadoop.BatchWriter.close(BatchWriter.java:200)
>     at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:295)
>     ... 8 more
> Caused by: org.apache.solr.client.solrj.SolrServerException:
> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
>     at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:155)
>     ... 12 more
> Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
>     at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
>     at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:125)
>     at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
>     at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:282)
>     at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3315)
>     at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3306)
>     at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3020)
>     at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3169)
>     at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3136)
>     at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:582)
>     at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
>     at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1648)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1625)
>     at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
>     at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>     at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
>     ... 12 more
>
> [...snip...]
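For what it's worth, the expected/actual values in that trace are the checksum stored in the segment file's codec footer versus the one recomputed while the commit re-reads the file. A single file can be checked in isolation through the same call the trace goes through; a minimal sketch, where the directory path is a placeholder and the file name is just the one from the trace:

    import java.io.File;
    import org.apache.lucene.codecs.CodecUtil;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    public class VerifyFooter {
      public static void main(String[] args) throws Exception {
        // Placeholders: a local copy of the shard directory and the file named in the trace above.
        String indexDir = "/tmp/shard-copy/data/index";
        String fileName = "_1e_Lucene41_0.tip";
        try (Directory dir = FSDirectory.open(new File(indexDir));
             IndexInput in = dir.openInput(fileName, IOContext.DEFAULT)) {
          // Re-reads the whole file and compares the recomputed checksum against the footer;
          // throws CorruptIndexException with the same expected/actual message on a mismatch.
          CodecUtil.checksumEntireFile(in);
          System.out.println(fileName + ": footer checksum OK");
        }
      }
    }

If that reproducibly fails on a copy of the file, the bytes really are bad on disk; if it passes, the error was more likely a one-off read problem on the node that ran the reducer.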
another similar failure:

> 14/09/23 17:52:55 INFO mapreduce.Job: Task Id : attempt_1411487144915_0006_r_000046_0, Status : FAILED
> Error: java.io.IOException: org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:307)
>     at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:558)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1421)
>     at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:615)
>     at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
>     at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1648)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1625)
>     at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
>     at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>     at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>     at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
>     at org.apache.solr.hadoop.BatchWriter.close(BatchWriter.java:200)
>     at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:295)
>     ... 8 more
> Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=d9019857 actual=632aa4e2 (resource=BufferedChecksumIndexInput(_1i_Lucene41_0.tip))
>     at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
>     at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:125)
>     at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
>     at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:237)
>     at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:104)
>     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:426)
>     at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:292)
>     at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:277)
>     at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:251)
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1476)
>     ... 25 more
>
> On Tue, Sep 16, 2014 at 12:54 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>
>> I have a very weird problem that I'm going to try to describe here to see
>> if anyone has any "ah-ha" moments or clues. I haven't created a small
>> reproducible project for this but I guess I will have to try in the future
>> if I can't figure it out. (Or I'll need to bisect by running long Hadoop
>> jobs...)
>>
>> So, the facts:
>>
>> * Have been successfully using Solr mapred to build very large Solr
>> clusters for months
>> * As of Solr 4.10 *some* job sizes repeatably hang in the MTree merge
>> phase in 4.10
>> * Those same jobs (same input, output, and Hadoop cluster itself) succeed
>> if I only change my Solr deps to 4.9
>> * The job *does succeed* in 4.10 if I use the same data to create more,
>> but smaller shards (e.g. 12x as many shards each 1/12th the size of the job
>> that fails)
>> * Creating my "normal size" shards (the size I want, that works in 4.9)
>> the job hangs with 2 mappers running, 0 reducers in the MTree merge phase
>> * There are no errors or warning in the syslog/stderr of the MTree
>> mappers, no errors ever echo'd back to the "interactive run" of the job
>> (mapper says 100%, reduce says 0%, will stay forever)
>> * No CPU being used on the boxes running the merge, no GC happening, JVM
>> waiting on a futex, all threads blocked on various queues
>> * No disk usage problems, nothing else obviously wrong with any box in
>> the cluster
>>
>> I diff'ed around between 4.10 and 4.9 and barely see any changes in
>> mapred contrib, mostly some test stuff. I didn't see any transitive
>> dependency changes in Solr/Lucene that look like they would affect me.
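Given the symptoms in the quoted message (mappers idle, no GC, everything parked on queues), the most direct evidence of where the MTree merge is stuck is a full thread dump of the hung mapper, e.g. jstack <pid> against the task JVM. The in-process equivalent, if one wanted to wire a small diagnostic hook into the task, would look roughly like this sketch (nothing here is part of the Solr mapred code, it is just standard java.lang.management):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DumpThreads {
      public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Include lock owners and ownable synchronizers so blocked queue waits show what they wait on.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
          System.out.print(info); // ThreadInfo.toString() prints thread state, lock, and a stack snippet
        }
        long[] deadlocked = threads.findDeadlockedThreads();
        System.out.println(deadlocked == null
            ? "no deadlocked threads"
            : deadlocked.length + " deadlocked thread(s)");
      }
    }

Comparing that dump between a 4.9 run and a 4.10 run of the same job should show which queue the merge threads stop draining in 4.10.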