To be clear, those exceptions happen during the "main" mapred job that creates the many small indexes. When the errors quoted below occur (they don't fail the job), I'm 99% sure that's when the MTree job later hangs.
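Since those failures are checksum errors on the intermediate shard indexes, one sanity check would be to run Lucene's CheckIndex against one of the shards the main job writes, before the merge ever sees it. Rough sketch only; the path is made up and assumes the shard has been copied out of HDFS to local disk first:

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class VerifyShard {
      public static void main(String[] args) throws Exception {
        // Hypothetical local copy of one intermediate shard index (copied out of HDFS).
        File shardDir = new File(args.length > 0 ? args[0] : "/tmp/shard-copy/data/index");
        try (Directory dir = FSDirectory.open(shardDir)) {
          CheckIndex checker = new CheckIndex(dir);
          checker.setInfoStream(System.out); // per-segment detail, including which file fails its checksum
          CheckIndex.Status status = checker.checkIndex();
          System.out.println(status.clean ? "index is clean" : "index has corrupt segments");
        }
      }
    }

If CheckIndex flags the same .tip file as the trace below, that at least rules out a transient read problem in the reducer.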
On Tue, Sep 23, 2014 at 1:02 PM, Brett Hoerner <br...@bretthoerner.com> wrote:

> I believe these are related (they are new to me), anyone seen anything
> like this in Solr mapred?
>
> Error: java.io.IOException: org.apache.solr.client.solrj.SolrServerException:
> org.apache.solr.client.solrj.SolrServerException:
> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
>     at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:307)
>     at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:558)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> Caused by: org.apache.solr.client.solrj.SolrServerException:
> org.apache.solr.client.solrj.SolrServerException:
> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
>     at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:223)
>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>     at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
>     at org.apache.solr.hadoop.BatchWriter.close(BatchWriter.java:200)
>     at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:295)
>     ... 8 more
> Caused by: org.apache.solr.client.solrj.SolrServerException:
> org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
>     at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:155)
>     ... 12 more
> Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=5fb8f6da actual=8b048ec4 (resource=BufferedChecksumIndexInput(_1e_Lucene41_0.tip))
>     at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
>     at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:125)
>     at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
>     at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:282)
>     at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3315)
>     at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3306)
>     at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3020)
>     at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3169)
>     at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3136)
>     at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:582)
>     at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
>     at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1648)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1625)
>     at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
>     at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>     at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
>     ... 12 more
>
> [...snip...]
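For what it's worth, the expected/actual values in that trace are the checksum stored in the segment file's codec footer versus the one recomputed while the commit re-reads the file. A single file can be checked in isolation through the same call the trace goes through; a minimal sketch, where the directory path is a placeholder and the file name is just the one from the trace:

    import java.io.File;
    import org.apache.lucene.codecs.CodecUtil;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    public class VerifyFooter {
      public static void main(String[] args) throws Exception {
        // Placeholders: a local copy of the shard directory and the file named in the trace above.
        String indexDir = "/tmp/shard-copy/data/index";
        String fileName = "_1e_Lucene41_0.tip";
        try (Directory dir = FSDirectory.open(new File(indexDir));
             IndexInput in = dir.openInput(fileName, IOContext.DEFAULT)) {
          // Re-reads the whole file and compares the recomputed checksum against the footer;
          // throws CorruptIndexException with the same expected/actual message on a mismatch.
          CodecUtil.checksumEntireFile(in);
          System.out.println(fileName + ": footer checksum OK");
        }
      }
    }

If that reproducibly fails on a copy of the file, the bytes really are bad on disk; if it passes, the error was more likely a one-off read problem on the node that ran the reducer.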
another similar failure:

> 14/09/23 17:52:55 INFO mapreduce.Job: Task Id : attempt_1411487144915_0006_r_000046_0, Status : FAILED
> Error: java.io.IOException: org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:307)
>     at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:558)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:637)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
> Caused by: org.apache.solr.common.SolrException: Error opening new searcher
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677)
>     at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1421)
>     at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:615)
>     at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
>     at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1648)
>     at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1625)
>     at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
>     at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>     at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
>     at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:168)
>     at org.apache.solr.hadoop.BatchWriter.close(BatchWriter.java:200)
>     at org.apache.solr.hadoop.SolrRecordWriter.close(SolrRecordWriter.java:295)
>     ... 8 more
> Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
> : expected=d9019857 actual=632aa4e2 (resource=BufferedChecksumIndexInput(_1i_Lucene41_0.tip))
>     at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211)
>     at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268)
>     at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:125)
>     at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:441)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:197)
>     at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:254)
>     at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
>     at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:143)
>     at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:237)
>     at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:104)
>     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:426)
>     at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:292)
>     at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:277)
>     at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:251)
>     at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1476)
>     ... 25 more
>
> On Tue, Sep 16, 2014 at 12:54 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>
>> I have a very weird problem that I'm going to try to describe here to see
>> if anyone has any "ah-ha" moments or clues. I haven't created a small
>> reproducible project for this but I guess I will have to try in the future
>> if I can't figure it out. (Or I'll need to bisect by running long Hadoop
>> jobs...)
>>
>> So, the facts:
>>
>> * Have been successfully using Solr mapred to build very large Solr
>> clusters for months
>> * As of Solr 4.10 *some* job sizes repeatably hang in the MTree merge
>> phase in 4.10
>> * Those same jobs (same input, output, and Hadoop cluster itself) succeed
>> if I only change my Solr deps to 4.9
>> * The job *does succeed* in 4.10 if I use the same data to create more,
>> but smaller shards (e.g. 12x as many shards each 1/12th the size of the job
>> that fails)
>> * Creating my "normal size" shards (the size I want, that works in 4.9)
>> the job hangs with 2 mappers running, 0 reducers in the MTree merge phase
>> * There are no errors or warning in the syslog/stderr of the MTree
>> mappers, no errors ever echo'd back to the "interactive run" of the job
>> (mapper says 100%, reduce says 0%, will stay forever)
>> * No CPU being used on the boxes running the merge, no GC happening, JVM
>> waiting on a futex, all threads blocked on various queues
>> * No disk usage problems, nothing else obviously wrong with any box in
>> the cluster
>>
>> I diff'ed around between 4.10 and 4.9 and barely see any changes in
>> mapred contrib, mostly some test stuff. I didn't see any transitive
>> dependency changes in Solr/Lucene that look like they would affect me.
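Given the symptoms in the quoted message (mappers idle, no GC, everything parked on queues), the most direct evidence of where the MTree merge is stuck is a full thread dump of the hung mapper, e.g. jstack <pid> against the task JVM. The in-process equivalent, if one wanted to wire a small diagnostic hook into the task, would look roughly like this sketch (nothing here is part of the Solr mapred code, it is just standard java.lang.management):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DumpThreads {
      public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Include lock owners and ownable synchronizers so blocked queue waits show what they wait on.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
          System.out.print(info); // ThreadInfo.toString() prints thread state, lock, and a stack snippet
        }
        long[] deadlocked = threads.findDeadlockedThreads();
        System.out.println(deadlocked == null
            ? "no deadlocked threads"
            : deadlocked.length + " deadlocked thread(s)");
      }
    }

Comparing that dump between a 4.9 run and a 4.10 run of the same job should show which queue the merge threads stop draining in 4.10.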