The MapReduceIndexerTool is really intended for very large data sets, and by today's standards 80K doesn't qualify :).
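For a corpus that size it's often simpler to skip MapReduce entirely and post the documents straight to Solr's update handler. A hypothetical sketch (assumes Solr is listening on localhost:8983, the collection is named collection1, and docs.json holds your documents in Solr's JSON update format -- adjust to your setup):

```shell
# Post a batch of JSON documents directly to the collection's update
# handler and commit in the same request. For 80K docs this typically
# finishes in seconds to minutes, with no merge tiers involved.
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/collection1/update?commit=true' \
  --data-binary @docs.json
```

Unlike the MRIT path, this route also respects uniqueKey semantics, so re-posting a document with the same ID updates it rather than duplicating it.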
Basically, MRIT creates N sub-indexes, then merges them, which it may do in a tiered fashion. That is, it may merge gen1 into gen2, then gen2 into gen3, etc. That's great when indexing a bazillion documents into 20 shards, but all that copying around may take more time than you really gain for 80K docs. Also be aware that MRIT does NOT update docs with the same ID; this is an inherent limitation of the Lucene index-merging process.

How long is "a long time"? Attachments tend to get filtered out, so if you want us to see the graph you might paste it somewhere and provide a link.

Best,
Erick

On Mon, May 26, 2014 at 8:51 AM, Costi Muraru <costimur...@gmail.com> wrote:
> Hey guys,
>
> I'm using the MapReduceIndexerTool to import data into a SolrCloud
> cluster made out of 3 decent machines.
> Looking in the JobTracker, I can see that the mapper jobs finish quite
> fast. The reduce jobs get to ~80% quite fast as well. It is here where
> they get stuck for a long period of time (picture + log attached).
> I'm only trying to insert ~80k documents with 10-50 different fields
> each. Why is this happening? Am I not setting something correctly? Is
> it the fact that most of the documents have different field names, or too
> many for that matter?
> Any tips are gladly appreciated.
>
> Thanks,
> Costi
>
> From the reduce logs:
>
> 60208 [main] INFO org.apache.solr.update.UpdateHandler - start commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: commit: start
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: commit: enter lock
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: commit: now prepare
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: prepareCommit: flush
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: index before flush
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [DW][main]: main startFullFlush
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [DW][main]: anyChanges? numDocsInRam=25603 deletes=true hasTickets:false pendingChangesInFullFlush: false
> 60209 [main] INFO org.apache.solr.update.LoggingInfoStream - [DWFC][main]: addFlushableState DocumentsWriterPerThread [pendingDeletes=gen=0 25602 deleted terms (unique count=25602) bytesUsed=5171604, segment=_0, aborting=false, numDocsInRAM=25603, deleteQueue=DWDQ: [ generation: 0 ]]
> 61542 [main] INFO org.apache.solr.update.LoggingInfoStream - [DWPT][main]: flush postings as segment _0 numDocs=25603
> 61664 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 125115 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 199408 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 271088 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 336754 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 417810 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 479495 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 552357 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 621450 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 683173 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
>
> This is the run command I'm using:
>
> hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
> org.apache.solr.hadoop.MapReduceIndexerTool \
> --log4j /home/cmuraru/solr/log4j.properties \
> --morphline-file morphline.conf \
> --output-dir hdfs://nameservice1:8020/tmp/outdir \
> --verbose --go-live --zk-host localhost:2181/solr \
> --collection collection1 \
> hdfs://nameservice1:8020/tmp/indir
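One more thought: if you do stay with MRIT, you can limit how many sub-indexes get created and how many merge generations you pay for. From memory the relevant knobs are --reducers and --fanout, but double-check MapReduceIndexerTool --help on your version before relying on them. A hypothetical variant of your command (values are illustrative only):

```shell
# Fewer reducers -> fewer sub-indexes to merge afterwards; a larger
# fanout -> more segments merged per tier, so fewer gen1->gen2->gen3
# passes. Remaining arguments as in your original command.
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --reducers 3 \
  --fanout 8 \
  --morphline-file morphline.conf \
  --output-dir hdfs://nameservice1:8020/tmp/outdir \
  --go-live --zk-host localhost:2181/solr \
  --collection collection1 \
  hdfs://nameservice1:8020/tmp/indir
```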