The MapReduceIndexerTool is really intended for very large data sets, and by today's standards 80K doesn't qualify :).
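For a corpus that size it's often simpler to skip MapReduce entirely and post the documents straight to Solr's update handler. A hypothetical sketch (assumes Solr is listening on localhost:8983, the collection is named collection1, and docs.json holds your documents in Solr's JSON update format -- adjust to your setup):

```shell
# Post a batch of JSON documents directly to the collection's update
# handler and commit in the same request. For 80K docs this typically
# finishes in seconds to minutes, with no merge tiers involved.
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/collection1/update?commit=true' \
  --data-binary @docs.json
```

Unlike the MRIT path, this route also respects uniqueKey semantics, so re-posting a document with the same ID updates it rather than duplicating it.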
Basically, MRIT creates N sub-indexes, then merges them, which it may do in a tiered fashion. That is, it may merge gen1 into gen2, then gen2 into gen3, etc. That's great when indexing a bazillion documents into 20 shards, but all that copying around may take more time than you really gain for 80K docs. Also be aware that MRIT does NOT update docs with the same ID; this is an inherent limitation of the Lucene index-merging process.

How long is "a long time"? Attachments tend to get filtered out, so if you want us to see the graph you might paste it somewhere and provide a link.

Best,
Erick

On Mon, May 26, 2014 at 8:51 AM, Costi Muraru <costimur...@gmail.com> wrote:
> Hey guys,
>
> I'm using the MapReduceIndexerTool to import data into a SolrCloud
> cluster made out of 3 decent machines.
> Looking in the JobTracker, I can see that the mapper jobs finish quite
> fast. The reduce jobs get to ~80% quite fast as well. It is here where
> they get stuck for a long period of time (picture + log attached).
> I'm only trying to insert ~80k documents with 10-50 different fields
> each. Why is this happening? Am I not setting something correctly? Is
> it the fact that most of the documents have different field names, or too
> many for that matter?
> Any tips are gladly appreciated.
>
> Thanks,
> Costi
>
> From the reduce logs:
>
> 60208 [main] INFO org.apache.solr.update.UpdateHandler - start commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: commit: start
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: commit: enter lock
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: commit: now prepare
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: prepareCommit: flush
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: index before flush
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [DW][main]: main startFullFlush
> 60208 [main] INFO org.apache.solr.update.LoggingInfoStream - [DW][main]: anyChanges? numDocsInRam=25603 deletes=true hasTickets:false pendingChangesInFullFlush: false
> 60209 [main] INFO org.apache.solr.update.LoggingInfoStream - [DWFC][main]: addFlushableState DocumentsWriterPerThread [pendingDeletes=gen=0 25602 deleted terms (unique count=25602) bytesUsed=5171604, segment=_0, aborting=false, numDocsInRAM=25603, deleteQueue=DWDQ: [ generation: 0 ]]
> 61542 [main] INFO org.apache.solr.update.LoggingInfoStream - [DWPT][main]: flush postings as segment _0 numDocs=25603
> 61664 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 125115 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 199408 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 271088 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 336754 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 417810 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 479495 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 552357 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 621450 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
> 683173 [Thread-32] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
>
> This is the run command I'm using:
>
> hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
> org.apache.solr.hadoop.MapReduceIndexerTool \
> --log4j /home/cmuraru/solr/log4j.properties \
> --morphline-file morphline.conf \
> --output-dir hdfs://nameservice1:8020/tmp/outdir \
> --verbose --go-live --zk-host localhost:2181/solr \
> --collection collection1 \
> hdfs://nameservice1:8020/tmp/indir
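One more thought: if you do stay with MRIT, you can limit how many sub-indexes get created and how many merge generations you pay for. From memory the relevant knobs are --reducers and --fanout, but double-check MapReduceIndexerTool --help on your version before relying on them. A hypothetical variant of your command (values are illustrative only):

```shell
# Fewer reducers -> fewer sub-indexes to merge afterwards; a larger
# fanout -> more segments merged per tier, so fewer gen1->gen2->gen3
# passes. Remaining arguments as in your original command.
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --reducers 3 \
  --fanout 8 \
  --morphline-file morphline.conf \
  --output-dir hdfs://nameservice1:8020/tmp/outdir \
  --go-live --zk-host localhost:2181/solr \
  --collection collection1 \
  hdfs://nameservice1:8020/tmp/indir
```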