Hey Erick,

The job reducers began to die with "Error: Java heap space" after 1 hour
and 22 minutes of being stuck at ~80%.

I did a few more tests:

Test 1.
80,000 documents
Each document had *20* fields. The field names were *the same* for all the
documents. Values were different.
Job status: successful
Execution time: 33 seconds.

Test 2.
80,000 documents
Each document had *20* fields. The field names were *different* for all the
documents. Values were also different.
Job status: successful
Execution time: 643 seconds.

Test 3.
80,000 documents
Each document had *50* fields. The field names were *the same* for all the
documents. Values were different.
Job status: successful
Execution time: 45.96 seconds.

Test 4.
80,000 documents
Each document had *50* fields. The field names were *different* for all the
documents. Values were also different.
Job status: failed
Execution time: n/a (reducers failed after ~1h).
Unfortunately, this is my use case.
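
For reference, the test corpora can be generated with something along these
lines (a quick sketch; the class name, output file, and exact field-name
pattern are illustrative, not the code I actually ran):

import java.io.PrintWriter;
import java.util.Random;
import java.util.UUID;

public class TestDocGenerator {
    public static void main(String[] args) throws Exception {
        // true reproduces the Test 4 shape (every field name unique per doc),
        // false the Test 3 shape (the same 50 names in every doc).
        boolean uniqueNames = Boolean.parseBoolean(args[0]);
        int numDocs = 80_000, fieldsPerDoc = 50;
        Random rnd = new Random(42);
        try (PrintWriter out = new PrintWriter("docs.json")) {
            for (int d = 0; d < numDocs; d++) {
                StringBuilder sb = new StringBuilder();
                sb.append("{\"id\":\"").append(UUID.randomUUID()).append("\",\"items\":{");
                for (int f = 0; f < fieldsPerDoc; f++) {
                    // Embedding the doc index in the name makes it globally unique.
                    String name = uniqueNames ? "IT" + d + "_" + f + "_i" : "IT" + f + "_i";
                    if (f > 0) sb.append(',');
                    sb.append('"').append(name).append("\":").append(rnd.nextInt(100));
                }
                sb.append("}}");
                out.println(sb);
            }
        }
    }
}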

My guess is that the reduce time (spent performing the merges) depends on
whether the field names are the same across documents; if they differ, the
merge time increases dramatically. I don't know the internals of the Solr
merge operation, but is it possible that it tries to group fields with the
same name across all the documents?
In the first case, when the field names are the same across documents, the
number of buckets is equal to the number of unique field names: 20 (or 50).
In the second case, where all the field names are different (my use case),
it creates far more buckets (80k documents * 50 field names = 4 million
buckets) and the process slows down significantly.
Is this assumption correct? If so, is there any way to work around it?
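
To put numbers on that (a back-of-the-envelope restatement of the figures
above, nothing measured inside Solr):

long docs = 80_000L;
long fieldsPerDoc = 50L;
long sharedNameBuckets = fieldsPerDoc;        // identical names across docs: 50
long uniqueNameBuckets = docs * fieldsPerDoc; // all names distinct: 4,000,000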

Thanks again for reaching out. I hope this is clearer now.

This is what one of the 80k documents looks like (JSON format):
{
"id" : "442247098240414508034066540706561683636",
"items" : {
   "IT49597_1180_i" : 76,
   "IT25363_1218_i" : 4,
   "IT12418_1291_i" : 95,
   "IT55979_1051_i" : 31,
   "IT9841_1224_i" : 36,
   "IT40463_1010_i" : 87,
   "IT37932_1346_i" : 11,
   "IT17653_1054_i" : 37,
   "IT59414_1025_i" : 96,
   "IT51080_1133_i" : 5,
   "IT7369_1395_i" : 90,
   "IT59974_1245_i" : 25,
   "IT25374_1345_i" : 75,
   "IT16825_1458_i" : 28,
   "IT56643_1050_i" : 76,
   "IT46274_1398_i" : 50,
   "IT47411_1275_i" : 11,
   "IT2791_1000_i" : 97,
   "IT7708_1053_i" : 96,
   "IT46622_1112_i" : 90,
   "IT47161_1382_i" : 64
   }
}
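
(For context: the _i suffix matches Solr's stock *_i integer dynamic field,
so each document effectively introduces 20-50 brand-new indexed fields. The
nested "items" map gets flattened into top-level fields, which is, as far
as I can tell, what the morphline is doing; in SolrJ terms the equivalent
logic would look roughly like this sketch, with made-up class and method
names:)

import java.util.Map;
import org.apache.solr.common.SolrInputDocument;

public class ItemsFlattener {
    public static SolrInputDocument flatten(String id, Map<String, Integer> items) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        // Each map key becomes a top-level dynamic field, e.g. "IT49597_1180_i" -> 76.
        for (Map.Entry<String, Integer> e : items.entrySet()) {
            doc.addField(e.getKey(), e.getValue());
        }
        return doc;
    }
}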

Costi


On Mon, May 26, 2014 at 7:45 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> The MapReduceIndexerTool is really intended for very large data sets,
> and by today's standards 80K doesn't qualify :).
>
> Basically, MRIT creates N sub-indexes, then merges them, which it
> may do in a tiered fashion. That is, it may merge gen1 to gen2, then
> merge gen2 to gen3 etc. Which is great when indexing a bazillion
> documents into 20 shards, but all that copying around may take
> more time than you really gain for 80K docs.
>
> Also be aware that MRIT does NOT update docs with the same ID; this is
> due to an inherent limitation of the Lucene mergeIndex process.
>
> How long is "a long time"? Attachments tend to get filtered out, so if you
> want us to see the graph, you might paste it somewhere and provide a link.
>
> Best,
> Erick
>
> On Mon, May 26, 2014 at 8:51 AM, Costi Muraru <costimur...@gmail.com>
> wrote:
> > Hey guys,
> >
> > I'm using the MapReduceIndexerTool to import data into a SolrCloud
> > cluster made out of 3 decent machines.
> > Looking in the JobTracker, I can see that the mapper jobs finish quite
> > fast. The reduce jobs get to ~80% quite fast as well, but it is here
> > that they get stuck for a long period of time (picture + log attached).
> > I'm only trying to insert ~80k documents with 10-50 different fields
> > each. Why is this happening? Am I not setting something correctly? Could
> > it be that most of the documents have different field names, or too
> > many fields for that matter?
> > Any tips are gladly appreciated.
> >
> > Thanks,
> > Costi
> >
> > From the reduce logs:
> > 60208 [main] INFO  org.apache.solr.update.UpdateHandler  - start
> >
> commit{,optimize=false,openSearcher=true,waitSearcher=false,expungeDeletes=false,softCommit=false,prepareCommit=false}
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]: commit: start
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]: commit: enter lock
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]: commit: now prepare
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]: prepareCommit: flush
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [IW][main]:   index before flush
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [DW][main]: main startFullFlush
> > 60208 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [DW][main]: anyChanges? numDocsInRam=25603 deletes=true
> > hasTickets:false pendingChangesInFullFlush: false
> > 60209 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [DWFC][main]: addFlushableState DocumentsWriterPerThread
> > [pendingDeletes=gen=0 25602 deleted terms (unique count=25602)
> > bytesUsed=5171604, segment=_0, aborting=false, numDocsInRAM=25603,
> > deleteQueue=DWDQ: [ generation: 0 ]]
> > 61542 [main] INFO  org.apache.solr.update.LoggingInfoStream  -
> > [DWPT][main]: flush postings as segment _0 numDocs=25603
> > 61664 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> > 125115 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> > 199408 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> > 271088 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> > 336754 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> > 417810 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> > 479495 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> > 552357 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> > 621450 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> > 683173 [Thread-32] INFO  org.apache.solr.hadoop.HeartBeater  - Issuing
> > heart beat for 1 threads
> >
> > This is the run command I'm using:
> > hadoop jar
> /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar
> > org.apache.solr.hadoop.MapReduceIndexerTool \
> >  --log4j /home/cmuraru/solr/log4j.properties \
> >  --morphline-file morphline.conf \
> >  --output-dir hdfs://nameservice1:8020/tmp/outdir \
> >  --verbose --go-live --zk-host localhost:2181/solr \
> >  --collection collection1 \
> > hdfs://nameservice1:8020/tmp/indir
>
