RE: Machine utilization while indexing

Nagelberg, Kallin Thu, 20 May 2010 08:33:57 -0700

Well, to be fair, I'm indexing on a modest virtualized machine with only 2 GB 
of RAM, and my documents, at 5-10k each, may be substantially larger than 
yours. They could be substantially smaller too. As another point of reference, 
my index ends up being about 20 GB with the 5 million docs.
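
For a rough sense of scale, the figures in this thread can be turned into 
throughput numbers. The sketch below is only back-of-the-envelope arithmetic: 
it assumes 1 GB means 10^9 bytes, takes the 8-hour figure from the earlier 
message quoted below, and treats the 18-19ms per insert reported by the 
original poster as a constant single-threaded cost. The class and method names 
are made up for illustration.

```java
final class IndexingEstimate {

    /** Average documents indexed per second over a whole run. */
    static double docsPerSecond(long docs, double hours) {
        return docs / (hours * 3600.0);
    }

    /** Hours needed if every insert costs a fixed number of milliseconds. */
    static double hoursAtLatency(long docs, double millisPerDoc) {
        return docs * millisPerDoc / 3_600_000.0;
    }

    /** Average index size per document, assuming 1 GB = 10^9 bytes. */
    static double bytesPerDoc(double indexGigabytes, long docs) {
        return indexGigabytes * 1e9 / docs;
    }

    public static void main(String[] args) {
        // 5 million docs in about 8 hours, ending in a 20 GB index.
        System.out.printf("throughput: %.0f docs/s%n", docsPerSecond(5_000_000L, 8.0));
        System.out.printf("index size: %.0f bytes/doc%n", bytesPerDoc(20.0, 5_000_000L));
        // 3-5 million docs at 18-19 ms per insert on a single thread.
        System.out.printf("best case:  %.1f hours%n", hoursAtLatency(3_000_000L, 18.0));
        System.out.printf("worst case: %.1f hours%n", hoursAtLatency(5_000_000L, 19.0));
    }
}
```

That works out to roughly 174 documents per second and about 4 KB of index per 
document for my run, and somewhere between 15 and 26 hours for 3-5 million 
documents at 18-19ms each.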


I should also point out that I only need to do this once. I'm not constantly 
reindexing everything. My indexed documents rarely change, and when they do we 
have a process that selectively updates the few that need it. Combine that 
with a constant trickle of new documents, and indexing performance isn't much 
of a concern.

You should be able to experiment with a small subset of your documents to 
quickly test new schemas, etc. In my case I selected a representative sample 
and stored it in my project for unit testing.

-Kallin Nagelberg


-----Original Message-----
From: Dennis Gearon [mailto:gear...@sbcglobal.net] 
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Machine utilization while indexing

It takes that long to do indexing? I'm HOPING to have a site that has anywhere 
from the low tens of millions of documents to billions. 

Sounds to me like I will DEFINITELY need a cloud account at indexing time. For 
the original author of this thread, that's what I'd recommend.

1/ Optimize as best as you can on one machine.
2/ Set up an Amazon EC2 (Elastic Compute Cloud) account. Spawn 5-10 machines 
and shard the indexing across them. Combine the indexes, then shut down the 
EC2 instances. You could probably get it down to 1/2 hour, without impacting 
your current queries.
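
One way to split the documents across machines in step 2 is to assign each 
document to a shard by its numeric ID. The sketch below is only an 
illustration under that assumption: the thread does not say how the work 
should be divided, and ShardRouter, shardFor and partition are hypothetical 
names. It covers only the assignment of IDs to shards, not the indexing or 
the later combining of the indexes.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardRouter {

    /** Maps a document ID to a shard number in [0, shardCount). */
    static int shardFor(long docId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        // floorMod keeps the result non-negative even for negative IDs.
        return (int) Math.floorMod(docId, (long) shardCount);
    }

    /** Groups document IDs into one list per shard. */
    static List<List<Long>> partition(List<Long> docIds, int shardCount) {
        List<List<Long>> shards = new ArrayList<>(shardCount);
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
        for (long id : docIds) {
            shards.get(shardFor(id, shardCount)).add(id);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 0; id < 20; id++) {
            ids.add(id);
        }
        // Each of the 5 hypothetical indexing machines gets its own ID list.
        List<List<Long>> shards = partition(ids, 5);
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("shard " + i + ": " + shards.get(i));
        }
    }
}
```

Plain modulo spreads sequential IDs evenly. If the IDs are clustered, hashing 
them first would probably balance the shards better.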


Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 5/20/10, Nagelberg, Kallin <knagelb...@globeandmail.com> wrote:

> From: Nagelberg, Kallin <knagelb...@globeandmail.com>
> Subject: RE: Machine utilization while indexing
> To: "'solr-user@lucene.apache.org'" <solr-user@lucene.apache.org>
> Date: Thursday, May 20, 2010, 8:16 AM
> How about putting a BlockingQueue,
> http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html,
> between your document creator and SolrServer? Give it a size
> of 10,000 or so, with one thread feeding it and one thread
> waiting for it to get nearly full and then draining it. Take
> the drained results and add them to the server (maybe try not
> using StreamingUpdateSolrServer). Something like that worked
> well for me: about 5,000,000 documents of ~5k each took about
> 8 hours.
> 
> -Kallin Nagelberg
> 
> -----Original Message-----
> From: Thijs [mailto:vonk.th...@gmail.com]
> Sent: Thursday, May 20, 2010 11:02 AM
> To: solr-user@lucene.apache.org
> Subject: Machine utilization while indexing
> 
> Hi.
> 
> I have a question about how I can get Solr to index quicker
> than it does at the moment.
> 
> I have to index (and re-index) some 3-5 million documents.
> These documents are preprocessed by a Java application that
> effectively combines multiple database tables with each other
> to form the SolrInputDocument.
> 
> What I'm seeing, however, is that the queue of documents that
> are ready to be sent to the Solr server exceeds my preset
> limit, telling me that Solr somehow can't process the
> documents fast enough.
> 
> (I have created my own queue in front of SolrJ's
> StreamingUpdateSolrServer, as it would not process the
> documents fast enough, causing OutOfMemoryErrors due to the
> large number of documents building up in its queue.)
> 
> I have an index that consists 95% of IDs (Long). We don't do
> any analysis on the fields that are being indexed. The schema
> is rather straightforward.
> 
> Most fields look like:
> <fieldType name="long" class="solr.LongField" omitNorms="true"/>
> <field name="objectId" type="long" stored="true" indexed="true" required="true" />
> <field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>
> 
> The relevant solrconfig.xml:
> <indexDefaults>
>    <useCompoundFile>false</useCompoundFile>
>    <mergeFactor>100</mergeFactor>
>    <RAMBufferSizeMB>256</RAMBufferSizeMB>
>    <maxMergeDocs>2147483647</maxMergeDocs>
>    <maxFieldLength>10000</maxFieldLength>
>    <writeLockTimeout>1000</writeLockTimeout>
>    <commitLockTimeout>10000</commitLockTimeout>
>    <lockType>single</lockType>
> </indexDefaults>
> 
> 
> The machines I'm testing on have an
> Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
> with 4GB of RAM,
> running Linux, Java version 1.6.0_17, Tomcat 6 and Solr
> version 1.4.
> 
> What I'm seeing is that the network almost never reaches
> more than 10% of the 1GB/s connection,
> that the CPU utilization is always below 25% (1 core is
> used, not the others),
> and that there is no heavy disk I/O.
> Also, while indexing, the memory consumption is:
> Free memory: 212.15 MB Total memory: 509.12 MB Max memory: 2730.68 MB
> 
> In the beginning (with an empty index) I get 2ms per insert,
> but this slows to 18-19ms per insert.
> 
> Are there any tips/tricks I can use to speed up my
> indexing? I have a feeling that my machine is capable of
> doing more (using more CPUs). I just can't figure out how.
> 
> Thijs
>
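
The BlockingQueue suggestion quoted above could be sketched roughly as 
follows. This is a minimal illustration, not the code anyone in the thread 
actually ran: QueueBatcher, drainInBatches and indexAll are hypothetical 
names, the documents are plain strings, and the sink is a stand-in for 
whatever sends a batch to Solr (with SolrJ that would presumably be a call 
such as SolrServer.add(Collection<SolrInputDocument>)). It also drains 
whenever documents are available rather than waiting for the queue to be 
nearly full, which is a simplification of what was described. It needs 
Java 8 or later for the lambdas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class QueueBatcher {

    /** Drains the queue in batches until the producer has finished and the queue is empty. */
    static <T> int drainInBatches(BlockingQueue<T> queue, int batchSize,
                                  Thread producer, Consumer<List<T>> sink)
            throws InterruptedException {
        int total = 0;
        while (true) {
            // Wait briefly for at least one document.
            T first = queue.poll(100, TimeUnit.MILLISECONDS);
            if (first == null) {
                // Nothing arrived: stop only once the producer is done and nothing is left.
                if (!producer.isAlive() && queue.isEmpty()) {
                    return total;
                }
                continue;
            }
            List<T> batch = new ArrayList<>(batchSize);
            batch.add(first);
            queue.drainTo(batch, batchSize - 1); // take whatever else is ready, up to batchSize
            sink.accept(batch);                  // assumption: a real sink would send the batch to Solr
            total += batch.size();
        }
    }

    /** Feeds docCount dummy documents through a bounded queue; returns how many reached the sink. */
    static int indexAll(int docCount, int queueCapacity, int batchSize) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < docCount; i++) {
                    queue.put("doc-" + i); // blocks while the queue is full, bounding memory use
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            return drainInBatches(queue, batchSize, producer, batch -> { /* no-op stand-in sink */ });
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("documents sent: " + indexAll(25_000, 10_000, 1_000));
    }
}
```

Because put() blocks when the queue is full, the document creator can never 
get more than queueCapacity documents ahead of the consumer, which is what 
keeps memory bounded.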
