Well, to be fair, I'm indexing on a modest virtualized machine with only 2 GB of RAM, and my doc size of 5-10k may be substantially larger than what you have. It could be substantially smaller, too. As another point of reference, my index ends up being about 20 GB with the 5 million docs.
I should also point out that I only need to do this once. I'm not constantly reindexing everything. My indexed documents rarely change, and when they do, we have a process that selectively updates the few that need it. Combine that with a constant trickle of new documents, and indexing performance isn't much of a concern. You should be able to experiment with a small subset of your documents to quickly test new schemas, etc. In my case I selected a representative sample and stored it in my project for unit testing.

-Kallin Nagelberg

-----Original Message-----
From: Dennis Gearon [mailto:gear...@sbcglobal.net]
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Machine utilization while indexing

It takes that long to do indexing? I'm HOPING to have a site that has low tens of millions of documents to billions.

Sounds to me like I will DEFINITELY need a cloud account at indexing time. For the original author of this thread, that's what I'd recommend.

1/ Optimize as best as you can on one machine.
2/ Set up an Amazon EC2 (Elastic Compute Cloud) account. Spawn/shard the indexing over 5-10 machines during indexing. Combine the indexes, shut down the EC2 instances. You could probably get it down to 1/2 hour, without impacting your current queries.

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life, otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Thu, 5/20/10, Nagelberg, Kallin <knagelb...@globeandmail.com> wrote:

> From: Nagelberg, Kallin <knagelb...@globeandmail.com>
> Subject: RE: Machine utilization while indexing
> To: "'solr-user@lucene.apache.org'" <solr-user@lucene.apache.org>
> Date: Thursday, May 20, 2010, 8:16 AM
> How about throwing a BlockingQueue,
> http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html,
> between your document creator and SolrServer?
> Give it a size of 10,000 or something, with one thread trying to feed it,
> and one thread waiting for it to get near full and then draining it. Take the
> drained results and add them to the server (maybe try not using
> StreamingUpdateSolrServer). Something like that worked well for me with
> about 5,000,000 documents of ~5k each, taking about 8 hours.
>
> -Kallin Nagelberg
>
> -----Original Message-----
> From: Thijs [mailto:vonk.th...@gmail.com]
> Sent: Thursday, May 20, 2010 11:02 AM
> To: solr-user@lucene.apache.org
> Subject: Machine utilization while indexing
>
> Hi.
>
> I have a question about how I can get Solr to index quicker than it does
> at the moment.
>
> I have to index (and re-index) some 3-5 million documents. These
> documents are preprocessed by a Java application that effectively
> combines multiple database tables with each other to form the
> SolrInputDocument.
>
> What I'm seeing, however, is that the queue of documents that are ready to
> be sent to the Solr server exceeds my preset limit, telling me that Solr
> somehow can't process the documents fast enough.
>
> (I have created my own queue in front of Solrj.StreamingUpdateSolrServer,
> as it would not process the documents fast enough, causing
> OutOfMemoryErrors due to the large number of documents building up
> in its queue.)
>
> I have an index that consists for 95% of IDs (Long). We don't do any
> analysis on the fields that are being indexed. The schema is rather
> straightforward.
>
> Most fields look like:
> <fieldType name="long" class="solr.LongField" omitNorms="true"/>
> <field name="objectId" type="long" stored="true" indexed="true" required="true" />
> <field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>
>
> The relevant solrconfig.xml:
> <indexDefaults>
>   <useCompoundFile>false</useCompoundFile>
>   <mergeFactor>100</mergeFactor>
>   <RAMBufferSizeMB>256</RAMBufferSizeMB>
>   <maxMergeDocs>2147483647</maxMergeDocs>
>   <maxFieldLength>10000</maxFieldLength>
>   <writeLockTimeout>1000</writeLockTimeout>
>   <commitLockTimeout>10000</commitLockTimeout>
>   <lockType>single</lockType>
> </indexDefaults>
>
> The machines I'm testing on have an
> Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
> with 4GB of RAM,
> running Linux, Java version 1.6.0_17, Tomcat 6 and Solr version 1.4.
>
> What I'm seeing is that the network almost never reaches more than 10%
> of the 1GB/s connection,
> that the CPU utilization is always below 25% (1 core is used, not the
> others),
> and that there is no heavy disk I/O.
> Also, while indexing, the memory consumption is:
> Free memory: 212.15 MB Total memory: 509.12 MB Max memory: 2730.68 MB
>
> In the beginning (with an empty index) I get 2ms per insert, but
> this slows to 18-19ms per insert.
>
> Are there any tips/tricks I can use to speed up my indexing? I have a
> feeling that my machine is capable of doing more (using more CPUs).
> I just can't figure out how.
>
> Thijs
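
For reference, the BlockingQueue suggestion quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code anyone in this thread actually ran: the class name QueueDrainSketch, the method indexAll, the use of plain Strings as documents, and the sink callback are all made up for the example. In a real SolrJ 1.4 indexer the sink would presumably wrap SolrServer.add(Collection<SolrInputDocument>). The sketch also drains whatever is available, up to a batch size, instead of waiting for the queue to get near full, which is a simplification of what was described.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Sketch of a bounded queue between a document creator and an indexing sink.
// Documents are plain Strings and the sink is a hypothetical callback; both
// are stand-ins for SolrInputDocument and a SolrJ add call.
class QueueDrainSketch {

    // Unique sentinel (compared by identity) that marks the end of input.
    private static final String POISON = new String("<end-of-input>");

    // Feeds docs through a bounded queue and hands them to the sink in batches.
    // Returns the number of documents passed to the sink.
    static int indexAll(Iterable<String> docs, int capacity, int batchSize,
                        Consumer<List<String>> sink) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

        // Producer thread: put() blocks while the queue is full, so the
        // document creator can never run far ahead of the indexer.
        Thread producer = new Thread(() -> {
            try {
                for (String doc : docs) {
                    queue.put(doc);
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "document-creator");
        producer.start();

        // Consumer (the calling thread): wait for one element, then drain
        // whatever else is ready, up to batchSize elements per batch.
        int total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try {
            boolean done = false;
            while (!done) {
                batch.add(queue.take());
                queue.drainTo(batch, batchSize - 1);
                int last = batch.size() - 1;
                if (batch.get(last) == POISON) { // sentinel is always enqueued last
                    batch.remove(last);
                    done = true;
                }
                if (!batch.isEmpty()) {
                    sink.accept(new ArrayList<>(batch)); // e.g. one bulk add per batch
                    total += batch.size();
                }
                batch.clear();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) {
            docs.add("doc-" + i);
        }
        int[] batches = {0};
        int total = indexAll(docs, 10_000, 1_000, b -> batches[0]++);
        System.out.println(total + " documents sent in " + batches[0] + " batches");
    }
}
```

The bounded capacity is what keeps memory in check: when the sink is slower than the document creator, the creator simply blocks on put() instead of piling documents up. The sketch uses one consumer only; whether several consumers would help with the single-core utilization described in the original question is something to measure rather than assume.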