Well, I hope to have around 5 million datasets/documents within 1 year, so this 
is good info. BUT if I DO reach that many, then the market I am aiming at will 
end up giving me 100 times more than that within 2 years.

Are there good references/books on using Solr/Lucene (on Linux/nginx) for 500 
million plus documents? The data is easily shardable geographically, as one 
given.
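Since the data partitions cleanly by geography, the routing side of such a setup can be sketched very simply. The shard names and regions below are hypothetical placeholders, not anything from an actual deployment:

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to
# Solr cores or hosts.
SHARDS = ["shard-na", "shard-eu", "shard-apac"]

def shard_for(region: str) -> str:
    """Deterministically map a region string to one shard.

    Using a hash keeps the mapping stable across restarts without
    a lookup table; a real system might instead use an explicit
    region -> shard map.
    """
    h = int(hashlib.md5(region.encode("utf-8")).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]
```

The point is only that documents route to a shard by a key already present in the data, so each shard's index stays a fraction of the 500M total.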

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Grant Ingersoll <gsing...@apache.org> wrote:

> From: Grant Ingersoll <gsing...@apache.org>
> Subject: Re: Importing large datasets
> To: solr-user@lucene.apache.org
> Date: Wednesday, June 2, 2010, 3:42 AM
> 
> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
> 
> > We have around 5 million items in our index, and each item has a
> > description located on a separate physical database. These item
> > descriptions vary in size and for the most part are quite large.
> > Currently we are only indexing items and not their corresponding
> > descriptions, and a full import takes around 4 hours. Ideally we
> > want to index both our items and their descriptions, but after some
> > quick profiling I determined that a full import would take in excess
> > of 24 hours.
> > 
> > - How would I profile the indexing process to determine whether the
> > bottleneck is Solr or our database?
> 
> As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).
> 
> When you say "quite large", what do you mean? Are we talking books
> here, or maybe a couple pages of text, or just a couple KB of data?
> 
> How long does it take you to get that data out (and, from the sounds
> of it, merge it with your items) w/o going to Solr?
> 
> > - In either case, how would one speed up this process? Is there a
> > way to run parallel import processes and then merge them together at
> > the end? Possibly use some sort of distributed computing?
> 
> DataImportHandler now supports multiple threads. The absolute fastest
> way that I know of to index is via multiple threads sending batches
> of documents at a time (at least 100). Often, from DBs, one can split
> up the table via SQL statements that can then be fetched separately.
> You may want to write your own multithreaded client to index.
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem using Solr/Lucene: 
> http://www.lucidimagination.com/search
> 
>
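Grant's suggestion above — multiple threads, each sending batches of at least 100 documents — can be sketched roughly as follows. This is a minimal illustration, not a production client: the Solr URL and core name are placeholders, and the sender is injectable so the batching logic can be exercised without a running Solr:

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder endpoint; "items" is a hypothetical core name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/items/update"
BATCH_SIZE = 100  # per Grant's "at least 100" suggestion

def batches(docs, size=BATCH_SIZE):
    """Yield successive lists of up to `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def post_batch(batch, url=SOLR_UPDATE_URL):
    """POST one batch of documents to Solr's JSON update handler."""
    req = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def index_all(docs, workers=4, send=post_batch):
    """Fan batches out across `workers` threads via a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batches(docs)))
```

Following the "split up the table via SQL" idea, each worker could instead fetch its own slice of the source table (e.g. by an id-range predicate) rather than sharing one document list, then issue a single commit at the end.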