Well, I hope to have around 5 million datasets/documents within 1 year, so this is good info. BUT if I DO have that many, then the market I am aiming at will end up giving me 100 times that within 2 years.
Are there good references/books on using Solr/Lucene (on Linux/nginx) for 500 million plus documents? The data is easily shardable geographically, as a given.

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Grant Ingersoll <gsing...@apache.org> wrote:

> From: Grant Ingersoll <gsing...@apache.org>
> Subject: Re: Importing large datasets
> To: solr-user@lucene.apache.org
> Date: Wednesday, June 2, 2010, 3:42 AM
>
> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>
> > We have around 5 million items in our index and each item has a
> > description located on a separate physical database. These item
> > descriptions vary in size and for the most part are quite large.
> > Currently we are only indexing items and not their corresponding
> > description, and a full import takes around 4 hours. Ideally we want
> > to index both our items and their descriptions, but after some quick
> > profiling I determined that a full import would take in excess of
> > 24 hours.
> >
> > - How would I profile the indexing process to determine if the
> >   bottleneck is Solr or our database?
>
> As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).
>
> When you say "quite large", what do you mean? Are we talking books
> here, or maybe a couple pages of text, or just a couple KB of data?
>
> How long does it take you to get that data out (and, from the sounds
> of it, merge it with your item) w/o going to Solr?
>
> > - In either case, how would one speed up this process? Is there a
> >   way to run parallel import processes and then merge them together
> >   at the end? Possibly use some sort of distributed computing?
>
> DataImportHandler now supports multiple threads.
> The absolute fastest way that I know of to index is via multiple
> threads sending batches of documents at a time (at least 100). Often,
> from DBs one can split up the table via SQL statements that can then
> be fetched separately. You may want to write your own multithreaded
> client to index.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
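For what it's worth, the approach Grant describes — split the table into contiguous ID ranges (one `WHERE id BETWEEN lo AND hi` query per worker), then have each thread stream batches of at least 100 documents to Solr — can be sketched roughly like this. This is only an illustrative skeleton, not anything from DataImportHandler itself: the row fetch is stubbed, and `send_batch` stands in for whatever actually POSTs a JSON batch to Solr's `/update` endpoint.

```python
# Sketch of a multithreaded batch indexer: partition an ID range into
# per-thread slices (mirroring per-slice SQL queries), then send docs
# in batches of >= 100. The DB fetch and the Solr POST are stubbed.
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 100  # Grant's "at least 100" batch size

def id_slices(min_id, max_id, n_threads):
    """Split [min_id, max_id] into n_threads contiguous ranges,
    one per hypothetical 'WHERE id BETWEEN lo AND hi' query."""
    step = (max_id - min_id + n_threads) // n_threads
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]

def index_slice(lo, hi, send_batch):
    """Fetch rows for one slice (stubbed here as generated dicts)
    and push them to Solr in batches via send_batch. Returns the
    number of documents sent."""
    rows = ({"id": i, "description": "doc %d" % i}
            for i in range(lo, hi + 1))  # stand-in for a DB cursor
    batch, sent = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            send_batch(batch)
            sent += len(batch)
            batch = []
    if batch:                 # flush the final partial batch
        send_batch(batch)
        sent += len(batch)
    return sent

def index_all(min_id, max_id, n_threads, send_batch):
    """Run one worker per ID slice; returns total documents sent."""
    slices = id_slices(min_id, max_id, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        totals = pool.map(lambda s: index_slice(s[0], s[1], send_batch),
                          slices)
    return sum(totals)
```

In a real client, `send_batch` would post the batch as JSON (with a commit at the end, not per batch), and each slice would hold its own DB connection so the workers don't serialize on a shared cursor.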