When adding data continuously, that data is indexed and becomes searchable after a commit, right?
If so, how often does reindexing do any good?

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Wed, 6/2/10, Andrzej Bialecki <a...@getopt.org> wrote:

> From: Andrzej Bialecki <a...@getopt.org>
> Subject: Re: Importing large datasets
> To: solr-user@lucene.apache.org
> Date: Wednesday, June 2, 2010, 4:52 AM
>
> On 2010-06-02 13:12, Grant Ingersoll wrote:
> >
> > On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
> >
> >> On 2010-06-02 12:42, Grant Ingersoll wrote:
> >>>
> >>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
> >>>
> >>>> We have around 5 million items in our index, and each item has a
> >>>> description located in a separate physical database. These item
> >>>> descriptions vary in size and for the most part are quite large.
> >>>> Currently we are indexing only the items, not their corresponding
> >>>> descriptions, and a full import takes around 4 hours. Ideally we
> >>>> want to index both our items and their descriptions, but after some
> >>>> quick profiling I determined that a full import would take in
> >>>> excess of 24 hours.
> >>>>
> >>>> - How would I profile the indexing process to determine whether the
> >>>> bottleneck is Solr or our database?
> >>>
> >>> As a data point, I routinely see clients index 5M items on normal
> >>> hardware in approx. 1 hour (give or take 30 minutes).
> >>>
> >>> When you say "quite large", what do you mean? Are we talking books
> >>> here, or maybe a couple of pages of text, or just a couple KB of data?
> >>>
> >>> How long does it take you to get that data out (and, from the sounds
> >>> of it, merge it with your items) w/o going to Solr?
> >>>
> >>>> - In either case, how would one speed up this process? Is there a
> >>>> way to run parallel import processes and then merge them together
> >>>> at the end? Possibly use some sort of distributed computing?
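As an aside on the parallel-import question quoted above: one common pattern is to partition the table's primary-key range into non-overlapping slices, so each worker can `SELECT ... WHERE id BETWEEN lo AND hi` and index its own slice independently. A rough sketch of the partitioning step (just the range math; the SQL and Solr calls are left as comments, since the real client code depends on your setup):

```python
def partition_ids(min_id, max_id, workers):
    """Split an inclusive [min_id, max_id] key range into
    roughly equal, non-overlapping slices, one per worker."""
    total = max_id - min_id + 1
    size = (total + workers - 1) // workers  # ceiling division
    slices = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + size - 1, max_id)
        slices.append((lo, hi))
        lo = hi + 1
    return slices

# Each worker would then run something like:
#   SELECT ... FROM items WHERE id BETWEEN lo AND hi
# and send its rows to Solr in batches.
print(partition_ids(1, 10, 3))  # → [(1, 4), (5, 8), (9, 10)]
```

The slices cover the whole range exactly once, so parallel imports never fetch or index the same row twice and nothing needs merging afterwards.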
> >>>
> >>> DataImportHandler now supports multiple threads. The absolute
> >>> fastest way that I know of to index is via multiple threads sending
> >>> batches of documents at a time (at least 100). Often, from DBs one
> >>> can split up the table via SQL statements that can then be fetched
> >>> separately. You may want to write your own multithreaded client to
> >>> index.
> >>
> >> SOLR-1301 is also an option if you are familiar with Hadoop ...
> >
> > If the bottleneck is the DB, will that do much?
>
> Nope. But the workflow could be set up so that during night hours a DB
> export takes place that results in a CSV or SolrXML file (there you
> could measure the time it takes to do this export), and then indexing
> can work from this file.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web; Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
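To illustrate Grant's advice about multiple threads each sending batches of at least 100 documents: a minimal sketch using a thread pool, where `post_batch` is a hypothetical stand-in for whatever actually sends documents to Solr (in a real client, an HTTP POST to the update handler):

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 100  # at least 100 docs per request, per the advice above

def post_batch(batch):
    """Hypothetical stand-in: send one batch of docs to Solr.
    A real client would POST these to Solr's update handler."""
    return len(batch)

def index_all(docs, threads=4):
    """Chunk docs into batches and post them from a thread pool."""
    batches = [docs[i:i + BATCH_SIZE]
               for i in range(0, len(docs), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        sent = sum(pool.map(post_batch, batches))
    return sent  # total documents posted

print(index_all([{"id": i} for i in range(1050)]))  # → 1050
```

The same loop works whether the batches come straight from the DB or from the nightly CSV/SolrXML export Andrzej describes; the export just decouples the slow DB read from the (usually much faster) indexing step, so each can be timed and tuned separately.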