When adding data continuously, that data is available after committing and is
indexed, right?
If so, how often does reindexing actually do any good?
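A toy sketch of that commit-visibility rule — newly added documents only become searchable after a commit. The classes below are hypothetical stand-ins to illustrate the pending-vs-visible distinction, not Solr's actual API:

```python
# Toy model of commit visibility (hypothetical, not Solr internals):
# add() stages a document; only commit() makes staged docs searchable.
class ToyIndex:
    def __init__(self):
        self._pending = []   # added but not yet committed
        self._visible = []   # committed, searchable

    def add(self, doc):
        self._pending.append(doc)

    def commit(self):
        self._visible.extend(self._pending)
        self._pending.clear()

    def search(self):
        return list(self._visible)

idx = ToyIndex()
idx.add("doc1")
print(idx.search())   # empty: not committed yet
idx.commit()
print(idx.search())   # doc1 now visible
```

As for reindexing: continuous adds plus commits keep the index current, so a full reindex is generally only needed when the schema or analysis configuration changes.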

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 6/2/10, Andrzej Bialecki <a...@getopt.org> wrote:

> From: Andrzej Bialecki <a...@getopt.org>
> Subject: Re: Importing large datasets
> To: solr-user@lucene.apache.org
> Date: Wednesday, June 2, 2010, 4:52 AM
> On 2010-06-02 13:12, Grant Ingersoll wrote:
> > 
> > On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
> > 
> >> On 2010-06-02 12:42, Grant Ingersoll wrote:
> >>>
> >>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
> >>>
> >>>>
> >>>> We have around 5 million items in our index and each item has a
> >>>> description located on a separate physical database. These item
> >>>> descriptions vary in size and for the most part are quite large.
> >>>> Currently we are only indexing items and not their corresponding
> >>>> description and a full import takes around 4 hours. Ideally we want
> >>>> to index both our items and their descriptions but after some quick
> >>>> profiling I determined that a full import would take in excess of
> >>>> 24 hours.
> >>>>
> >>>> - How would I profile the indexing process to determine whether the
> >>>> bottleneck is Solr or our database?
> >>>
> >>> As a data point, I routinely see clients index 5M items on normal
> >>> hardware in approx. 1 hour (give or take 30 minutes).
> >>>
> >>> When you say "quite large", what do you mean?  Are we talking books
> >>> here or maybe a couple pages of text or just a couple KB of data?
> >>>
> >>> How long does it take you to get that data out (and, from the sounds
> >>> of it, merge it with your item) w/o going to Solr?
> >>>
> >>>> - In either case, how would one speed up this process? Is there a
> >>>> way to run parallel import processes and then merge them together
> >>>> at the end? Possibly use some sort of distributed computing?
> >>>
> >>> DataImportHandler now supports multiple threads.  The absolute
> >>> fastest way that I know of to index is via multiple threads sending
> >>> batches of documents at a time (at least 100).  Often, from DBs one
> >>> can split up the table via SQL statements that can then be fetched
> >>> separately.  You may want to write your own multithreaded client to
> >>> index.
> >>
> >> SOLR-1301 is also an option if you are familiar with Hadoop ...
> >>
> > 
> > If the bottleneck is the DB, will that do much?
> > 
> 
> Nope. But the workflow could be set up so that during night hours a DB
> export takes place that results in a CSV or SolrXML file (there you
> could measure the time it takes to do this export), and then indexing
> can work from this file.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
>
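The table-splitting idea Grant mentions — partitioning the source table by primary key so each indexing thread fetches its own slice — can be sketched as below. The table and column names are hypothetical; any contiguous numeric ID works:

```python
# Sketch: split an ID range [min_id, max_id] into `parts` contiguous,
# non-overlapping inclusive ranges, one per indexing thread/process.
def split_ranges(min_id, max_id, parts):
    """Return a list of (lo, hi) inclusive ranges covering the span."""
    total = max_id - min_id + 1
    step = total // parts
    ranges = []
    lo = min_id
    for i in range(parts):
        # Last range absorbs any remainder from integer division.
        hi = max_id if i == parts - 1 else lo + step - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Each worker would then run its own bounded query (hypothetical table):
queries = [
    "SELECT id, description FROM items WHERE id BETWEEN %d AND %d" % r
    for r in split_ranges(1, 5000000, 4)
]
for q in queries:
    print(q)
```

Each query can then be fed to a separate DataImportHandler entity or a thread in a custom client, and the results indexed in batches.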
