On 2010-06-02 13:12, Grant Ingersoll wrote:
>
> On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
>
>> On 2010-06-02 12:42, Grant Ingersoll wrote:
>>>
>>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>>>
>>>>
>>>> We have around 5 million items in our index and each item has a description
>>>> located on a separate physical database. These item descriptions vary in
>>>> size and for the most part are quite large. Currently we are only indexing
>>>> items and not their corresponding description and a full import takes
>>>> around 4 hours. Ideally we want to index both our items and their
>>>> descriptions but after some quick profiling I determined that a full import
>>>> would take in excess of 24 hours.
>>>>
>>>> - How would I profile the indexing process to determine if the bottleneck
>>>> is Solr or our Database.
>>>
>>> As a data point, I routinely see clients index 5M items on normal
>>> hardware in approx. 1 hour (give or take 30 minutes).
>>>
>>> When you say "quite large", what do you mean? Are we talking books here or
>>> maybe a couple pages of text or just a couple KB of data?
>>>
>>> How long does it take you to get that data out (and, from the sounds of it,
>>> merge it with your item) w/o going to Solr?
>>>
>>>> - In either case, how would one speed up this process? Is there a way to
>>>> run parallel import processes and then merge them together at the end?
>>>> Possibly use some sort of distributed computing?
>>>
>>> DataImportHandler now supports multiple threads. The absolute fastest way
>>> that I know of to index is via multiple threads sending batches of
>>> documents at a time (at least 100). Often, from DBs one can split up the
>>> table via SQL statements that can then be fetched separately. You may want
>>> to write your own multithreaded client to index.
>>
>> SOLR-1301 is also an option if you are familiar with Hadoop ...
>>
>
> If the bottleneck is the DB, will that do much?
>
Nope. But the workflow could be set up so that during night hours a DB
export takes place that results in a CSV or SolrXML file (there you
could measure the time it takes to do this export), and then indexing
can work from this file.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
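
A rough sketch of the timed export step described above, in case it helps. It is only an illustration, not something from this thread: the `items` table, its columns, the `export_to_csv` and `make_demo_db` helpers, and the output file name are all made up, and an in-memory SQLite database stands in for the real item and description databases. The idea is just to time the DB side on its own, with Solr out of the picture.

```python
import csv
import sqlite3
import time


def export_to_csv(conn, query, path):
    """Run `query` on `conn`, write the rows to `path` as CSV.

    Returns (row_count, elapsed_seconds) so the export can be timed
    separately from any indexing that happens afterwards.
    """
    start = time.perf_counter()
    cursor = conn.execute(query)
    header = [col[0] for col in cursor.description]
    row_count = 0
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)  # header row with the column names
        for row in cursor:       # stream rows instead of loading them all
            writer.writerow(row)
            row_count += 1
    return row_count, time.perf_counter() - start


def make_demo_db(n_items=1000):
    """Hypothetical stand-in for the real database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?)",
        [(i, f"item {i}", "long description text " * 50) for i in range(n_items)],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    rows, seconds = export_to_csv(
        conn, "SELECT id, name, description FROM items", "items_export_demo.csv"
    )
    print(f"exported {rows} rows in {seconds:.3f}s")
```

If the export alone already takes most of the 24 hours, the database is the bottleneck and no amount of Solr-side parallelism will help much; if it is fast, the resulting file can be indexed separately and that step timed on its own.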