Re: Importing large datasets

Andrzej Bialecki Wed, 02 Jun 2010 04:53:04 -0700

On 2010-06-02 13:12, Grant Ingersoll wrote:
> 
> On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:
> 
>> On 2010-06-02 12:42, Grant Ingersoll wrote:
>>>
>>> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>>>
>>>>
>>>> We have around 5 million items in our index and each item has a description
>>>> located on a separate physical database. These item descriptions vary in
>>>> size and for the most part are quite large. Currently we are only indexing
>>>> items and not their corresponding description and a full import takes around
>>>> 4 hours. Ideally we want to index both our items and their descriptions but
>>>> after some quick profiling I determined that a full import would take in
>>>> excess of 24 hours. 
>>>>
>>>> - How would I profile the indexing process to determine if the bottleneck
>>>> is Solr or our database?
>>>
>>> As a data point, I routinely see clients index 5M items on normal
>>> hardware in approx. 1 hour (give or take 30 minutes).  
>>>
>>> When you say "quite large", what do you mean?  Are we talking books here or 
>>> maybe a couple pages of text or just a couple KB of data?
>>>
>>> How long does it take you to get that data out (and, from the sounds of it, 
>>> merge it with your item) w/o going to Solr?
>>>
>>>> - In either case, how would one speed up this process? Is there a way to run
>>>> parallel import processes and then merge them together at the end? Possibly
>>>> use some sort of distributed computing?
>>>
>>> DataImportHandler now supports multiple threads.  The absolute fastest way 
>>> that I know of to index is via multiple threads sending batches of 
>>> documents at a time (at least 100).  Often, from DBs one can split up the 
>>> table via SQL statements that can then be fetched separately.  You may want 
>>> to write your own multithreaded client to index.
>>
>> SOLR-1301 is also an option if you are familiar with Hadoop ...
>>
> 
> If the bottleneck is the DB, will that do much?
>


Nope. But the workflow could be set up so that a DB export runs
overnight and produces a CSV or SolrXML file (at that point you could
measure how long the export itself takes), and indexing can then work
from this file.
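
To make the "measure the export" step concrete, here is a minimal
sketch in Python. It is only an illustration under assumptions: an
in-memory SQLite database stands in for the real item and description
databases, and the table name, column names, file name, and the
export_to_csv/demo helpers are all hypothetical. The point is simply
to time the DB-to-CSV step on its own, with Solr out of the picture.

```python
import csv
import sqlite3
import time


def export_to_csv(conn, query, out_path, fetch_size=1000):
    """Stream the rows of `query` into a CSV file.

    Returns (row_count, elapsed_seconds) so the export can be timed
    separately from any indexing that happens later.
    """
    start = time.perf_counter()
    cursor = conn.execute(query)
    # Use the column names reported by the driver as the CSV header.
    header = [col[0] for col in cursor.description]
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        while True:
            # Fetch in chunks so large tables are never held in memory at once.
            batch = cursor.fetchmany(fetch_size)
            if not batch:
                break
            writer.writerows(batch)
            rows_written += len(batch)
    return rows_written, time.perf_counter() - start


def demo(out_path="items_export.csv"):
    # Assumption: in-memory SQLite stands in for the production database(s).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO items (id, name, description) VALUES (?, ?, ?)",
        [(i, f"item {i}", "lorem ipsum " * 50) for i in range(1, 5001)],
    )
    rows, seconds = export_to_csv(
        conn, "SELECT id, name, description FROM items ORDER BY id", out_path
    )
    conn.close()
    return rows, seconds


if __name__ == "__main__":
    rows, seconds = demo()
    print(f"exported {rows} rows in {seconds:.2f}s")
```

If this step alone accounts for most of the 24 hours, the database is
the bottleneck, and no amount of tuning on the Solr side will help.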


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
