Re: Importing large datasets

Blargy Wed, 02 Jun 2010 19:31:23 -0700


Lance Norskog-2 wrote:
> 
> Wait! You're fetching records from one database and then doing lookups
> against another DB? That makes this a completely different problem.
> 
> The DIH does not to my knowledge have the ability to "pool" these
> queries. That is, it will not build a batch of 1000 keys from
> datasource1 and then do a query against datasource2 with:
>     select foo where key_field IN (key1, key2,... key1000);
> 
> This is the efficient way to do what you want. You'll have to write
> your own client to do this.
> 
> On Wed, Jun 2, 2010 at 12:00 PM, David Stuart
> <david.stu...@progressivealliance.co.uk> wrote:
>> How long does it take to do a grab of all the data via SQL? I found that
>> denormalizing the data into a lookup table let me index about 300k rows
>> of similar data size, with DIH regex splitting on some fields, in about
>> 8 mins. I know it's not quite the same scale, but with batching...
>>
>> David Stuart
>>
>> On 2 Jun 2010, at 17:58, Blargy <zman...@hotmail.com> wrote:
>>
>>>
>>>
>>>
>>>> One thing that might help indexing speed - create a *single* SQL query
>>>> to grab all the data you need without using DIH's sub-entities, at
>>>> least the non-cached ones.
>>>>
>>>
>>> Not sure how much that would help. As I mentioned, without the item
>>> description import the full process takes 4 hours, which is bearable.
>>> However, once I started to import the item description, which is
>>> located on a separate machine/database, the import process exploded to
>>> over 24 hours.
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
>


What's more efficient for MySQL: a batch size of 1000 or -1? Is it so slow
because I am using 2 different datasources?

Say I am using just one datasource: should I still be seeing "Creating a
connection for entity ...." for each sub-entity in the document, or should it
just be using one connection?
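
For reference, the batched lookup Lance describes above (collect a batch of
keys from datasource1, then run one IN (...) query against datasource2) could
be sketched roughly like this. It is only an illustration in Python, run
outside DIH: the table and column names (items, descriptions, item_id, body)
and both helper functions are made up, and two in-memory SQLite databases
stand in for the two MySQL servers.

```python
import sqlite3
from itertools import islice


def chunks(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def join_in_batches(src, lookup, batch_size=500):
    """Yield one document per source row, looking descriptions up per batch.

    `src` and `lookup` are separate DB-API connections (hypothetical schema).
    One IN (...) query is issued per batch instead of one query per row.
    """
    rows = src.execute("SELECT id, name FROM items ORDER BY id")
    for batch in chunks(rows, batch_size):
        keys = [item_id for item_id, _ in batch]
        placeholders = ",".join("?" * len(keys))  # "?,?,?" for three keys
        found = dict(lookup.execute(
            "SELECT item_id, body FROM descriptions "
            f"WHERE item_id IN ({placeholders})",
            keys,
        ))
        for item_id, name in batch:
            # Rows with no match in the second database get None.
            yield {"id": item_id, "name": name,
                   "description": found.get(item_id)}


def demo():
    """Build two tiny in-memory databases and join them in batches of 3."""
    src = sqlite3.connect(":memory:")
    lookup = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    lookup.execute(
        "CREATE TABLE descriptions (item_id INTEGER PRIMARY KEY, body TEXT)")
    src.executemany("INSERT INTO items VALUES (?, ?)",
                    [(i, f"item {i}") for i in range(1, 8)])
    # Item 4 deliberately has no description.
    lookup.executemany("INSERT INTO descriptions VALUES (?, ?)",
                       [(i, f"desc {i}") for i in range(1, 8) if i != 4])
    return list(join_in_batches(src, lookup, batch_size=3))
```

With 7 source rows and a batch size of 3 this issues 3 lookup queries rather
than 7. The default of 500 keys per batch is only there to stay under the
bound-parameter limit of older SQLite builds; against MySQL the 1000 keys
Lance mentions would be the natural starting point.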




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866499.html
Sent from the Solr - User mailing list archive at Nabble.com.
