Can you do just one join in the top-level query? The DIH does not have a batching mechanism for these per-row sub-entity joins, but your database does.
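For illustration, a single-query version of the entity quoted further down might look like this. This is only a sketch: it assumes MySQL, whose GROUP_CONCAT can fold the artist rows into one column per track (relying on MySQL's relaxed GROUP BY handling), and DIH's RegexTransformer to split that column back into multiple field values; the table and column names are taken from Robert's config below:

<entity name="track" transformer="RegexTransformer,TemplateTransformer"
        query="select t.id as id, t.title as title, l.title as label,
                      group_concat(a.name separator '|') as artist
               from track t
               left join label l on (l.id = t.label_id)
               left join track_artist ta on (ta.track_id = t.id)
               left join artist a on (a.id = ta.artist_id)
               where t.deleted = 0
               group by t.id">
  <field column="artist" name="artists_t" splitBy="\|" />
  <!-- remaining field mappings as in the original config -->
</entity>

The database then does the fan-out in one pass, instead of DIH issuing a separate artist query for every track row.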
On Wed, Dec 15, 2010 at 7:11 AM, Tim Heckman <theck...@gmail.com> wrote:
> The custom import I wrote is a Java application that uses the SolrJ
> library. Basically, where I had sub-entities in the DIH config, I did
> the mappings inside my Java code.
>
> 1. Identify a subset or "chunk" of the primary ids to work on (so I
> don't have to load everything into memory at once) and put those in a
> temp table. I used a modulus on the id.
> 2. Select all of the outer entity from the database (joining on the
> ids in the temp table), and load the data from that result set into
> new Solr input documents. I keep these in a hash map keyed on the ids.
> 3. Then select all of the inner entity, joining on the ids from the
> temp table. The result set has to include the ids from step 2. I go
> through this result set and load the data into the matching Solr input
> documents from step 2.
> 4. Push that set of input documents to Solr (optionally committing
> them), then go back to step 1 using the next subset or chunk.
>
> Not sure if this is the absolute best approach, but it's working well
> enough for my specific case.
>
> Tim
>
> 2010/12/15 Robert Gründler <rob...@dubture.com>:
>> I've already benchmarked the import with 500k records, one time
>> without the artists subquery, and one time without the join in the
>> main query:
>>
>> Without subquery: 500k in 3 min 30 sec
>> Without join and without subquery: 500k in 2 min 30 sec
>> With subquery and with left join: 320k in 6 min 30 sec
>>
>> So the joins/subqueries are definitely a bottleneck.
>>
>> How exactly did you implement the custom data import?
>>
>> In our case, we need to de-normalize the relations of the SQL data
>> for the index, so I fear I can't really get rid of the join/subquery.
>>
>> -robert
>>
>> On Dec 15, 2010, at 15:43, Tim Heckman wrote:
>>
>>> 2010/12/15 Robert Gründler <rob...@dubture.com>:
>>>> The data-config.xml looks like this (only 1 entity):
>>>>
>>>> <entity name="track" query="select t.id as id, t.title as title,
>>>>     l.title as label from track t left join label l on (l.id = t.label_id)
>>>>     where t.deleted = 0" transformer="TemplateTransformer">
>>>>   <field column="title" name="title_t" />
>>>>   <field column="label" name="label_t" />
>>>>   <field column="id" name="sf_meta_id" />
>>>>   <field column="metaclass" template="Track" name="sf_meta_class"/>
>>>>   <field column="metaid" template="${track.id}" name="sf_meta_id"/>
>>>>   <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
>>>>
>>>>   <entity name="artists" query="select a.name as artist from artist a
>>>>       left join track_artist ta on (ta.artist_id = a.id) where
>>>>       ta.track_id=${track.id}">
>>>>     <field column="artist" name="artists_t" />
>>>>   </entity>
>>>>
>>>> </entity>
>>>
>>> So there's one track entity with an artist sub-entity. My (admittedly
>>> rather limited) experience has been that sub-entities, where you have
>>> to run a separate query for every row in the parent entity, really
>>> slow down data import. For my own purposes, I wrote a custom data
>>> import using SolrJ to improve the performance (from 3 hours to 10
>>> minutes).
>>>
>>> Just as a test, how long does it take if you comment out the artists
>>> entity?

--
Lance Norskog
goks...@gmail.com
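Putting Tim's four steps together, a minimal sketch of the chunked SolrJ import might look like the following. This is hypothetical code: it assumes a MySQL source, the track/artist schema from the quoted data-config.xml, and the SolrJ 1.4-era CommonsHttpSolrServer API; the chunk count, URL and credentials are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChunkedTrackImport {

    public static void main(String[] args) throws Exception {
        // Illustrative endpoint and credentials -- replace with real ones.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Class.forName("com.mysql.jdbc.Driver");
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/music", "user", "password");

        final int chunks = 16; // number of id buckets: id % chunks picks one chunk

        for (int chunk = 0; chunk < chunks; chunk++) {
            Statement st = db.createStatement();

            // Step 1: put one "chunk" of primary ids into a temp table,
            // using a modulus on the id.
            st.execute("DROP TEMPORARY TABLE IF EXISTS chunk_ids");
            st.execute("CREATE TEMPORARY TABLE chunk_ids AS "
                    + "SELECT id FROM track WHERE deleted = 0 "
                    + "AND id % " + chunks + " = " + chunk);

            // Step 2: load the outer entity for this chunk and build the
            // Solr input documents, kept in a hash map keyed on the id.
            Map<Long, SolrInputDocument> docs = new HashMap<Long, SolrInputDocument>();
            ResultSet rs = st.executeQuery(
                    "SELECT t.id, t.title, l.title AS label FROM track t "
                    + "JOIN chunk_ids c ON c.id = t.id "
                    + "LEFT JOIN label l ON l.id = t.label_id");
            while (rs.next()) {
                long id = rs.getLong("id");
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("sf_meta_id", id);
                doc.addField("sf_meta_class", "Track");
                doc.addField("sf_unique_id", "Track_" + id);
                doc.addField("title_t", rs.getString("title"));
                String label = rs.getString("label");
                if (label != null) {
                    doc.addField("label_t", label);
                }
                docs.put(id, doc);
            }
            rs.close();

            // Step 3: load the inner entity for the same chunk; the track_id
            // column lets us find the matching document from step 2.
            rs = st.executeQuery(
                    "SELECT ta.track_id, a.name FROM track_artist ta "
                    + "JOIN chunk_ids c ON c.id = ta.track_id "
                    + "JOIN artist a ON a.id = ta.artist_id");
            while (rs.next()) {
                SolrInputDocument doc = docs.get(rs.getLong("track_id"));
                if (doc != null) {
                    doc.addField("artists_t", rs.getString("name"));
                }
            }
            rs.close();
            st.close();

            // Step 4: push this chunk's documents to Solr, then move on to
            // the next chunk. (Tim commits optionally per chunk; this sketch
            // commits once at the end instead.)
            if (!docs.isEmpty()) {
                solr.add(docs.values());
            }
        }
        solr.commit();
        db.close();
    }
}

Two queries per chunk replace the one-query-per-track pattern of the DIH sub-entity, which is where the speedup Tim reports (3 hours down to 10 minutes) comes from.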