Re: Dataimport performance

Alexey Serba Sun, 19 Dec 2010 14:31:02 -0800

> With subquery and with left join:   320k in 6 Min 30
It's 820 records per second. It's _really_ impressive considering the
fact that DIH performs separate sql query for every record in your
case.


>> So there's one track entity with an artist sub-entity. My (admittedly
>> rather limited) experience has been that sub-entities, where you have
>> to run a separate query for every row in the parent entity, really
>> slow down data import.
Sub entities slows down data import indeed. You can try to avoid
separate query for every row by using CachedSqlEntityProcessor. There
are couple of options - 1) you can load all sub-entity data in memory
or 2) you can reduce the number of sql queries by caching sub entity
data per id. There's no silver bullet and each option has its own pros
and cons.

Also Ephraim proposed a really neat solution with GROUP_CONCAT, but
I'm not sure that all RDBMS-es support that.


2010/12/15 Robert Gründler <rob...@dubture.com>:
> i've benchmarked the import already with 500k records, one time without the 
> artists subquery, and one time without the join in the main query:
>
>
> Without subquery: 500k in 3 min 30 sec
>
> Without join and without subquery: 500k in 2 min 30.
>
> With subquery and with left join:   320k in 6 Min 30
>
>
> so the joins / subqueries are definitely a bottleneck.
>
> How exactly did you implement the custom data import?
>
> In our case, we need to de-normalize the relations of the sql data for the 
> index,
> so i fear i can't really get rid of the join / subquery.
>
>
> -robert
>
>
>
>
>
> On Dec 15, 2010, at 15:43 , Tim Heckman wrote:
>
>> 2010/12/15 Robert Gründler <rob...@dubture.com>:
>>> The data-config.xml looks like this (only 1 entity):
>>>
>>>      <entity name="track" query="select t.id as id, t.title as title, 
>>> l.title as label from track t left join label l on (l.id = t.label_id) 
>>> where t.deleted = 0" transformer="TemplateTransformer">
>>>        <field column="title" name="title_t" />
>>>        <field column="label" name="label_t" />
>>>        <field column="id" name="sf_meta_id" />
>>>        <field column="metaclass" template="Track" name="sf_meta_class"/>
>>>        <field column="metaid" template="${track.id}" name="sf_meta_id"/>
>>>        <field column="uniqueid" template="Track_${track.id}" 
>>> name="sf_unique_id"/>
>>>
>>>        <entity name="artists" query="select a.name as artist from artist a 
>>> left join track_artist ta on (ta.artist_id = a.id) where 
>>> ta.track_id=${track.id}">
>>>          <field column="artist" name="artists_t" />
>>>        </entity>
>>>
>>>      </entity>
>>
>> So there's one track entity with an artist sub-entity. My (admittedly
>> rather limited) experience has been that sub-entities, where you have
>> to run a separate query for every row in the parent entity, really
>> slow down data import. For my own purposes, I wrote a custom data
>> import using SolrJ to improve the performance (from 3 hours to 10
>> minutes).
>>
>> Just as a test, how long does it take if you comment out the artists entity?
>
>

Re: Dataimport performance

Reply via email to