Can you do just one join in the top-level query? The DIH does not have a batching mechanism for these per-row sub-entity joins, but your database does.
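For illustration, a single-query version of the entity quoted further down might look like this. This is only a sketch: it assumes MySQL, whose GROUP_CONCAT can fold the artist rows into one column per track (relying on MySQL's relaxed GROUP BY handling), and DIH's RegexTransformer to split that column back into multiple field values; the table and column names are taken from Robert's config below:

<entity name="track" transformer="RegexTransformer,TemplateTransformer"
        query="select t.id as id, t.title as title, l.title as label,
                      group_concat(a.name separator '|') as artist
               from track t
               left join label l on (l.id = t.label_id)
               left join track_artist ta on (ta.track_id = t.id)
               left join artist a on (a.id = ta.artist_id)
               where t.deleted = 0
               group by t.id">
  <field column="artist" name="artists_t" splitBy="\|" />
  <!-- remaining field mappings as in the original config -->
</entity>

The database then does the fan-out in one pass, instead of DIH issuing a separate artist query for every track row.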
On Wed, Dec 15, 2010 at 7:11 AM, Tim Heckman <theck...@gmail.com> wrote:
> The custom import I wrote is a Java application that uses the SolrJ
> library. Basically, where I had sub-entities in the DIH config, I did
> the mappings inside my Java code.
>
> 1. Identify a subset or "chunk" of the primary ids to work on (so I
> don't have to load everything into memory at once) and put those in a
> temp table. I used a modulus on the id.
> 2. Select all of the outer entity from the database (joining on the
> ids in the temp table), and load the data from that result set into
> new Solr input documents. I keep these in a hash map keyed on the ids.
> 3. Then select all of the inner entity, joining on the ids from the
> temp table. The result set has to include the ids from step 2. I go
> through this result set and load the data into the matching Solr input
> documents from step 2.
> 4. Push that set of input documents to Solr (optionally committing
> them), then go back to step 1 using the next subset or chunk.
>
> Not sure if this is the absolute best approach, but it's working well
> enough for my specific case.
>
> Tim
>
> 2010/12/15 Robert Gründler <rob...@dubture.com>:
>> I've already benchmarked the import with 500k records, one time
>> without the artists subquery, and one time without the join in the
>> main query:
>>
>> Without subquery: 500k in 3 min 30 sec
>> Without join and without subquery: 500k in 2 min 30 sec
>> With subquery and with left join: 320k in 6 min 30 sec
>>
>> So the joins/subqueries are definitely a bottleneck.
>>
>> How exactly did you implement the custom data import?
>>
>> In our case, we need to de-normalize the relations of the SQL data
>> for the index, so I fear I can't really get rid of the join/subquery.
>>
>> -robert
>>
>> On Dec 15, 2010, at 15:43, Tim Heckman wrote:
>>
>>> 2010/12/15 Robert Gründler <rob...@dubture.com>:
>>>> The data-config.xml looks like this (only 1 entity):
>>>>
>>>> <entity name="track" query="select t.id as id, t.title as title,
>>>>     l.title as label from track t left join label l on (l.id = t.label_id)
>>>>     where t.deleted = 0" transformer="TemplateTransformer">
>>>>   <field column="title" name="title_t" />
>>>>   <field column="label" name="label_t" />
>>>>   <field column="id" name="sf_meta_id" />
>>>>   <field column="metaclass" template="Track" name="sf_meta_class"/>
>>>>   <field column="metaid" template="${track.id}" name="sf_meta_id"/>
>>>>   <field column="uniqueid" template="Track_${track.id}" name="sf_unique_id"/>
>>>>
>>>>   <entity name="artists" query="select a.name as artist from artist a
>>>>       left join track_artist ta on (ta.artist_id = a.id) where
>>>>       ta.track_id=${track.id}">
>>>>     <field column="artist" name="artists_t" />
>>>>   </entity>
>>>>
>>>> </entity>
>>>
>>> So there's one track entity with an artist sub-entity. My (admittedly
>>> rather limited) experience has been that sub-entities, where you have
>>> to run a separate query for every row in the parent entity, really
>>> slow down data import. For my own purposes, I wrote a custom data
>>> import using SolrJ to improve the performance (from 3 hours to 10
>>> minutes).
>>>
>>> Just as a test, how long does it take if you comment out the artists
>>> entity?

--
Lance Norskog
goks...@gmail.com
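Putting Tim's four steps together, a minimal sketch of the chunked SolrJ import might look like the following. This is hypothetical code: it assumes a MySQL source, the track/artist schema from the quoted data-config.xml, and the SolrJ 1.4-era CommonsHttpSolrServer API; the chunk count, URL and credentials are illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChunkedTrackImport {

    public static void main(String[] args) throws Exception {
        // Illustrative endpoint and credentials -- replace with real ones.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Class.forName("com.mysql.jdbc.Driver");
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost/music", "user", "password");

        final int chunks = 16; // number of id buckets: id % chunks picks one chunk

        for (int chunk = 0; chunk < chunks; chunk++) {
            Statement st = db.createStatement();

            // Step 1: put one "chunk" of primary ids into a temp table,
            // using a modulus on the id.
            st.execute("DROP TEMPORARY TABLE IF EXISTS chunk_ids");
            st.execute("CREATE TEMPORARY TABLE chunk_ids AS "
                    + "SELECT id FROM track WHERE deleted = 0 "
                    + "AND id % " + chunks + " = " + chunk);

            // Step 2: load the outer entity for this chunk and build the
            // Solr input documents, kept in a hash map keyed on the id.
            Map<Long, SolrInputDocument> docs = new HashMap<Long, SolrInputDocument>();
            ResultSet rs = st.executeQuery(
                    "SELECT t.id, t.title, l.title AS label FROM track t "
                    + "JOIN chunk_ids c ON c.id = t.id "
                    + "LEFT JOIN label l ON l.id = t.label_id");
            while (rs.next()) {
                long id = rs.getLong("id");
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("sf_meta_id", id);
                doc.addField("sf_meta_class", "Track");
                doc.addField("sf_unique_id", "Track_" + id);
                doc.addField("title_t", rs.getString("title"));
                String label = rs.getString("label");
                if (label != null) {
                    doc.addField("label_t", label);
                }
                docs.put(id, doc);
            }
            rs.close();

            // Step 3: load the inner entity for the same chunk; the track_id
            // column lets us find the matching document from step 2.
            rs = st.executeQuery(
                    "SELECT ta.track_id, a.name FROM track_artist ta "
                    + "JOIN chunk_ids c ON c.id = ta.track_id "
                    + "JOIN artist a ON a.id = ta.artist_id");
            while (rs.next()) {
                SolrInputDocument doc = docs.get(rs.getLong("track_id"));
                if (doc != null) {
                    doc.addField("artists_t", rs.getString("name"));
                }
            }
            rs.close();
            st.close();

            // Step 4: push this chunk's documents to Solr, then move on to
            // the next chunk. (Tim commits optionally per chunk; this sketch
            // commits once at the end instead.)
            if (!docs.isEmpty()) {
                solr.add(docs.values());
            }
        }
        solr.commit();
        db.close();
    }
}

Two queries per chunk replace the one-query-per-track pattern of the DIH sub-entity, which is where the speedup Tim reports (3 hours down to 10 minutes) comes from.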