On Fri, Jul 24, 2015 at 1:06 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 7/23/2015 10:55 AM, cbuxbaum wrote:
> > Say we have 1000000 party records.  Then the child SQL will be run
> 1000000
> > times (once for each party record).  Isn't there a way to just run the
> child
> > SQL on all of the party records at once with a join, using a GROUP BY and
> > ORDER BY on the PARTY_ID?  Then the results from that query could easily
> be
> > placed in SOLR according to the primary key (party_id).  Is there some
> part
> > of the Data Import Handler that operates that way?
>
> Using a well-crafted SQL JOIN is almost always going to be better for
> dataimport than nested entities.  The heavy lifting is done by the
> database server, using code that's extremely well-optimized for that
> kind of lifting.  Doing what you describe with a parent entity and one
> nested entity (that is not cached) will result in 1000001 total SQL
> queries.  A million SQL queries, no matter how fast each one is, will be
> slow.
>
> If you can do everything in a single SQL query with JOIN, then Solr will
> make exactly one SQL query to the server for a full-import.
>
> For my own dataimport, I use a view that was defined on the mysql server
> by the dbadmin.  The view does all the JOINs we require.
>
> Solr's dataimport handler doesn't have any intelligence to do the join
> locally.  It would be cool if it did, but somebody would have to write
> the code to teach it how.  Because the DB server itself can already do
> JOINs, and it can do them VERY well, there's really no reason to teach
> it to Solr.
>
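
For illustration, a minimal data-config sketch of the single-query approach
might look like the following (table and column names here are hypothetical;
substitute your own schema):

```xml
<!-- One entity whose query does the JOIN on the DB server;
     DIH issues exactly one SQL query for a full-import. -->
<entity name="party"
        query="SELECT p.party_id, p.name, a.city
               FROM party p
               JOIN address a ON a.party_id = p.party_id">
</entity>
```

As Shawn notes, the same JOIN can instead live in a database view, so the
DIH query stays a simple SELECT against that view.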

FWIW, DIH now has a join="zipper"
<https://issues.apache.org/jira/browse/SOLR-4799> attribute that can be
specified on a child entity; it enables the classic ETL external merge-join
algorithm.
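
A rough sketch of what that looks like (table and column names are
hypothetical; note that both the parent and child queries must be sorted by
the join key for the merge join to work):

```xml
<!-- Parent and child result sets are each streamed in party_id order
     and merged locally by DIH instead of one child query per parent row. -->
<entity name="party"
        query="SELECT party_id, name FROM party ORDER BY party_id">
  <entity name="address"
          join="zipper"
          query="SELECT party_id, city FROM address ORDER BY party_id"
          where="party_id=party.party_id"/>
</entity>
```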


> Thanks,
> Shawn
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
