On Fri, Jul 24, 2015 at 1:06 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 7/23/2015 10:55 AM, cbuxbaum wrote:
> > Say we have 1000000 party records. Then the child SQL will be run
> > 1000000 times (once for each party record). Isn't there a way to just
> > run the child SQL on all of the party records at once with a join,
> > using a GROUP BY and ORDER BY on the PARTY_ID? Then the results from
> > that query could easily be placed in SOLR according to the primary
> > key (party_id). Is there some part of the Data Import Handler that
> > operates that way?
>
> Using a well-crafted SQL JOIN is almost always going to be better for
> dataimport than nested entities. The heavy lifting is done by the
> database server, using code that's extremely well-optimized for that
> kind of lifting. Doing what you describe with a parent entity and one
> nested entity (that is not cached) will result in 1000001 total SQL
> queries. A million SQL queries, no matter how fast each one is, will
> be slow.
>
> If you can do everything in a single SQL query with JOIN, then Solr
> will make exactly one SQL query to the server for a full-import.
>
> For my own dataimport, I use a view that was defined on the mysql
> server by the dbadmin. The view does all the JOINs we require.
>
> Solr's dataimport handler doesn't have any intelligence to do the join
> locally. It would be cool if it did, but somebody would have to write
> the code to teach it how. Because the DB server itself can already do
> JOINs, and it can do them VERY well, there's really no reason to teach
> it to Solr.

fwiw, DIH now has a join="zipper"
<https://issues.apache.org/jira/browse/SOLR-4799> attribute which can be
specified on a child entity; it enables the classic ETL external
merge-join algorithm.

> Thanks,
> Shawn

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
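For anyone following along, here is a minimal data-config.xml sketch of what the zipper join mentioned above can look like. Table and column names (PARTY, PARTY_CHILD, PARTY_ID, etc.) are hypothetical, and the exact attributes should be checked against SOLR-4799 for your Solr version; the essential point is that both queries must return rows sorted by the join key, since the zipper streams through the two sorted result sets in lockstep:

```xml
<!-- Hypothetical DIH config illustrating join="zipper".
     All table/column names here are made up for illustration.
     Both SELECTs are ORDER BY the join key (PARTY_ID) -- the zipper
     join requires sorted input on both sides. -->
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="dbuser" password="dbpass"/>
  <document>
    <entity name="party"
            query="SELECT PARTY_ID, NAME FROM PARTY ORDER BY PARTY_ID">
      <!-- join="zipper": the child query runs once, and its sorted
           rows are merged against the sorted parent rows, instead of
           re-running the child SQL for every parent row -->
      <entity name="party_child"
              join="zipper"
              query="SELECT PARTY_ID, DETAIL FROM PARTY_CHILD ORDER BY PARTY_ID"
              where="PARTY_ID=party.PARTY_ID"/>
    </entity>
  </document>
</dataConfig>
```

With the plain nested-entity setup this import would issue 1000001 queries, as Shawn describes; with the zipper join it issues two, one per entity.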