Gustav,

DIH should give you the same results in both scenarios.  The performance 
trade-offs depend on your data.  In your case, it looks like there is a 1-to-1 
or many-to-1 relationship between "item" and "member", so use the SQL Join.  
You'll get all of your data in one query and you'll be using your rbdms for 
what it does best.

But in the case there was a 1-to-many relationship between "item" and "member", 
and especially if each "item" has several "member" rows, you might get better 
performance using the child entity setup.  Although by default DIH is going to 
do an "n+1" select on member.  For every row in item, it will issue a separate 
query to the db.  Also, DIH does not use prepared statements, so this might be 
a bad choice.  

To work around this, specify "cacheImpl='SortedMapBackedCache'" on the child 
entity (this is the same as using CachedSqlEntityProcessor instead of 
SqlEntityProcessor).  Do not include a "where" clause in this child entity.  
Instead, specify "cacheKey='memberId'" and "cacheLookup='item.memberId'".  DIH 
will now pull down your entire "member" table in 1 query and cache it in 
memory, then it can do fast hash joins against "item".

But if your "member" table is too big to fit into memory, then you need to use 
a disk-backed cache instead of SortedMapBackedCache.  For that, see 
https://issues.apache.org/jira/browse/SOLR-2948 and 
https://issues.apache.org/jira/browse/SOLR-2613 .

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Gustav [mailto:xbihy...@sharklasers.com] 
Sent: Thursday, April 25, 2013 9:17 AM
To: solr-user@lucene.apache.org
Subject: What is the difference between a Join Query and Embedded Entities in 
Solr DIH?

Hello guys, i saw this thread on stackoverflow, but still not satisfied with
the answers. 

I am trying to index data across multiple tables using Solr's Data Import
Handler. The official wiki on the DIH suggests using embedded entities to
link multiple tables like so:

<document>
    <entity name="item" pk="id" query="SELECT * FROM item">
        <entity name="member" pk="memberid" query="SELECT * FROM member
WHERE memberid='${item.memberid}'>
        </entity>
    </entity>
</document>

Another way that works is:

<document>
    <entity name="item" pk="id" query="SELECT * FROM item INNER JOIN member
ON item.memberid=member.memberid">
    </entity>
</document>

Are these two methods functionally different? Is there a performance
difference?

Another though would be that, if using join tables in MySQL, using the SQL
query method with multiple joins could cause multiple documents to be
indexed instead of one.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/What-is-the-difference-between-a-Join-Query-and-Embedded-Entities-in-Solr-DIH-tp4058923.html
Sent from the Solr - User mailing list archive at Nabble.com.


Reply via email to