Hi,

Thanks for the other ideas.

I worked around the problem using Paul Noble's idea, and it is working really
well for me right now. My full-import takes around 40 minutes and my
delta-import finishes in less than 10 seconds, because it runs every minute.
That configuration seems to be pretty much optimal for my setup.

Idea 1: 
Will try it out some time soon

Idea 2:
Tried that one, but it slows down the full-import in my case too. One table I
use the cache for has more rows than the root entity's table, so it has
multiple rows per root entity row. A cache that is built up during the import
does not help here, since cached rows are only used once. The only benefit I
get from the cache comes from the prefilling, which is a lot faster than
reading the rows on demand.

Idea 3:
This would probably also be slower than the current configuration, since the
prefilling takes around 2 minutes. With my delta-import currently running
every minute, that would not make sense.


Thanks and Regards

Constantin

-----Original Message-----
From: Dyer, James [mailto:james.d...@ingramcontent.com] 
Sent: Thursday, June 20, 2013 6:51 PM
To: solr-user@lucene.apache.org
Subject: RE: DataImportHandler: Problems with delta-import and 
CachedSqlEntityProcessor

Instead of specifying CachedSqlEntityProcessor, you can specify 
SqlEntityProcessor with "cacheImpl='SortedMapBackedCache'".  If you 
parameterize this to use "SortedMapBackedCache" for full updates but leave it 
blank for deltas, I think it will cache only on the full import.
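
A minimal, untested sketch of that first option, using the supplier entity from
the original message. The request parameter name "cacheImpl" is made up for
illustration, and whether a blank value really switches the cache off is an
assumption, not something I have verified:

```xml
<!-- Sketch only: cacheImpl comes from a (hypothetical) request parameter.
     Full import:  command=full-import&cacheImpl=SortedMapBackedCache
     Delta import: command=delta-import  (parameter omitted, resolves to blank) -->
<entity
        name="supplier"
        dataSource="ds2"
        query="SELECT * FROM supplier WHERE status=1"
        processor="SqlEntityProcessor"
        cacheImpl="${dataimporter.request.cacheImpl}"
        cacheKey="SUPPLIER_ID"
        cacheLookup="article.ARTICLE_SUPPLIER_ID">
</entity>
```

Note that without a cache nothing in this query restricts it to the current
article, so the uncached run would presumably also need a "where" clause on
SUPPLIER_ID.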

Another option is to parameterize the child queries with a "where" clause, so 
if it is creating a new cache with every row, the cache will only contain the 
data needed for that child row.
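
As a rough sketch of that second option (untested, and assuming SUPPLIER_ID is
compared as a string, hence the quotes around the variable), the supplier
entity could be narrowed like this:

```xml
<!-- Sketch only: the added condition limits each lookup to the current
     article's supplier, so a cache rebuilt per row stays small. -->
<entity
        name="supplier"
        dataSource="ds2"
        query="SELECT * FROM supplier
               WHERE status=1
                 AND SUPPLIER_ID='${article.ARTICLE_SUPPLIER_ID}'"
        processor="CachedSqlEntityProcessor"
        cacheKey="SUPPLIER_ID"
        cacheLookup="article.ARTICLE_SUPPLIER_ID">
</entity>
```

The trade-off is one child query per parent row, which is cheap for a small
delta but gives up the prefilled cache on a full import.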

A third option is to do your delta imports like described here:  
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
My experience is that this generally performs better than using the delta 
import feature anyhow.  The trick is on handling deletes, which will require 
its own entity and the $deleteDocById command.  See 
http://wiki.apache.org/solr/DataImportHandler#Special_Commands
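
A hedged sketch of that pattern applied to the article entity, following the
approach on the first wiki page. It is untested, the child entities are left
out, and the entity needed for deletes is not shown:

```xml
<!-- Sketch only: one query serves both modes.
     clean=true (the full-import default): first condition holds, all rows read.
     clean=false: only articles listed in articleHistory since the last run. -->
<entity
        name="article"
        dataSource="ds1"
        pk="myownid"
        query="SELECT * FROM article
               WHERE '${dataimporter.request.clean}' != 'false'
                  OR myownid IN (SELECT myownid FROM articleHistory
                                 WHERE modified_date &gt;
                                       '${dataimporter.last_index_time}')">
        <field column="myownid" name="id"/>
</entity>
```

The "delta" is then run as command=full-import&clean=false, and a regular full
import as command=full-import&clean=true.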

But these are all workarounds.  This sounds like a bug or some subtle 
configuration problem.  I looked through the JIRA issues and did not see 
anything like this reported yet, but if you're pretty sure you are doing 
everything correctly you may want to open a bug ticket.  Be sure to flag it as 
"contrib - DataImportHandler".

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Constantin Wolber [mailto:constantin.wol...@medicalcolumbus.de] 
Sent: Thursday, June 20, 2013 3:21 AM
To: solr-user@lucene.apache.org
Subject: DataImportHandler: Problems with delta-import and 
CachedSqlEntityProcessor

Hi,

I searched for a solution for quite some time but did not manage to find any 
real hints on how to fix it. 


I'm using Solr 4.3.0 (1477023 - simonw - 2013-04-29 15:10:12) running in a 
Tomcat 6 container.

My data import setup is basically the following:

Data-config.xml:

<entity
        name="article"
        dataSource="ds1"
        query="SELECT * FROM article"
        deltaQuery="SELECT myownid FROM articleHistory WHERE modified_date &gt; '${dih.last_index_time}'"
        deltaImportQuery="SELECT * FROM article WHERE myownid=${dih.delta.myownid}"
        pk="myownid">
        <field column="myownid" name="id"/>

        <entity
                name="supplier"
                dataSource="ds2"
                query="SELECT * FROM supplier WHERE status=1"
                processor="CachedSqlEntityProcessor"
                cacheKey="SUPPLIER_ID"
                cacheLookup="article.ARTICLE_SUPPLIER_ID">
        </entity>

        <entity
                name="attributes"
                dataSource="ds1"
                query="SELECT ARTICLE_ID,'Key:'+ATTRIBUTE_KEY+' Value:'+ATTRIBUTE_VALUE FROM attributes"
                cacheKey="ARTICLE_ID"
                cacheLookup="article.myownid"
                processor="CachedSqlEntityProcessor">
        </entity>               
</entity>


OK, now for the problem: 

At first I tried everything without the cache, but the full-import took a very 
long time because the attributes query is pretty slow compared to the rest. As 
a result I got a processing speed of around 150 documents/s.
After switching everything to the CachedSqlEntityProcessor, the full import 
ran at around 4000 documents/s.

So the full import is running quite fine. Now I wanted to use the delta import. 
I expected its ramp-up time to be about the same as for the full import, since 
the whole supplier and attributes tables have to be loaded into the cache 
first. But looking at the log file, the weird thing is that Solr seems to 
refresh the cache for every single document that is processed. So currently my 
delta-import is a lot slower than the full-import. I even tried adding the 
deltaImportQuery parameter to the entity, but it doesn't change the behavior 
at all (of course I know it is not supposed to change anything in the setup I 
run).

The following solutions would be possible in my opinion: 

1. Is there any way to tell the config to ignore the cache when running a delta 
import? That would already help, because we are talking about a maximum of 500 
documents changed in 15 minutes compared to over 5 million documents in total. 
2. Get Solr to not refresh the cache for every document. 

Best Regards

Constantin Wolber


