I have now learned more about the CachedSqlEntityProcessor, so in theory it should be done with the server connections, but if it were me, I'd verify that.

How big are the resultsets from those queries? SOLR has to put all the table data into RAM, which is very likely going to take considerably more memory than the raw size of the results. Do you see any evidence that perhaps your machine is swapping? If you're on a UNIX variant, run "vmstat 5" and look for nonzero values in the "si" and "so" fields. If you're on Windows, pull up task manager's Performance tab and see if the PF Usage parameter is larger than physical memory. If this is happening, you'd definitely see constant hard drive light blinking.

On 3/6/2010 10:20 AM, Shawn Heisey wrote:
At the 9+ hour mark, is your database server showing active
connections that are sending data, or is all the activity local to SOLR?

We have a 40 million row database in MySQL, with each row comprising
more than 80 fields.  I'm including the config from one of our
shards.  There are about 6.6 million rows in this shard, and it
indexes to a 16GB index (9.6GB of which is the .fdt file) in 2-3 hours
depending on how loaded the database server is at the time.  I once
indexed the all 40 million rows into a shard and that only took 11
hours to build a 91GB index.

The batchSize parameter is necessary to have the jdbc driver stream
the results instead of trying to cache them all before sending them to
the application.  The server doesn't have enough memory for that.

<dataConfig>
<dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              encoding="UTF-8"

url="jdbc:mysql://[SERVER]:3306/[SCHEMA]?zeroDateTimeBehavior=convertToNull"

              batchSize="-1"
              user="[REMOVED]"
              password="[REMOVED]"/>
<document>
<entity name="[TABLE]"
            query="select * from [TABLE] where (did mod 6) = 0">
</entity>
</document>
</dataConfig>

On 3/6/2010 9:36 AM, JavaGuy84 wrote:
Hi,

  I am facing performance issue in SOLR when indexing huge data.
Please find
below the stats,

<str name="Time Elapsed">8:57:17.334</str>
<str name="Total Requests made to DataSource">42778</str>
<str name="Total Rows Fetched">273725</str>
<str name="Total Documents Processed">42775</str>
<str name="Total Documents Skipped">0</str>

Indexing of 273725 rows is taking almost 9 hours. Please find below
my Data
config file

<dataConfig>
<dataSource driver="com.metamatrix.jdbc.MMDriver" url="jdbc:" />
<document name="doc">
<entity name="object"
        query="select objectuid as uid, objectid, objecttype,
objectname,
repositoryname, a.lastupdateddate from  MetaModel.POC.Object a,
MetaModel.POC.Repository b where a.repositoryid = b.repositoryid"
transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">

<field column="objectname" name="name"/>
<field column="uid" name="uid"/>
<field column="objectid" name="id"/>
<field column="objecttype" name="type"/>
<field column="repositoryname" name="repository"/>

<entity name="property" query="select ObjectUID,ObjectPropertyName as
name, ObjectPropertyValue as value from  MetaModel.POC.ObjectProperty"
processor="CachedSqlEntityProcessor" cacheKey="ObjectUID"
cacheLookup="object.uid"
transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">


<field column="value" name="${property.name}"/>

</entity>

<entity name="relationship_entity" query="select
OBJECT1uid,Object2name as
rname,Object2type as rtype,relationshiptype as rship,
b.RepositoryName as
rrepname from  MetaModel.POC.BinaryRelationShip a,
MetaModel.POC.Repository
b where a.Object2RepositoryId=b.repositoryId"
processor="CachedSqlEntityProcessor" cacheKey="OBJECT1uid"
cacheLookup="object.uid"
transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">

<field column="rship" name="relationship"/>
<field column="rname" name="related_name" />
<field column="rtype" name="related_type"/>
<field column="rrepname" name="repositoryname"/>
</entity>

</entity>
</document>


Time taken to directly query the database with the above mentioned SQL
statements,


select objectuid as uid, objectid, objecttype, objectname,
repositoryname,
a.lastupdateddate from  MetaModel.POC.Object a,
MetaModel.POC.Repository b
where a.repositoryid = b.repositoryid --->  3 minutes

select ObjectUID,ObjectPropertyName as name, ObjectPropertyValue as
value
from  MetaModel.POC.ObjectProperty -->  5 minutes


select OBJECT1uid,Object2name as rname,Object2type as
rtype,relationshiptype
as rship, b.RepositoryName as rrepname from
MetaModel.POC.BinaryRelationShip a,  MetaModel.POC.Repository b where
a.Object2RepositoryId=b.repositoryId" -->  3 seconds

As I am using CachedSqlEntityProcessor I assume that SOLR first
issues these
select statements (mentioned above first) and then it match based on
cacheKey (from caching), so SOLR should ideally take  (addition of time
taken to execute the above 3 queries + some time for doing filtering
based
on cacheKey ). But in my case its taking hours and hours for indexing.

Can someone please let me know if I am doing anything wrong which might
cause this issue?


Reply via email to