Shawn,

Thanks a lot for your response,

Yes, still the DB connection is active.. It is still fetching the data from
the DB. 

I am using Redhat MetaMatrix DB as backend and I am trying to find out the
parameter for setting the JDBC fetch size..

Do you think that this problem will be mostly due to fetch size?

Thanks,
Barani



Shawn Heisey-4 wrote:
> 
> At the 9+ hour mark, is your database server showing active connections 
> that are sending data, or is all the activity local to SOLR?
> 
> We have a 40 million row database in MySQL, with each row comprising 
> more than 80 fields.  I'm including the config from one of our shards.  
> There are about 6.6 million rows in this shard, and it indexes to a 16GB 
> index (9.6GB of which is the .fdt file) in 2-3 hours depending on how 
> loaded the database server is at the time.  I once indexed the all 40 
> million rows into a shard and that only took 11 hours to build a 91GB
> index.
> 
> The batchSize parameter is necessary to have the jdbc driver stream the 
> results instead of trying to cache them all before sending them to the 
> application.  The server doesn't have enough memory for that.
> 
> <dataConfig>
> <dataSource type="JdbcDataSource"
>                driver="com.mysql.jdbc.Driver"
>                encoding="UTF-8"
>                
> url="jdbc:mysql://[SERVER]:3306/[SCHEMA]?zeroDateTimeBehavior=convertToNull"
>                batchSize="-1"
>                user="[REMOVED]"
>                password="[REMOVED]"/>
> <document>
> <entity name="[TABLE]"
>              query="select * from [TABLE] where (did mod 6) = 0">
> </entity>
> </document>
> </dataConfig>
> 
> On 3/6/2010 9:36 AM, JavaGuy84 wrote:
>> Hi,
>>
>>   I am facing performance issue in SOLR when indexing huge data. Please
>> find
>> below the stats,
>>
>>    <str name="Time Elapsed">8:57:17.334</str>
>>    <str name="Total Requests made to DataSource">42778</str>
>>    <str name="Total Rows Fetched">273725</str>
>>    <str name="Total Documents Processed">42775</str>
>>    <str name="Total Documents Skipped">0</str>
>>
>> Indexing of 273725 rows is taking almost 9 hours. Please find below my
>> Data
>> config file
>>
>> <dataConfig>
>>      <dataSource driver="com.metamatrix.jdbc.MMDriver" url="jdbc:" />
>>        <document name="doc">
>>      <entity name="object"
>>              query="select objectuid as uid, objectid, objecttype, 
>> objectname,
>> repositoryname, a.lastupdateddate from  MetaModel.POC.Object a,
>> MetaModel.POC.Repository b where a.repositoryid = b.repositoryid"
>> transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">    
>>         
>>              <field column="objectname" name="name"/>
>>                      <field column="uid" name="uid"/>
>>                      <field column="objectid" name="id"/>
>>                      <field column="objecttype" name="type"/>
>>                      <field column="repositoryname" name="repository"/>
>>              
>>              <entity name="property" query="select 
>> ObjectUID,ObjectPropertyName as
>> name, ObjectPropertyValue as value from  MetaModel.POC.ObjectProperty"
>> processor="CachedSqlEntityProcessor" cacheKey="ObjectUID"
>> cacheLookup="object.uid"
>> transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
>>                                      
>>              <field column="value" name="${property.name}"/>
>>
>>              </entity>
>>              
>>              <entity name="relationship_entity" query="select 
>> OBJECT1uid,Object2name
>> as
>> rname,Object2type as rtype,relationshiptype as rship, b.RepositoryName as
>> rrepname from  MetaModel.POC.BinaryRelationShip a, 
>> MetaModel.POC.Repository
>> b where a.Object2RepositoryId=b.repositoryId"
>> processor="CachedSqlEntityProcessor" cacheKey="OBJECT1uid"
>> cacheLookup="object.uid"
>> transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
>>                      <field column="rship" name="relationship"/>
>>                      <field column="rname" name="related_name" />
>>                      <field column="rtype" name="related_type"/>
>>                      <field column="rrepname" name="repositoryname"/>
>>              </entity>
>>      
>>      </entity>
>>     </document>
>>
>>
>> Time taken to directly query the database with the above mentioned SQL
>> statements,
>>
>>
>> select objectuid as uid, objectid, objecttype, objectname,
>> repositoryname,
>> a.lastupdateddate from  MetaModel.POC.Object a,  MetaModel.POC.Repository
>> b
>> where a.repositoryid = b.repositoryid --->  3 minutes
>>
>> select ObjectUID,ObjectPropertyName as name, ObjectPropertyValue as value
>> from  MetaModel.POC.ObjectProperty -->  5 minutes
>>
>>
>> select OBJECT1uid,Object2name as rname,Object2type as
>> rtype,relationshiptype
>> as rship, b.RepositoryName as rrepname from
>> MetaModel.POC.BinaryRelationShip a,  MetaModel.POC.Repository b where
>> a.Object2RepositoryId=b.repositoryId" -->  3 seconds
>>
>> As I am using CachedSqlEntityProcessor I assume that SOLR first issues
>> these
>> select statements (mentioned above first) and then it match based on
>> cacheKey (from caching), so SOLR should ideally take  (addition of time
>> taken to execute the above 3 queries + some time for doing filtering
>> based
>> on cacheKey ). But in my case its taking hours and hours for indexing.
>>
>> Can someone please let me know if I am doing anything wrong which might
>> cause this issue?
>>    
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/SOLR-takes-more-than-9-hours-to-index-300000-rows-tp27805403p27806375.html
Sent from the Solr - User mailing list archive at Nabble.com.

Reply via email to