Shawn,

Thanks a lot for your response.

Yes, the DB connection is still active, and it is still fetching data from
the DB. I am using Red Hat MetaMatrix as the backend database, and I am
trying to find the parameter for setting the JDBC fetch size. Do you think
this problem is mostly due to the fetch size?

Thanks,
Barani


Shawn Heisey-4 wrote:
>
> At the 9+ hour mark, is your database server showing active connections
> that are sending data, or is all the activity local to SOLR?
>
> We have a 40 million row database in MySQL, with each row comprising
> more than 80 fields. I'm including the config from one of our shards.
> There are about 6.6 million rows in this shard, and it indexes to a 16GB
> index (9.6GB of which is the .fdt file) in 2-3 hours depending on how
> loaded the database server is at the time. I once indexed all 40
> million rows into a shard and that only took 11 hours to build a 91GB
> index.
>
> The batchSize parameter is necessary to have the jdbc driver stream the
> results instead of trying to cache them all before sending them to the
> application. The server doesn't have enough memory for that.
>
> <dataConfig>
>   <dataSource type="JdbcDataSource"
>     driver="com.mysql.jdbc.Driver"
>     encoding="UTF-8"
>     url="jdbc:mysql://[SERVER]:3306/[SCHEMA]?zeroDateTimeBehavior=convertToNull"
>     batchSize="-1"
>     user="[REMOVED]"
>     password="[REMOVED]"/>
>   <document>
>     <entity name="[TABLE]"
>       query="select * from [TABLE] where (did mod 6) = 0">
>     </entity>
>   </document>
> </dataConfig>
>
> On 3/6/2010 9:36 AM, JavaGuy84 wrote:
>> Hi,
>>
>> I am facing a performance issue in SOLR when indexing a large amount
>> of data. Please find below the stats,
>>
>> <str name="Time Elapsed">8:57:17.334</str>
>> <str name="Total Requests made to DataSource">42778</str>
>> <str name="Total Rows Fetched">273725</str>
>> <str name="Total Documents Processed">42775</str>
>> <str name="Total Documents Skipped">0</str>
>>
>> Indexing of 273725 rows is taking almost 9 hours.
>> Please find below my data config file
>>
>> <dataConfig>
>>   <dataSource driver="com.metamatrix.jdbc.MMDriver" url="jdbc:" />
>>   <document name="doc">
>>     <entity name="object"
>>       query="select objectuid as uid, objectid, objecttype, objectname,
>>         repositoryname, a.lastupdateddate from MetaModel.POC.Object a,
>>         MetaModel.POC.Repository b where a.repositoryid = b.repositoryid"
>>       transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
>>
>>       <field column="objectname" name="name"/>
>>       <field column="uid" name="uid"/>
>>       <field column="objectid" name="id"/>
>>       <field column="objecttype" name="type"/>
>>       <field column="repositoryname" name="repository"/>
>>
>>       <entity name="property"
>>         query="select ObjectUID,ObjectPropertyName as name,
>>           ObjectPropertyValue as value from MetaModel.POC.ObjectProperty"
>>         processor="CachedSqlEntityProcessor" cacheKey="ObjectUID"
>>         cacheLookup="object.uid"
>>         transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
>>
>>         <field column="value" name="${property.name}"/>
>>
>>       </entity>
>>
>>       <entity name="relationship_entity"
>>         query="select OBJECT1uid,Object2name as rname,Object2type as rtype,
>>           relationshiptype as rship, b.RepositoryName as rrepname from
>>           MetaModel.POC.BinaryRelationShip a, MetaModel.POC.Repository b
>>           where a.Object2RepositoryId=b.repositoryId"
>>         processor="CachedSqlEntityProcessor" cacheKey="OBJECT1uid"
>>         cacheLookup="object.uid"
>>         transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
>>         <field column="rship" name="relationship"/>
>>         <field column="rname" name="related_name" />
>>         <field column="rtype" name="related_type"/>
>>         <field column="rrepname" name="repositoryname"/>
>>       </entity>
>>
>>     </entity>
>>   </document>
>> </dataConfig>
>>
>>
>> Time taken to directly query the database with the above mentioned SQL
>> statements,
>>
>>
>> select objectuid as uid, objectid, objecttype, objectname,
>> repositoryname, a.lastupdateddate from
>> MetaModel.POC.Object a, MetaModel.POC.Repository b
>> where a.repositoryid = b.repositoryid ---> 3 minutes
>>
>> select ObjectUID,ObjectPropertyName as name, ObjectPropertyValue as value
>> from MetaModel.POC.ObjectProperty --> 5 minutes
>>
>>
>> select OBJECT1uid,Object2name as rname,Object2type as rtype,
>> relationshiptype as rship, b.RepositoryName as rrepname from
>> MetaModel.POC.BinaryRelationShip a, MetaModel.POC.Repository b where
>> a.Object2RepositoryId=b.repositoryId --> 3 seconds
>>
>> As I am using CachedSqlEntityProcessor, I assume that SOLR first issues
>> the select statements above and then matches rows based on cacheKey
>> (from the cache), so SOLR should ideally take the sum of the times
>> taken to execute the 3 queries above plus some time for filtering
>> based on cacheKey. But in my case it's taking hours and hours to index.
>>
>> Can someone please let me know if I am doing anything wrong which might
>> cause this issue?
>>
>
>

-- 
View this message in context: http://old.nabble.com/SOLR-takes-more-than-9-hours-to-index-300000-rows-tp27805403p27806375.html
Sent from the Solr - User mailing list archive at Nabble.com.
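
For the fetch size question at the top of this message, here is a minimal sketch of where the setting would go on the MetaMatrix data source from the original post. It assumes the DataImportHandler's JdbcDataSource hands its batchSize attribute to the JDBC driver as the statement fetch size. The value 1000 is purely illustrative, and nothing in this thread confirms that the MetaMatrix driver honors the fetch size hint. The url is left truncated exactly as it was posted.

```xml
<!-- Sketch only: batchSize="1000" is an assumed, untested value;
     url is kept truncated as in the original post. -->
<dataSource type="JdbcDataSource"
            driver="com.metamatrix.jdbc.MMDriver"
            url="jdbc:"
            batchSize="1000"/>
```

Note that batchSize="-1" in Shawn's config is a MySQL-specific case: JdbcDataSource turns it into a fetch size of Integer.MIN_VALUE, which the MySQL driver treats as a request to stream rows instead of buffering the whole result set. Other drivers may not interpret that value the same way.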