Shawn,
Thanks a lot for your response.
Yes, the DB connection is still active. It is still fetching data from
the DB.
I am using Red Hat MetaMatrix as the backend, and I am trying to find the
parameter for setting the JDBC fetch size.
Do you think this problem is mostly due to the fetch size?
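For reference, this is the change I am thinking of trying. It is only a
sketch: it assumes the MetaMatrix driver honors the standard JDBC fetch size
hint, which is what the batchSize attribute on JdbcDataSource passes to the
driver, and the value 1000 is an arbitrary guess rather than a tuned number.
The url is left elided, as in my original config.

```xml
<!-- Sketch only: batchSize is handed to the JDBC driver as its fetch size.
     1000 is an assumed starting value, not a recommendation. -->
<dataSource type="JdbcDataSource"
            driver="com.metamatrix.jdbc.MMDriver"
            url="jdbc:"
            batchSize="1000" />
```

As far as I understand, the batchSize="-1" in your config is a MySQL-specific
way to make that driver stream rows, so I do not expect the same value to
carry over to MetaMatrix.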
Thanks,
Barani
Shawn Heisey-4 wrote:
>
> At the 9+ hour mark, is your database server showing active connections
> that are sending data, or is all the activity local to SOLR?
>
> We have a 40 million row database in MySQL, with each row comprising
> more than 80 fields. I'm including the config from one of our shards.
> There are about 6.6 million rows in this shard, and it indexes to a 16GB
> index (9.6GB of which is the .fdt file) in 2-3 hours depending on how
> loaded the database server is at the time. I once indexed all 40
> million rows into a shard, and that took only 11 hours to build a 91GB
> index.
>
> The batchSize parameter is necessary to have the jdbc driver stream the
> results instead of trying to cache them all before sending them to the
> application. The server doesn't have enough memory for that.
>
> <dataConfig>
> <dataSource type="JdbcDataSource"
> driver="com.mysql.jdbc.Driver"
> encoding="UTF-8"
> url="jdbc:mysql://[SERVER]:3306/[SCHEMA]?zeroDateTimeBehavior=convertToNull"
> batchSize="-1"
> user="[REMOVED]"
> password="[REMOVED]"/>
> <document>
> <entity name="[TABLE]"
> query="select * from [TABLE] where (did mod 6) = 0">
> </entity>
> </document>
> </dataConfig>
>
> On 3/6/2010 9:36 AM, JavaGuy84 wrote:
>> Hi,
>>
>> I am facing a performance issue in SOLR when indexing a large amount of
>> data. Please find the stats below:
>>
>> <str name="Time Elapsed">8:57:17.334</str>
>> <str name="Total Requests made to DataSource">42778</str>
>> <str name="Total Rows Fetched">273725</str>
>> <str name="Total Documents Processed">42775</str>
>> <str name="Total Documents Skipped">0</str>
>>
>> Indexing 273725 rows is taking almost 9 hours. Please find my data
>> config file below:
>>
>> <dataConfig>
>> <dataSource driver="com.metamatrix.jdbc.MMDriver" url="jdbc:" />
>> <document name="doc">
>> <entity name="object"
>> query="select objectuid as uid, objectid, objecttype,
>> objectname,
>> repositoryname, a.lastupdateddate from MetaModel.POC.Object a,
>> MetaModel.POC.Repository b where a.repositoryid = b.repositoryid"
>> transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
>>
>> <field column="objectname" name="name"/>
>> <field column="uid" name="uid"/>
>> <field column="objectid" name="id"/>
>> <field column="objecttype" name="type"/>
>> <field column="repositoryname" name="repository"/>
>>
>> <entity name="property" query="select
>> ObjectUID,ObjectPropertyName as
>> name, ObjectPropertyValue as value from MetaModel.POC.ObjectProperty"
>> processor="CachedSqlEntityProcessor" cacheKey="ObjectUID"
>> cacheLookup="object.uid"
>> transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
>>
>> <field column="value" name="${property.name}"/>
>>
>> </entity>
>>
>> <entity name="relationship_entity" query="select
>> OBJECT1uid,Object2name
>> as
>> rname,Object2type as rtype,relationshiptype as rship, b.RepositoryName as
>> rrepname from MetaModel.POC.BinaryRelationShip a,
>> MetaModel.POC.Repository
>> b where a.Object2RepositoryId=b.repositoryId"
>> processor="CachedSqlEntityProcessor" cacheKey="OBJECT1uid"
>> cacheLookup="object.uid"
>> transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
>> <field column="rship" name="relationship"/>
>> <field column="rname" name="related_name" />
>> <field column="rtype" name="related_type"/>
>> <field column="rrepname" name="repositoryname"/>
>> </entity>
>>
>> </entity>
>> </document>
>> </dataConfig>
>>
>>
>> Time taken to query the database directly with the above-mentioned SQL
>> statements:
>>
>>
>> select objectuid as uid, objectid, objecttype, objectname, repositoryname,
>> a.lastupdateddate from MetaModel.POC.Object a, MetaModel.POC.Repository b
>> where a.repositoryid = b.repositoryid --> 3 minutes
>>
>> select ObjectUID,ObjectPropertyName as name, ObjectPropertyValue as value
>> from MetaModel.POC.ObjectProperty --> 5 minutes
>>
>>
>> select OBJECT1uid, Object2name as rname, Object2type as rtype,
>> relationshiptype as rship, b.RepositoryName as rrepname from
>> MetaModel.POC.BinaryRelationShip a, MetaModel.POC.Repository b where
>> a.Object2RepositoryId=b.repositoryId --> 3 seconds
>>
>> As I am using CachedSqlEntityProcessor, I assume that SOLR first issues
>> the select statements mentioned above and then matches rows based on
>> cacheKey (from the cache), so SOLR should ideally take the sum of the
>> time taken to execute the above 3 queries plus some time for filtering
>> based on cacheKey. But in my case it is taking hours and hours to index.
>>
>> Can someone please let me know if I am doing anything wrong which might
>> cause this issue?
>>
>
>
>
--
View this message in context:
http://old.nabble.com/SOLR-takes-more-than-9-hours-to-index-300000-rows-tp27805403p27806375.html
Sent from the Solr - User mailing list archive at Nabble.com.