On Wed, Dec 14, 2011 at 3:48 PM, Finotti Simone <tech...@yoox.com> wrote:
> Hello,
> I have a very large dataset (> 1 Mrecords) on the RDBMS which I want my Solr
> application to pull data from.
[...]
> It works, but it takes 1'38" to parse 100 records: it means 1 rec/s! That
> means that digesting the whole dataset would take 1 Ms (=> 12 days).

Depending on the size of the data that you are pulling from the
database, 1M records is not really that large a number. We were
indexing ~75GB of stored data from ~7 million records in about 9h,
including quite complicated transformers. I would imagine that there
is much room for improvement in your case also. Some notes on this:

* If you have servers to throw at the problem, and a sensible way to
  shard your RDBMS data, use parallel indexing to multiple Solr cores,
  maybe on multiple servers, followed by a merge. In our experience,
  given enough RAM and adequate provisioning of database servers,
  indexing speed scales linearly with the total number of cores.
* Replicate your database, manually if needed. Look at the load on a
  database server during the indexing process, and provision enough
  database servers to match the number of Solr indexing servers.
* This point is leading into flamewar territory, but consider
  switching databases. From our (admittedly non-rigorous)
  measurements, MySQL was at least a factor of 2-3 faster than
  MS-SQL with the same dataset.
* Look at cloud computing. If finances permit, one should be able to
  shrink indexing times to almost any desired level. E.g., for the
  dataset that we used, I have little doubt that we could have shrunk
  the time down to less than 1h, at an affordable cost, on Amazon EC2.
  Unfortunately, we have not yet had the opportunity to try this.

> The problem is that for each record in "fd", Solr makes three distinct SELECT
> on the other three tables. Of course, this is absolutely inefficient.
>
> Is there a way to have Solr loading every record in the four tables and join
> them when they are already loaded in memory?

For various reasons, we did not investigate this in depth, but you
could also look at Solr's CachedSqlEntityProcessor, which caches the
rows of a nested entity so that they are not fetched again with a
separate SELECT for every parent record.

Regards,
Gora
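
P.S. To make the "load everything, then join in memory" idea concrete,
here is a minimal Python sketch of the general technique. It is not
Solr's implementation of CachedSqlEntityProcessor, only an illustration
of the access pattern. The parent table name "fd" comes from the
question; the three child table names, all column names, and the
helper functions are hypothetical, and sqlite3 stands in for the real
RDBMS.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema: the parent table "fd" comes from the original question;
# the three child tables and all column names are made up for illustration.
CHILD_TABLES = ("child_a", "child_b", "child_c")


def build_demo_db(n_records=5):
    """Create a small in-memory database with one parent and three child tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fd (id INTEGER PRIMARY KEY, name TEXT)")
    for table in CHILD_TABLES:
        conn.execute(f"CREATE TABLE {table} (fd_id INTEGER, value TEXT)")
    for i in range(n_records):
        conn.execute("INSERT INTO fd VALUES (?, ?)", (i, f"record-{i}"))
        for table in CHILD_TABLES:
            conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (i, f"{table}-{i}"))
    return conn


def load_child(conn, table):
    """Read a whole child table with one SELECT and index its rows by fd_id."""
    cache = defaultdict(list)
    for fd_id, value in conn.execute(f"SELECT fd_id, value FROM {table}"):
        cache[fd_id].append(value)
    return cache


def joined_documents(conn):
    """Yield one dict per fd row, joining the child rows from memory."""
    # Three SELECTs in total for the children, independent of the row count.
    caches = {table: load_child(conn, table) for table in CHILD_TABLES}
    # One more SELECT for the parent table; the join is a dictionary lookup.
    for fd_id, name in conn.execute("SELECT id, name FROM fd ORDER BY id"):
        doc = {"id": fd_id, "name": name}
        for table, cache in caches.items():
            doc[table] = cache.get(fd_id, [])
        yield doc
```

With this pattern the database sees four SELECTs in total instead of
one plus three per "fd" record. The trade-off is memory: all three
child tables must fit in RAM on the indexing machine.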