Re: Large RDBMS dataset

Gora Mohanty Wed, 14 Dec 2011 07:40:22 -0800

On Wed, Dec 14, 2011 at 3:48 PM, Finotti Simone <tech...@yoox.com> wrote:
> Hello,
> I have a very large dataset (> 1 Mrecords) on the RDBMS which I want my Solr 
> application to pull data from.
[...]


> It works, but it takes 1'38" to parse 100 records: it means 1 rec/s! That 
> means that digesting the whole dataset would take 1 Ms (=> 12 days).

Depending on the size of the data that you are pulling from
the database, 1M records is not really that large a number.
We were doing ~75GB of stored data from ~7 million records
in about 9h, including quite complicated transformers. I would
imagine that there is much room for improvement in your case
also. Some notes on this:
* If you have servers to throw at the problem, and a sensible
  way to shard your RDBMS data, use parallel indexing to
  multiple Solr cores, maybe on multiple servers, followed by
  a merge. In our experience, given enough RAM and adequate
  provisioning of database servers, indexing speed scales linearly
  with the total no. of cores.
* Replicate your database, manually if needed. Look at the load
  on a database server during the indexing process, and provision
  enough database servers to match the no. of Solr indexing servers.
* This point is leading into flamewar territory, but consider switching
  databases. From our (admittedly non-rigorous) measurements,
  MySQL was at least a factor of 2-3 faster than MS-SQL, with the
  same dataset.
* Look at cloud computing. If finances permit, one should be able
  to shrink indexing times to almost any desired level. E.g., for the
  dataset that we used, I have little doubt that we could have shrunk
  the time down to less than 1h, at an affordable cost on Amazon EC2.
  Unfortunately, we have not yet had the opportunity to try this.
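
As a rough illustration of the sharding idea in the first point, here is a
minimal Python sketch that splits a numeric key range into contiguous chunks,
one per Solr core, and estimates the wall-clock time if indexing speed really
does scale linearly with the number of cores. The `id` column, the function
names, and the figure of 4 cores are assumptions made for the example, not
details of our setup or of your schema.

```python
def shard_ranges(min_id, max_id, n_shards):
    """Split the inclusive range [min_id, max_id] into contiguous chunks.

    Returns a list of (start, end) tuples, one per non-empty shard.
    """
    total = max_id - min_id + 1
    base, extra = divmod(total, n_shards)
    ranges = []
    start = min_id
    for i in range(n_shards):
        # The first `extra` shards take one additional record each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def estimated_hours(n_records, records_per_sec_per_core, n_cores):
    """Indexing time in hours, assuming speed scales linearly with cores."""
    return n_records / (records_per_sec_per_core * n_cores) / 3600.0


# Hypothetical example: 1M records keyed by a numeric `id`, spread over 4 cores.
for core, (lo, hi) in enumerate(shard_ranges(1, 1_000_000, 4)):
    print(f"core{core}: WHERE id BETWEEN {lo} AND {hi}")
```

Each core's import query would then carry its own range restriction, and the
resulting indexes would be merged afterwards. If the key is not numeric or is
unevenly distributed, some other partitioning (e.g., by date) would be needed.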

> The problem is that for each record in "fd", Solr makes three distinct SELECT 
> on the other three tables. Of course, this is absolutely inefficient.
>
> Is there a way to have Solr loading every record in the four tables and join 
> them when they are already loaded in memory?

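For various reasons, we did not investigate this in depth,
but you could also look at Solr's CachedSqlEntityProcessor,
which caches the rows of a sub-entity so that its query does
not have to be re-run for every parent record.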
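
To make the idea of an in-memory join concrete, here is a small Python sketch
of the general technique: each child table is read once into a dictionary
keyed by the parent's id, and every parent record is then completed by a
lookup instead of a SELECT. This only illustrates the principle and is not
how Solr implements it; apart from "fd", all table names, column names
(`id`, `fd_id`), and function names are made up for the example.

```python
from collections import defaultdict


def build_cache(rows, key):
    """Group rows (dicts) by the value of `key`, reading each row once."""
    cache = defaultdict(list)
    for row in rows:
        cache[row[key]].append(row)
    return cache


def join_in_memory(parent_rows, child_tables, parent_key="id", child_key="fd_id"):
    """Yield one document per parent row with its child rows attached.

    `child_tables` maps a table name to the full list of its rows, so each
    child table costs one query in total rather than one per parent record.
    """
    caches = {name: build_cache(rows, child_key)
              for name, rows in child_tables.items()}
    for parent in parent_rows:
        doc = dict(parent)
        for name, cache in caches.items():
            # Dictionary lookup replaces the per-record SELECT.
            doc[name] = cache.get(parent[parent_key], [])
        yield doc


# Hypothetical data standing in for "fd" and one of the other tables.
fd = [{"id": 1, "title": "a"}, {"id": 2, "title": "b"}]
children = {"colors": [{"fd_id": 1, "color": "red"},
                       {"fd_id": 1, "color": "blue"}]}
docs = list(join_in_memory(fd, children))
```

The trade-off is memory: the child tables must fit in RAM on the indexing
machine, which is worth checking before relying on this approach.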

Regards,
Gora
