On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> I was wondering which scenario (or combination) would be better for my
> application from the standpoint of performance, scalability, and high
> availability. Here is my application:
> 
> Suppose I am going to have more than 10m documents, and it grows every
> day (probably reaching more than 100m docs within a year). I want to
> use Solr as a tool for indexing these documents, but the problem is
> that I have some data fields that could change frequently (not too
> much, but they could change).

Choosing which database software to use to hold your data is a problem
with many possible solutions.  Everyone will have a different answer for
you.  Each solution has strengths and weaknesses, and in the end, only
you can really know what your requirements are.

> Scenarios:
> 
> 1- Using SolrCloud as the database for all data (even the data that
> could be changed).

If you choose to use Solr as a NoSQL data store, I would strongly recommend that
you have two Solr installs.  The first install would be purely for data
storage and would have no indexed fields.  If you can get machines with
enough RAM, it would also probably be preferable to use a single index
(or SolrCloud with one shard) for that install.  The other install would
be for searching.  Sharding would not be an issue on that index.  The
reason that I make this recommendation is that when you use Solr for
searching, you have to do a complete reindex if you change your search
schema.  It's difficult to reindex if the search index is also your
canonical data source.
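
To illustrate the idea, here is a minimal Python sketch of rebuilding
the search install from the storage install after a schema change.  It
only uses Solr's standard /select and /update HTTP endpoints through
the requests library; the host names, core names, "id" sort field, and
batch size are all made up, and a real reindex of 100m documents would
want cursor-based paging rather than start/rows:

import requests

# Hypothetical URLs -- adjust to your own installs.
STORAGE_SOLR = "http://storage-host:8983/solr/archive"  # stored-only fields, one shard
SEARCH_SOLR = "http://search-host:8983/solr/docs"       # indexed fields, sharded

def rebuild_search_index(batch_size=1000):
    """Copy every document from the storage install into the search install."""
    start = 0
    while True:
        # Page through the storage install.  For tens of millions of
        # documents, cursorMark paging would be preferable to start/rows.
        resp = requests.get(
            STORAGE_SOLR + "/select",
            params={"q": "*:*", "start": start, "rows": batch_size,
                    "sort": "id asc", "wt": "json"},
        )
        resp.raise_for_status()
        docs = resp.json()["response"]["docs"]
        if not docs:
            break
        # Drop Solr's internal _version_ field before re-adding the docs.
        for doc in docs:
            doc.pop("_version_", None)
        # Send the batch to the search install as a JSON array of documents.
        requests.post(
            SEARCH_SOLR + "/update",
            json=docs,
            headers={"Content-Type": "application/json"},
        ).raise_for_status()
        start += batch_size
    # Commit once at the end so the rebuilt index becomes visible.
    requests.get(SEARCH_SOLR + "/update", params={"commit": "true"}).raise_for_status()

if __name__ == "__main__":
    rebuild_search_index()

With that split, changing the search schema only means wiping the
search install and re-running a copy like this one; the storage install
never has to be touched.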

> 2- Using SolrCloud as the database for static data and using an RDBMS
> (such as Oracle) for storing dynamic fields.

I don't think it would be a good idea to have two canonical data
sources.  Pick one.  As already mentioned, Solr is better as a search
technology, serving up pointers to data in another data source, than as
a database.
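
To make that concrete, here is a small Python sketch of the "Solr
serves pointers" pattern: Solr returns only the IDs of matching
documents, and the canonical records are then fetched from the
relational database.  The host, core, table, and column names are
invented, pymysql merely stands in for whatever database driver you
actually use, and it assumes the Solr id field matches the database
primary key:

import requests
import pymysql  # assumed driver; any DB-API module works the same way

SOLR = "http://search-host:8983/solr/docs"   # hypothetical search install

def search(user_query):
    # Ask Solr only for matching document IDs, ranked by relevance.
    resp = requests.get(
        SOLR + "/select",
        params={"q": user_query, "fl": "id", "rows": 20, "wt": "json"},
    )
    resp.raise_for_status()
    ids = [doc["id"] for doc in resp.json()["response"]["docs"]]
    if not ids:
        return []

    # Fetch the canonical records from the database, which remains the
    # single source of truth.
    conn = pymysql.connect(host="db-host", user="app",
                           password="secret", db="docs")
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            placeholders = ",".join(["%s"] * len(ids))
            cur.execute(
                "SELECT id, title, body FROM documents WHERE id IN (%s)"
                % placeholders,
                ids,
            )
            rows = {row["id"]: row for row in cur.fetchall()}
    finally:
        conn.close()

    # Preserve Solr's relevance ordering.
    return [rows[i] for i in ids if i in rows]

The search index can then be rebuilt or resharded at any time without
risking the data, because Solr never holds anything you can't recreate
from the database.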

If you want to use RDBMS technology, why would you spend all that money
on Oracle?  Just use one of the free databases.  Our really large Solr
index comes from a database.  At one time that database was in Oracle.
When my employer purchased the company with that database, we thought we
were obtaining a full Oracle license.  It turns out we weren't.  It
would have cost about half a million dollars to buy that license, so we
switched to MySQL.

Since we made that move to MySQL, performance has actually been *better*.  The
source table for our data has 96 million rows right now, growing at a
rate of a few million per year.  This is completely in line with your
100 million document requirement.  For the massive table that feeds
Solr, we might switch to MongoDB, but that has not been decided yet.

Later we switched from EasyAsk to Solr, a move that has *also* given us
better performance.  Because both MySQL and Solr are free, we've
achieved a substantial cost savings.

> 3- Using the integration of SolrCloud and Hadoop (HDFS + MapReduce)
> for all data.

I have no experience with this technology, but I think that if you are
thinking about a database on HDFS, you're probably actually talking
about HBase, the Apache implementation of Google's BigTable.

Thanks,
Shawn
