On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> I was wondering which scenario (or the combination) would be better for my
> application. From the aspect of performance, scalability and high
> availability. Here is my application:
>
> Suppose I am going to have more than 10m documents and it grows every day.
> (probably in 1 years it reaches to more than 100m docs. I want to use Solr
> as tool for indexing these documents but the problem is I have some data
> fields that could change frequently. (not too much but it could change)

Choosing which database software to use to hold your data is a problem
with many possible solutions. Everyone will have a different answer for
you. Each solution has strengths and weaknesses, and in the end, only
you can really know what your requirements are.

> Scenarios:
>
> 1- Using SolrCloud as database for all data. (even the one that could be
> changed)

If you choose to use Solr as a NoSQL, I would strongly recommend that
you have two Solr installs. The first install would be purely for data
storage and would have no indexed fields. If you can get machines with
enough RAM, it would also probably be preferable to use a single index
(or SolrCloud with one shard) for that install. The other install would
be for searching. Sharding would not be an issue on that index. The
reason that I make this recommendation is that when you use Solr for
searching, you have to do a complete reindex if you change your search
schema. It's difficult to reindex if the search index is also your
canonical data source.

> 2- Using SolrCloud as database for static data and using RDBMS (such as
> oracle) for storing dynamic fields.

I don't think it would be a good idea to have two canonical data
sources. Pick one. As already mentioned, Solr is better as a search
technology, serving up pointers to data in another data source, than as
a database.

If you want to use RDBMS technology, why would you spend all that money
on Oracle? Just use one of the free databases. Our really large Solr
index comes from a database. At one time that database was in Oracle.
When my employer purchased the company with that database, we thought we
were obtaining a full Oracle license. It turns out we weren't. It would
have cost about half a million dollars to buy that license, so we
switched to MySQL.

Since making that move to MySQL, performance is actually *better*. The
source table for our data has 96 million rows right now, growing at a
rate of a few million per year.
This is completely in line with your 100 million document requirement.
For the massive table that feeds Solr, we might switch to MongoDB, but
that has not been decided yet. Later we switched from EasyAsk to Solr, a
move that has *also* given us better performance. Because both MySQL
and Solr are free, we've achieved a substantial cost savings.

> 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
> data.

I have no experience with this technology, but I think that if you are
thinking about a database on HDFS, you're probably actually talking
about HBase, the Apache implementation of Google's BigTable.

Thanks,
Shawn
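
P.S. To make the two-install idea from scenario 1 a bit more concrete,
here is a rough Python sketch of rebuilding the search install from the
storage install after a schema change. It is only an illustration under
several assumptions: the host names, the collection names "storage" and
"search", and the function names are all hypothetical; the uniqueKey
field is assumed to be called "id" and to be sortable on the storage
install (cursor paging needs that); and cursorMark requires Solr 4.7 or
later.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical URLs and collection names; adjust for your own installs.
STORAGE_URL = "http://storage-host:8983/solr/storage"
SEARCH_URL = "http://search-host:8983/solr/search"


def fetch_page(cursor, rows=1000):
    """Fetch one page of stored documents from the storage install.

    Uses Solr cursor paging (cursorMark), which needs a sort on the
    uniqueKey field; this assumes that field is "id" and is sortable.
    """
    params = urllib.parse.urlencode({
        "q": "*:*", "sort": "id asc", "rows": rows,
        "cursorMark": cursor, "wt": "json",
    })
    with urllib.request.urlopen(f"{STORAGE_URL}/select?{params}") as resp:
        body = json.load(resp)
    return body["response"]["docs"], body["nextCursorMark"]


def post_docs(docs):
    """Send a batch of documents to the search install as a JSON array."""
    req = urllib.request.Request(
        f"{SEARCH_URL}/update",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()


def rebuild_search_index(fetch=fetch_page, post=post_docs):
    """Copy every document from storage to search; return the count copied."""
    cursor, copied = "*", 0
    while True:
        docs, next_cursor = fetch(cursor)
        if docs:
            # Drop _version_ so the search install assigns its own value.
            post([{k: v for k, v in d.items() if k != "_version_"}
                  for d in docs])
            copied += len(docs)
        if next_cursor == cursor:  # cursor stopped moving: nothing left
            break
        cursor = next_cursor
    return copied
```

The fetch and post steps are passed in as parameters so the copy loop
can be tried against in-memory stand-ins before pointing it at real
servers. The sketch does not issue a commit; the search install would
still need one (or autoCommit) before the copied documents become
visible.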