Dear Shawn,

Hi, and thank you for your reply. Could you please tell me more about the performance and scalability of the solutions you mentioned? Suppose I have a SolrCloud cluster running on 4 different machines. Would it scale linearly if I added another 4 machines? I mean for the case where the number of documents grows from 10m to 100m.

Regards.
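P.S. To make the question concrete, the following is roughly the kind of layout I have in mind. The collection name, host names, and parameters are only placeholders for illustration, not our real setup:

    # create a collection spread across the first 4 nodes, one shard per machine
    http://node1:8983/solr/admin/collections?action=CREATE&name=docs&numShards=4&replicationFactor=1

    # after adding 4 more machines, add replicas of the existing shards on the new nodes ...
    http://node1:8983/solr/admin/collections?action=ADDREPLICA&collection=docs&shard=shard1&node=node5:8983_solr

    # ... or split a shard so the pieces can be spread over the new nodes
    http://node1:8983/solr/admin/collections?action=SPLITSHARD&collection=docs&shard=shard1

As far as I understand, adding replicas mainly helps query throughput while splitting shards reduces the per-shard document count, but please correct me if that is wrong.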
On Mon, May 26, 2014 at 8:30 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> > I was wondering which scenario (or the combination) would be better for my
> > application. From the aspect of performance, scalability and high
> > availability. Here is my application:
> >
> > Suppose I am going to have more than 10m documents and it grows every day.
> > (probably in 1 years it reaches to more than 100m docs. I want to use Solr
> > as tool for indexing these documents but the problem is I have some data
> > fields that could change frequently. (not too much but it could change)
>
> Choosing which database software to use to hold your data is a problem
> with many possible solutions. Everyone will have a different answer for
> you. Each solution has strengths and weaknesses, and in the end, only
> you can really know what your requirements are.
>
> > Scenarios:
> >
> > 1- Using SolrCloud as database for all data. (even the one that could be
> > changed)
>
> If you choose to use Solr as a NoSQL, I would strongly recommend that
> you have two Solr installs. The first install would be purely for data
> storage and would have no indexed fields. If you can get machines with
> enough RAM, it would also probably be preferable to use a single index
> (or SolrCloud with one shard) for that install. The other install would
> be for searching. Sharding would not be an issue on that index. The
> reason that I make this recommendation is that when you use Solr for
> searching, you have to do a complete reindex if you change your search
> schema. It's difficult to reindex if the search index is also your
> canonical data source.
>
> > 2- Using SolrCloud as database for static data and using RDBMS (such as
> > oracle) for storing dynamic fields.
>
> I don't think it would be a good idea to have two canonical data
> sources. Pick one. As already mentioned, Solr is better as a search
> technology, serving up pointers to data in another data source, than as
> a database.
>
> If you want to use RDBMS technology, why would you spend all that money
> on Oracle? Just use one of the free databases. Our really large Solr
> index comes from a database. At one time that database was in Oracle.
> When my employer purchased the company with that database, we thought we
> were obtaining a full Oracle license. It turns out we weren't. It
> would have cost about half a million dollars to buy that license, so we
> switched to MySQL.
>
> Since making that move to MySQL, performance is actually *better*. The
> source table for our data has 96 million rows right now, growing at a
> rate of a few million per year. This is completely in line with your
> 100 million document requirement. For the massive table that feeds
> Solr, we might switch to MongoDB, but that has not been decided yet.
>
> Later we switched from EasyAsk to Solr, a move that has *also* given us
> better performance. Because both MySQL and Solr are free, we've
> achieved a substantial cost savings.
>
> > 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all
> > data.
>
> I have no experience with this technology, but I think that if you are
> thinking about a database on HDFS, you're probably actually talking
> about HBase, the Apache implementation of Google's BigTable.
>
> Thanks,
> Shawn
>

-- 
A.Nazemian