What you haven't told us is where the data comes from or how fast it arrives; until you put some numbers to the problem, it's hard to decide.
I tend to prefer storing the data somewhere else (filesystem, whatever) and indexing to Solr when the data changes, even if that means re-indexing the entire corpus. I don't like moving to more complicated solutions until that proves untenable. Backup/restore solutions for filesystems, DBs, and the like are a very mature technology, so I rely on them first to store my original source. Then you can re-index at will.

So let's say your data comes in from some stream somewhere. I'd:

1> store it to the file system.
2> write a program to pull it off the file system and index it (a sketch below).
3> Your comment about MapReduceIndexerTool is germane. You can re-index all that data very quickly, and it'll find the files on your file system for you too! But I wouldn't even go there until I'd tried indexing my 10M docs straight with SolrJ or similar.

If you can index your 10M docs in 1 hour and, by extrapolation, your 100M docs in 10 hours, is that good enough? I don't know, it's your problem space after all ;). And is it acceptable not to see changes to the schema until tomorrow morning? If so, there's no need to get more complicated....
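To put a shape on 2>, here's a bare-bones SolrJ sketch. The URL, the field names, and the one-document-per-file layout are assumptions on my part; adjust to your setup:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class FsIndexer {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                List<SolrInputDocument> batch = new ArrayList<>();
                try (DirectoryStream<Path> files =
                         Files.newDirectoryStream(Paths.get("/data/incoming"))) {
                    for (Path file : files) {
                        SolrInputDocument doc = new SolrInputDocument();
                        // One doc per file; the file name doubles as the uniqueKey.
                        doc.addField("id", file.getFileName().toString());
                        doc.addField("content",
                            new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
                        batch.add(doc);
                        if (batch.size() >= 1000) {  // send in batches, not one at a time
                            solr.add(batch);
                            batch.clear();
                        }
                    }
                }
                if (!batch.isEmpty()) {
                    solr.add(batch);
                }
                solr.commit();  // one commit at the end, not per batch
            }
        }
    }

Batching the adds and committing once at the end matters; per-document commits will kill your throughput.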
Best,
Erick

On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey <s...@elyograg.org> wrote:
> On 5/26/2014 7:50 AM, Ali Nazemian wrote:
>> I was wondering which scenario (or which combination) would be better for my
>> application, from the aspect of performance, scalability, and high
>> availability. Here is my application:
>>
>> Suppose I am going to have more than 10M documents, and it grows every day.
>> (Probably in 1 year it reaches more than 100M docs.) I want to use Solr
>> as the tool for indexing these documents, but the problem is that I have some
>> data fields that could change frequently. (Not too often, but they could change.)
>
> Choosing which database software to use to hold your data is a problem
> with many possible solutions. Everyone will have a different answer for
> you. Each solution has strengths and weaknesses, and in the end, only
> you can really know what your requirements are.
>
>> Scenarios:
>>
>> 1- Using SolrCloud as the database for all data (even the data that could
>> change).
>
> If you choose to use Solr as a NoSQL database, I would strongly recommend that
> you have two Solr installs. The first install would be purely for data
> storage and would have no indexed fields. If you can get machines with
> enough RAM, it would also probably be preferable to use a single index
> (or SolrCloud with one shard) for that install. The other install would
> be for searching; sharding would not be an issue on that index. The
> reason I make this recommendation is that when you use Solr for
> searching, you have to do a complete reindex if you change your search
> schema, and it's difficult to reindex if the search index is also your
> canonical data source.
>
>> 2- Using SolrCloud as the database for static data and using an RDBMS (such as
>> Oracle) for storing the dynamic fields.
>
> I don't think it would be a good idea to have two canonical data
> sources. Pick one. As already mentioned, Solr is better as a search
> technology, serving up pointers to data in another data source, than as
> a database.
>
> If you want to use RDBMS technology, why would you spend all that money
> on Oracle? Just use one of the free databases. Our really large Solr
> index comes from a database. At one time that database was in Oracle.
> When my employer purchased the company with that database, we thought we
> were obtaining a full Oracle license. It turns out we weren't. It
> would have cost about half a million dollars to buy that license, so we
> switched to MySQL.
>
> Since making that move to MySQL, performance is actually *better*. The
> source table for our data has 96 million rows right now, growing at a
> rate of a few million per year. This is completely in line with your
> 100 million document requirement. For the massive table that feeds
> Solr, we might switch to MongoDB, but that has not been decided yet.
>
> Later we switched from EasyAsk to Solr, a move that has *also* given us
> better performance. Because both MySQL and Solr are free, we've
> achieved substantial cost savings.
>
>> 3- Using the integration of SolrCloud and Hadoop (HDFS + MapReduce) for all
>> data.
>
> I have no experience with this technology, but I think that if you are
> thinking about a database on HDFS, you're probably actually talking
> about HBase, the Apache implementation of Google's BigTable.
>
> Thanks,
> Shawn
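To put a rough shape on Shawn's two-install idea: when the search schema changes, a small SolrJ program can stream everything out of the stored-only "storage" install and re-add it to the search install. This is only a sketch; the collection names, the cursor-based paging, and the blanket field copy are my assumptions, not Shawn's actual setup:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class Reindexer {
        public static void main(String[] args) throws Exception {
            try (SolrClient storage = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/storage").build();
                 SolrClient search = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/search").build()) {

                SolrQuery q = new SolrQuery("*:*");
                q.setRows(1000);
                q.setSort("id", SolrQuery.ORDER.asc);  // cursors require a sort on the uniqueKey

                String cursor = CursorMarkParams.CURSOR_MARK_START;
                while (true) {
                    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                    QueryResponse rsp = storage.query(q);

                    for (SolrDocument src : rsp.getResults()) {
                        SolrInputDocument dst = new SolrInputDocument();
                        for (String field : src.getFieldNames()) {
                            if ("_version_".equals(field)) {
                                continue;  // skip Solr's internal versioning field
                            }
                            // Copy every stored field; the search-side schema
                            // decides what actually gets indexed.
                            dst.addField(field, src.getFieldValue(field));
                        }
                        search.add(dst);  // batching omitted for brevity
                    }

                    String next = rsp.getNextCursorMark();
                    if (next.equals(cursor)) {
                        break;  // the cursor stops advancing once we've seen everything
                    }
                    cursor = next;
                }
                search.commit();
            }
        }
    }

The point of the split is exactly this: the storage install never needs a schema change, so a program like the one above can rebuild the search install from scratch at any time.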