Dear Erick,

Thank you for your reply. Some parts of the documents come from the Nutch crawler and the other parts come from processing those documents. I really need indexing to be as fast as possible, and 10 hours of indexing is not acceptable for my application.

Regards.
On Mon, May 26, 2014 at 9:25 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> What you haven't told us is where the data comes from. But until
> you put some numbers to it, it's hard to decide.
>
> I tend to prefer storing the data somewhere else, filesystem, whatever,
> and indexing to Solr when data changes. Even if that means re-indexing
> the entire corpus. I don't like going to more complicated solutions until
> that proves untenable.
>
> Backup/restore solutions for filesystems, DBs, whatever are a very
> mature technology, so I rely on that first to store my original source.
>
> Now you can re-index at will.
>
> So let's say your data comes in from some stream somewhere. I'd:
> 1> store it to the file system.
> 2> write a program to pull it off the file system and index it.
> 3> Your comment about MapReduceIndexerTool is germane. You can re-index
> all that data very quickly, and it'll find the files on your file system
> for you too!
>
> But I wouldn't even go there until I'd tried indexing my 10M docs
> straight with SolrJ or similar. If you can index your 10M docs in 1 hour
> and, by extrapolation, your 100M docs in 10 hours, is that good enough?
> I don't know, it's your problem space after all ;). And is it acceptable
> to not see changes to the schema until tomorrow morning? If so, there's
> no need to get more complicated....
>
> Best,
> Erick
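For reference, a rough SolrJ sketch of Erick's steps 1> and 2> above: read the processed documents off the file system and send them to Solr in batches. The ZooKeeper address, collection name, field names, and input directory are placeholders for whatever the real setup uses, so treat this as a starting point rather than the exact program:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // Placeholders: real ZooKeeper ensemble and collection name go here.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("documents");

            // Placeholder input directory holding the processed crawl output.
            File dir = new File("/data/crawl/processed");
            File[] files = dir.listFiles();
            if (files == null) {
                throw new IllegalStateException("Input directory not found: " + dir);
            }

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (File f : files) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", f.getName());
                doc.addField("content",
                        new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8));
                batch.add(doc);

                // Send documents in batches instead of one HTTP request per document.
                if (batch.size() >= 1000) {
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            // Commit once at the end rather than after every add.
            server.commit();
            server.shutdown();
        }
    }

Batching the adds and committing only at the end keeps the indexing side cheap; if a single thread is still too slow, running several threads like this in parallel (or ConcurrentUpdateSolrServer against a single node) is the usual next step.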
> On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey <s...@elyograg.org> wrote:
> > On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> >> I was wondering which scenario (or combination) would be better for my
> >> application from the perspective of performance, scalability, and high
> >> availability. Here is my application:
> >>
> >> Suppose I am going to have more than 10m documents, and the number grows
> >> every day (probably reaching more than 100m docs in a year). I want to
> >> use Solr as a tool for indexing these documents, but the problem is I
> >> have some data fields that could change frequently. (not too much, but
> >> they could change)
> >
> > Choosing which database software to use to hold your data is a problem
> > with many possible solutions. Everyone will have a different answer for
> > you. Each solution has strengths and weaknesses, and in the end, only
> > you can really know what your requirements are.
> >
> >> Scenarios:
> >>
> >> 1- Using SolrCloud as a database for all data. (even the data that
> >> could be changed)
> >
> > If you choose to use Solr as a NoSQL database, I would strongly recommend
> > that you have two Solr installs. The first install would be purely for
> > data storage and would have no indexed fields. If you can get machines
> > with enough RAM, it would also probably be preferable to use a single
> > index (or SolrCloud with one shard) for that install. The other install
> > would be for searching. Sharding would not be an issue on that index.
> > The reason that I make this recommendation is that when you use Solr for
> > searching, you have to do a complete reindex if you change your search
> > schema. It's difficult to reindex if the search index is also your
> > canonical data source.
> >
> >> 2- Using SolrCloud as a database for static data and using an RDBMS
> >> (such as Oracle) for storing dynamic fields.
> >
> > I don't think it would be a good idea to have two canonical data
> > sources. Pick one. As already mentioned, Solr is better as a search
> > technology, serving up pointers to data in another data source, than as
> > a database.
> >
> > If you want to use RDBMS technology, why would you spend all that money
> > on Oracle? Just use one of the free databases. Our really large Solr
> > index comes from a database. At one time that database was in Oracle.
> > When my employer purchased the company with that database, we thought we
> > were obtaining a full Oracle license. It turns out we weren't. It
> > would have cost about half a million dollars to buy that license, so we
> > switched to MySQL.
> >
> > Since making that move to MySQL, performance is actually *better*. The
> > source table for our data has 96 million rows right now, growing at a
> > rate of a few million per year. This is completely in line with your
> > 100 million document requirement. For the massive table that feeds
> > Solr, we might switch to MongoDB, but that has not been decided yet.
> >
> > Later we switched from EasyAsk to Solr, a move that has *also* given us
> > better performance. Because both MySQL and Solr are free, we've
> > achieved substantial cost savings.
> >
> >> 3- Using the integration of SolrCloud and Hadoop (HDFS+MapReduce) for
> >> all data.
> >
> > I have no experience with this technology, but I think that if you are
> > thinking about a database on HDFS, you're probably actually talking
> > about HBase, the Apache implementation of Google's BigTable.
> >
> > Thanks,
> > Shawn

--
A.Nazemian