A few things would help here. Can you clarify what is acceptable in terms of indexing hours, and what the use case for the indexing is?
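Roughly, it looked like this (an untested sketch; the Solr URL, queue size, thread count, and field names are placeholders to adjust for your setup):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // Buffer up to 10000 docs and send them with 4 threads; tune both knobs.
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1", 10000, 4);

        for (int i = 0; i < 10000000; i++) {         // stand-in for your JDBC result-set loop
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i)); // hypothetical schema fields
            doc.addField("title", "Document " + i);
            server.add(doc);                         // buffered; sent in parallel batches
        }

        server.blockUntilFinished();                 // drain the queue
        server.commit();                             // one commit at the end, not per doc
        server.shutdown();
    }
}

The two things that matter most are sending documents in parallel batches and committing once at the end rather than per document.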
· Are you looking to re-index all data (say 100M docs) frequently, so that indexing hours need to be on the lower side (<10 or <5, etc.)? If so, how many hours would you consider reasonable?
· Or can you afford not to re-index all data, and add incremental indexing instead? (I am not sure how frequently your schema fields change, as you mentioned.)

Also, as Erick pointed out, using SolrJ with parallelism you can index quickly. We recently had a use case where we indexed around 10M docs from a database in less than half an hour, along the lines of the sketch below.

Thanks,
Susheel

-----Original Message-----
From: Ali Nazemian [mailto:alinazem...@gmail.com]
Sent: Monday, May 26, 2014 2:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Using SolrCloud with RDBMS or without

Dear Erick,
Thank you for your reply. Some parts of the documents come from the Nutch crawler, and the other parts come from processing those documents. I really need indexing to be as fast as possible, and 10 hours is not acceptable for my application.
Regards.

On Mon, May 26, 2014 at 9:25 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> What you haven't told us is where the data comes from. But until you
> put some numbers to it, it's hard to decide.
>
> I tend to prefer storing the data somewhere else (filesystem, whatever)
> and indexing to Solr when the data changes, even if that means
> re-indexing the entire corpus. I don't like going to more complicated
> solutions until that proves untenable.
>
> Backup/restore solutions for filesystems, DBs, and the like are a very
> mature technology; I rely on them first to store my original source.
>
> Now you can re-index at will.
>
> So let's say your data comes in from some stream somewhere. I'd:
> 1> Store it to the file system.
> 2> Write a program to pull it off the file system and index it (see
> the sketch after this list).
> 3> Your comment about MapReduceIndexerTool is germane. You can
> re-index all that data very quickly. And it'll find the files on your
> file system for you too!
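>
> For 2>, the skeleton is small. Something like this untested sketch
> (the ZooKeeper address, collection name, source directory, and field
> names are all placeholders):
>
> import java.io.File;
> import java.nio.charset.StandardCharsets;
> import java.nio.file.Files;
> import org.apache.solr.client.solrj.impl.CloudSolrServer;
> import org.apache.solr.common.SolrInputDocument;
>
> public class FsIndexer {
>     public static void main(String[] args) throws Exception {
>         // Placeholder ZooKeeper ensemble and collection name.
>         CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181/solr");
>         server.setDefaultCollection("collection1");
>
>         for (File f : new File("/data/docs").listFiles()) { // placeholder source dir
>             SolrInputDocument doc = new SolrInputDocument();
>             doc.addField("id", f.getName());
>             doc.addField("content", new String(
>                     Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8));
>             server.add(doc); // add() also accepts a collection of docs; batch for speed
>         }
>
>         server.commit();
>         server.shutdown();
>     }
> }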
> But I wouldn't even go there until I'd tried indexing my 10M docs
> straight with SolrJ or similar. If you can index your 10M docs in 1
> hour, and by extrapolation your 100M docs in 10 hours, is that good
> enough?
> I don't know; it's your problem space after all ;). And is it
> acceptable not to see changes to the schema until tomorrow morning?
> If so, there's no need to get more complicated....
>
> Best,
> Erick
>
> On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey <s...@elyograg.org> wrote:
> > On 5/26/2014 7:50 AM, Ali Nazemian wrote:
> >> I was wondering which scenario (or combination) would be better for
> >> my application from the standpoint of performance, scalability, and
> >> high availability. Here is my application:
> >>
> >> Suppose I am going to have more than 10M documents, and the
> >> collection grows every day (probably reaching more than 100M docs
> >> within a year). I want to use Solr as the tool for indexing these
> >> documents, but the problem is that I have some data fields that
> >> could change frequently. (Not too often, but they could change.)
> >
> > Choosing which database software to use to hold your data is a
> > problem with many possible solutions. Everyone will have a
> > different answer for you. Each solution has strengths and
> > weaknesses, and in the end, only you can really know what your
> > requirements are.
> >
> >> Scenarios:
> >>
> >> 1- Using SolrCloud as the database for all data (even the data that
> >> could be changed).
> >
> > If you choose to use Solr as a NoSQL store, I would strongly
> > recommend that you have two Solr installs. The first install would
> > be purely for data storage and would have no indexed fields. If you
> > can get machines with enough RAM, it would also probably be
> > preferable to use a single index (or SolrCloud with one shard) for
> > that install. The other install would be for searching; sharding
> > would not be an issue on that index. The reason I make this
> > recommendation is that when you use Solr for searching, you have to
> > do a complete reindex if you change your search schema. It's
> > difficult to reindex if the search index is also your canonical data
> > source.
> >
> >> 2- Using SolrCloud as the database for static data and an RDBMS
> >> (such as Oracle) for storing the dynamic fields.
> >
> > I don't think it would be a good idea to have two canonical data
> > sources. Pick one. As already mentioned, Solr is better as a
> > search technology, serving up pointers to data in another data
> > source, than as a database.
> >
> > If you want to use RDBMS technology, why would you spend all that
> > money on Oracle? Just use one of the free databases. Our really
> > large Solr index comes from a database. At one time that database
> > was in Oracle. When my employer purchased the company with that
> > database, we thought we were obtaining a full Oracle license. It
> > turns out we weren't. It would have cost about half a million
> > dollars to buy that license, so we switched to MySQL.
> >
> > Since making that move to MySQL, performance is actually *better*.
> > The source table for our data has 96 million rows right now, growing
> > at a rate of a few million per year. This is completely in line
> > with your 100 million document requirement. For the massive table
> > that feeds Solr, we might switch to MongoDB, but that has not been
> > decided yet.
> >
> > Later we switched from EasyAsk to Solr, a move that has *also* given
> > us better performance. Because both MySQL and Solr are free, we've
> > achieved substantial cost savings.
> >
> >> 3- Using the integration of SolrCloud and Hadoop (HDFS + MapReduce)
> >> for all data.
> >
> > I have no experience with this technology, but I think that if you
> > are considering a database on HDFS, you are probably actually
> > talking about HBase, the Apache implementation of Google's BigTable.
> >
> > Thanks,
> > Shawn

--
A.Nazemian