Hi,

speaking of ES, it is fair to mention that the number of shards has to be specified upfront, when the index is created. That is correct. However, an index can be given one or more aliases, which means that you can add new indices on the fly and give them the same alias, which is then used to search against. Given that you can add and remove indices, nodes and aliases on the fly, I think there is a way to handle a growing data set with ease. If anyone is interested, this scenario has been discussed in detail on the ES mailing list.
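To make the alias idea concrete, here is a minimal sketch in Python. It only builds the request for the ES `_aliases` endpoint and does not send it. The node address (localhost:9200), the index name `pages-2012-05`, the alias `pages` and both helper functions are assumptions made up for illustration, not something taken from a real setup.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: a local node on the default HTTP port


def alias_actions(alias, add=(), remove=()):
    """Build the body for the _aliases endpoint: one action per index."""
    actions = [{"remove": {"index": i, "alias": alias}} for i in remove]
    actions += [{"add": {"index": i, "alias": alias}} for i in add]
    return {"actions": actions}


def alias_request(body, base_url=ES_URL):
    """Prepare (but do not send) the POST that applies the alias actions."""
    data = json.dumps(body).encode("utf-8")
    return Request(base_url + "/_aliases", data=data,
                   headers={"Content-Type": "application/json"}, method="POST")


# Hypothetical names: a new index is created as the data grows and is then
# attached to the alias that searches already go through.
body = alias_actions("pages", add=["pages-2012-05"])
req = alias_request(body)
print(req.get_method(), req.full_url)
print(json.dumps(body))
```

Passing `req` to `urllib.request.urlopen` would apply the change on a running node. After that, a search against `pages` should cover every index attached to the alias, so clients do not need to know that a new index was added.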
Regards,
Lukas

On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:

> One of big weaknesses of Solr Cloud (and ES?) is the lack of the
> ability to redistribute shards across servers. Meaning, as a single
> shard grows too large, splitting the shard, while live updates.
>
> How do you plan on elastically adding more servers without this feature?
>
> Cassandra and HBase handle elasticity in their own ways. Cassandra
> has successfully implemented the Dynamo model and HBase uses the
> traditional BigTable 'split'. Both systems are complex though are at
> a singular level of maturity.
>
> Also Cassandra [successfully] implements multiple data center support,
> is that available in SC or ES?
>
> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
> > Hello Ali,
> >
> >> I'm trying to setup a large scale *Crawl + Index + Search* infrastructure
> >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
> >> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
> >> seconds.
> >
> > That's fine. Whether it's doable with any tech will depend on how much
> > hardware you give it, among other things.
> >
> >> Needless to mention, the search index needs to scale to 5Billion pages. It
> >> is also possible that I might need to store multiple indexes -- one for
> >> crawled content, and one for ancillary data that is also very large. Each
> >> of these indices would likely require a logically distributed and
> >> replicated index.
> >
> > Yup, OK.
> >
> >> However, I would like for such a system to be homogenous with the Hadoop
> >> infrastructure that is already installed on the cluster (for the crawl). In
> >> other words, I would much prefer if the replication and distribution of the
> >> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> >> using another scalability framework (such as SolrCloud).
> >> In addition, it would be ideal if this environment was flexible enough to
> >> be dynamically scaled based on the size requirements of the index and the
> >> search traffic at the time (i.e. if it is deployed on an Amazon cluster, it
> >> should be easy enough to automatically provision additional processing
> >> power into the cluster without requiring server re-starts).
> >
> > There is no such thing just yet.
> > There is no Search+Hadoop/HDFS in a box just yet. There was an attempt
> > to automatically index HBase content, but that was either not completed or
> > not committed into HBase.
> >
> >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> >> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> >> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
> >> mature enough and would be the right architectural choice to go along with
> >> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
> >> above.
> >
> > Here is a summary on all of them:
> > * Search on HBase - I assume you are referring to the same thing I
> >   mentioned above. Not ready.
> > * Solandra - uses Cassandra+Solr, plus DataStax now has a different
> >   (commercial) offering that combines search and Cassandra. Looks good.
> > * Lily - data stored in HBase cluster gets indexed to a separate Solr
> >   instance(s) on the side. Not really integrated the way you want it to be.
> > * ElasticSearch - solid at this point, the most dynamic solution today,
> >   can scale well (we are working on a maaaany-B documents index and hundreds
> >   of nodes with ElasticSearch right now), etc. But again, not integrated
> >   with Hadoop the way you want it.
> > * IndexTank - has some technical weaknesses, not integrated with Hadoop,
> >   not sure about its future considering LinkedIn uses Zoie and Sensei already.
> > * And there is SolrCloud, which is coming soon and will be solid, but is
> >   again not integrated.
> >
> > If I were you and I had to pick today - I'd pick ElasticSearch if I were
> > completely open. If I had Solr bias I'd give SolrCloud a try first.
> >
> >> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
> >> estimate my needing with this setup, for regular web-data (HTML text) at
> >> this scale?
> >
> > I don't know off the top of my head, but I'm guessing several hundred
> > for serving search requests.
> >
> > HTH,
> >
> > Otis
> > --
> > Search Analytics - http://sematext.com/search-analytics/index.html
> >
> > Scalable Performance Monitoring - http://sematext.com/spm/index.html
> >
> >> Any architectural guidance would be greatly appreciated. The more details
> >> provided, the wider my grin :).
> >>
> >> Many many thanks in advance.
> >>
> >> Thanks,
> >> Safdar