I'm curious how on-the-fly updates are handled as a new shard is added to an alias. E.g., how does the system know which shard to send an update to?
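For context on the routing side: as I understand it, ES picks the target shard deterministically from the document id (or an explicit routing value), so no lookup table is needed. A minimal sketch of that idea in Python - the hash function and names here are illustrative stand-ins, not ES internals:

```python
# Illustrative sketch of deterministic shard routing (NOT ES's actual
# hash function): the target shard depends only on the routing key and
# the fixed number of primary shards.
def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    # ES hashes the _routing value (defaulting to _id); a simple
    # stable string hash stands in for it here.
    h = sum(ord(c) * 31 ** i for i, c in enumerate(doc_id))
    return h % num_primary_shards

# The same document always routes to the same shard, which is also why
# the primary shard count can't change after index creation: a different
# count would send the same id to a different shard.
assert pick_shard("doc-42", 5) == pick_shard("doc-42", 5)
```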
On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček <lukas.vl...@gmail.com> wrote:
> Hi,
>
> speaking about ES I think it would be fair to mention that one has to
> specify the number of shards upfront when the index is created - that is
> correct. However, it is possible to give an index one or more aliases,
> which basically means that you can add new indices on the fly and give
> them the same alias, which is then used to search against. Given that you
> can add/remove indices, nodes, and aliases on the fly, I think there is a
> way to handle a growing data set with ease. If anyone is interested, such
> a scenario has been discussed in detail on the ES mailing list.
>
> Regards,
> Lukas
>
> On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> One of the big weaknesses of Solr Cloud (and ES?) is the lack of the
>> ability to redistribute shards across servers - meaning, as a single
>> shard grows too large, splitting the shard while taking live updates.
>>
>> How do you plan on elastically adding more servers without this feature?
>>
>> Cassandra and HBase handle elasticity in their own ways. Cassandra
>> has successfully implemented the Dynamo model and HBase uses the
>> traditional BigTable 'split'. Both systems are complex, though each is
>> at a singular level of maturity.
>>
>> Also, Cassandra [successfully] implements multiple data center support;
>> is that available in SC or ES?
>>
>> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
>> <otis_gospodne...@yahoo.com> wrote:
>> > Hello Ali,
>> >
>> >> I'm trying to set up a large-scale *Crawl + Index + Search*
>> >> infrastructure using Nutch and Solr/Lucene. The targeted scale is
>> >> *5 billion web pages*, crawled + indexed every *4 weeks*, with a
>> >> search latency of less than 0.5 seconds.
>> >
>> > That's fine. Whether it's doable with any tech will depend on how much
>> > hardware you give it, among other things.
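A rough sketch of the alias approach Lukas describes, using the ES REST API (the index and alias names are made up for illustration; a local ES node on the default port is assumed):

```shell
# Create a new time-based index and attach it to an existing alias, so
# searches against the alias transparently include the new index.
curl -XPUT 'http://localhost:9200/pages_2012_05'

curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "add": { "index": "pages_2012_05", "alias": "pages" } }
  ]
}'

# Search across every index currently behind the alias.
curl -XGET 'http://localhost:9200/pages/_search?q=hadoop'
```

Note that an update still has to be sent to a concrete index (e.g. `pages_2012_05`), not the alias, since an alias spanning several indices is ambiguous as a write target.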
>> >
>> >> Needless to mention, the search index needs to scale to 5 billion
>> >> pages. It is also possible that I might need to store multiple
>> >> indexes -- one for crawled content, and one for ancillary data that
>> >> is also very large. Each of these indices would likely require a
>> >> logically distributed and replicated index.
>> >
>> > Yup, OK.
>> >
>> >> However, I would like such a system to be homogeneous with the Hadoop
>> >> infrastructure that is already installed on the cluster (for the
>> >> crawl). In other words, I would much prefer that the replication and
>> >> distribution of the Solr/Lucene index be done automagically on top of
>> >> Hadoop/HDFS, instead of using another scalability framework (such as
>> >> SolrCloud). In addition, it would be ideal if this environment were
>> >> flexible enough to be dynamically scaled based on the size
>> >> requirements of the index and the search traffic at the time (i.e. if
>> >> it is deployed on an Amazon cluster, it should be easy enough to
>> >> automatically provision additional processing power into the cluster
>> >> without requiring server restarts).
>> >
>> > There is no such thing just yet - no Search+Hadoop/HDFS in a box.
>> > There was an attempt to automatically index HBase content, but that
>> > was either not completed or not committed into HBase.
>> >
>> >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem
>> >> would be ideal for this scenario. I've heard mention of Solr-on-HBase,
>> >> Solandra, Lily, ElasticSearch, IndexTank, etc., but I'm really unsure
>> >> which of these is mature enough and would be the right architectural
>> >> choice to go along with a Nutch crawler setup, and to also satisfy
>> >> the dynamic/auto-scaling aspects above.
>> >
>> > Here is a summary of all of them:
>> > * Search on HBase - I assume you are referring to the same thing I
>> > mentioned above. Not ready.
>> > * Solandra - uses Cassandra+Solr; DataStax also now has a different
>> > (commercial) offering that combines search and Cassandra. Looks good.
>> > * Lily - data stored in an HBase cluster gets indexed to separate Solr
>> > instance(s) on the side. Not really integrated the way you want it
>> > to be.
>> > * ElasticSearch - solid at this point, the most dynamic solution
>> > today, and can scale well (we are working on a maaaany-B documents
>> > index and hundreds of nodes with ElasticSearch right now). But again,
>> > not integrated with Hadoop the way you want it.
>> > * IndexTank - has some technical weaknesses and is not integrated
>> > with Hadoop; I'm not sure about its future, considering LinkedIn uses
>> > Zoie and Sensei already.
>> > * And there is SolrCloud, which is coming soon and will be solid, but
>> > is again not integrated.
>> >
>> > If I were you and had to pick today, I'd pick ElasticSearch if I were
>> > completely open. If I had a Solr bias, I'd give SolrCloud a try first.
>> >
>> >> Lastly, how much hardware (assuming a medium-sized EC2 instance)
>> >> would you estimate my needing with this setup, for regular web data
>> >> (HTML text) at this scale?
>> >
>> > I don't know off the top of my head, but I'm guessing several hundred
>> > for serving search requests.
>> >
>> > HTH,
>> >
>> > Otis
>> > --
>> > Search Analytics - http://sematext.com/search-analytics/index.html
>> > Scalable Performance Monitoring - http://sematext.com/spm/index.html
>> >
>> >> Any architectural guidance would be greatly appreciated. The more
>> >> details provided, the wider my grin :).
>> >>
>> >> Many many thanks in advance.
>> >>
>> >> Thanks,
>> >> Safdar