I'm curious how on-the-fly updates are handled as a new shard is added to an alias. E.g., how does the system know which shard to send an update to?
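For context on the routing side: as I understand it, ES picks the target shard deterministically from the document id (or an explicit routing value), so no lookup table is needed. A minimal sketch of that idea in Python - the hash function and names here are illustrative stand-ins, not ES internals:

```python
# Illustrative sketch of deterministic shard routing (NOT ES's actual
# hash function): the target shard depends only on the routing key and
# the fixed number of primary shards.
def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    # ES hashes the _routing value (defaulting to _id); a simple
    # stable string hash stands in for it here.
    h = sum(ord(c) * 31 ** i for i, c in enumerate(doc_id))
    return h % num_primary_shards

# The same document always routes to the same shard, which is also why
# the primary shard count can't change after index creation: a different
# count would send the same id to a different shard.
assert pick_shard("doc-42", 5) == pick_shard("doc-42", 5)
```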
On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček <lukas.vl...@gmail.com> wrote:
> Hi,
>
> speaking about ES I think it would be fair to mention that one has to
> specify the number of shards upfront when the index is created - that is
> correct. However, it is possible to give an index one or more aliases,
> which basically means that you can add new indices on the fly and give
> them the same alias, which is then used to search against. Given that you
> can add/remove indices, nodes, and aliases on the fly, I think there is a
> way to handle a growing data set with ease. If anyone is interested, such
> a scenario has been discussed in detail on the ES mailing list.
>
> Regards,
> Lukas
>
> On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> One of the big weaknesses of Solr Cloud (and ES?) is the lack of the
>> ability to redistribute shards across servers - meaning, as a single
>> shard grows too large, splitting the shard while taking live updates.
>>
>> How do you plan on elastically adding more servers without this feature?
>>
>> Cassandra and HBase handle elasticity in their own ways. Cassandra
>> has successfully implemented the Dynamo model and HBase uses the
>> traditional BigTable 'split'. Both systems are complex, though each is
>> at a singular level of maturity.
>>
>> Also, Cassandra [successfully] implements multiple data center support;
>> is that available in SC or ES?
>>
>> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
>> <otis_gospodne...@yahoo.com> wrote:
>> > Hello Ali,
>> >
>> >> I'm trying to set up a large-scale *Crawl + Index + Search*
>> >> infrastructure using Nutch and Solr/Lucene. The targeted scale is
>> >> *5 billion web pages*, crawled + indexed every *4 weeks*, with a
>> >> search latency of less than 0.5 seconds.
>> >
>> > That's fine. Whether it's doable with any tech will depend on how much
>> > hardware you give it, among other things.
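A rough sketch of the alias approach Lukas describes, using the ES REST API (the index and alias names are made up for illustration; a local ES node on the default port is assumed):

```shell
# Create a new time-based index and attach it to an existing alias, so
# searches against the alias transparently include the new index.
curl -XPUT 'http://localhost:9200/pages_2012_05'

curl -XPOST 'http://localhost:9200/_aliases' -d '{
  "actions": [
    { "add": { "index": "pages_2012_05", "alias": "pages" } }
  ]
}'

# Search across every index currently behind the alias.
curl -XGET 'http://localhost:9200/pages/_search?q=hadoop'
```

Note that an update still has to be sent to a concrete index (e.g. `pages_2012_05`), not the alias, since an alias spanning several indices is ambiguous as a write target.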
>> >
>> >> Needless to mention, the search index needs to scale to 5 billion
>> >> pages. It is also possible that I might need to store multiple
>> >> indexes -- one for crawled content, and one for ancillary data that
>> >> is also very large. Each of these indices would likely require a
>> >> logically distributed and replicated index.
>> >
>> > Yup, OK.
>> >
>> >> However, I would like such a system to be homogeneous with the Hadoop
>> >> infrastructure that is already installed on the cluster (for the
>> >> crawl). In other words, I would much prefer that the replication and
>> >> distribution of the Solr/Lucene index be done automagically on top of
>> >> Hadoop/HDFS, instead of using another scalability framework (such as
>> >> SolrCloud). In addition, it would be ideal if this environment were
>> >> flexible enough to be dynamically scaled based on the size
>> >> requirements of the index and the search traffic at the time (i.e. if
>> >> it is deployed on an Amazon cluster, it should be easy enough to
>> >> automatically provision additional processing power into the cluster
>> >> without requiring server restarts).
>> >
>> > There is no such thing just yet - no Search+Hadoop/HDFS in a box.
>> > There was an attempt to automatically index HBase content, but that
>> > was either not completed or not committed into HBase.
>> >
>> >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem
>> >> would be ideal for this scenario. I've heard mention of Solr-on-HBase,
>> >> Solandra, Lily, ElasticSearch, IndexTank, etc., but I'm really unsure
>> >> which of these is mature enough and would be the right architectural
>> >> choice to go along with a Nutch crawler setup, and to also satisfy
>> >> the dynamic/auto-scaling aspects above.
>> >
>> > Here is a summary of all of them:
>> > * Search on HBase - I assume you are referring to the same thing I
>> > mentioned above. Not ready.
>> > * Solandra - uses Cassandra+Solr; DataStax also now has a different
>> > (commercial) offering that combines search and Cassandra. Looks good.
>> > * Lily - data stored in an HBase cluster gets indexed to separate Solr
>> > instance(s) on the side. Not really integrated the way you want it
>> > to be.
>> > * ElasticSearch - solid at this point, the most dynamic solution
>> > today, and can scale well (we are working on a maaaany-B documents
>> > index and hundreds of nodes with ElasticSearch right now). But again,
>> > not integrated with Hadoop the way you want it.
>> > * IndexTank - has some technical weaknesses and is not integrated
>> > with Hadoop; I'm not sure about its future, considering LinkedIn uses
>> > Zoie and Sensei already.
>> > * And there is SolrCloud, which is coming soon and will be solid, but
>> > is again not integrated.
>> >
>> > If I were you and had to pick today, I'd pick ElasticSearch if I were
>> > completely open. If I had a Solr bias, I'd give SolrCloud a try first.
>> >
>> >> Lastly, how much hardware (assuming a medium-sized EC2 instance)
>> >> would you estimate my needing with this setup, for regular web data
>> >> (HTML text) at this scale?
>> >
>> > I don't know off the top of my head, but I'm guessing several hundred
>> > for serving search requests.
>> >
>> > HTH,
>> >
>> > Otis
>> > --
>> > Search Analytics - http://sematext.com/search-analytics/index.html
>> > Scalable Performance Monitoring - http://sematext.com/spm/index.html
>> >
>> >> Any architectural guidance would be greatly appreciated. The more
>> >> details provided, the wider my grin :).
>> >>
>> >> Many many thanks in advance.
>> >>
>> >> Thanks,
>> >> Safdar