Please see my responses inline:

On Fri, Apr 17, 2015 at 10:59 PM Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Some comments inline:
>
> On Sat, Apr 18, 2015 at 2:12 PM, gengmao <geng...@gmail.com> wrote:
>
> > On Sat, Apr 18, 2015 at 12:20 AM "Jürgen Wagner (DVT)" <
> > juergen.wag...@devoteam.com> wrote:
> >
> > > Replication on the storage layer will provide reliable storage for
> > > the index and other data of Solr. However, this replication does not
> > > guarantee that your index files are consistent at any given time, as
> > > there may be intermediate states that are only partially replicated.
> > > Replication is only a convergent process, not an instant, atomic
> > > operation. With frequent changes, this becomes an issue.
> > >
> > Firstly, thanks for your reply. However, I can't agree with you on
> > this. HDFS guarantees consistency even with replicas - you always read
> > what you write, and no partially replicated state is ever read, which
> > is guaranteed by the HDFS server and client. Hence HBase can rely on
> > HDFS for consistency and availability without implementing another
> > replication mechanism - if I understand correctly.
> >
> >
> A Lucene index is not one file but a collection of files which are
> written independently. So if you replicate them out of order, Lucene
> might consider the index corrupted (because of missing files). I don't
> think HBase works that way.
>
Again, HDFS replication is transparent to HBase. You can set the HDFS
replication factor to 1 and HBase will still work, but it will lose the
fault tolerance against disk failures that HDFS replicas provide. Also,
HBase doesn't directly utilize HDFS replicas: increasing the HDFS
replication factor won't improve HBase's scalability. To achieve better
read/write throughput, splitting shards is the only approach (a minimal
sketch follows below).
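
To make that concrete, here is a minimal Java sketch using the standard
Hadoop and HBase client APIs. The table name is hypothetical, and in a
real cluster dfs.replication would normally be set in the servers'
hdfs-site.xml; it is set on a client Configuration here only to illustrate
that replication is an HDFS knob, invisible to HBase:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseScalingSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Replication is a per-file HDFS property, transparent to HBase:
        // files written through this client get a replication factor of 1.
        // HBase keeps working; only HDFS-level fault tolerance is reduced.
        conf.setInt("dfs.replication", 1);

        // Extra replicas don't add HBase throughput. To spread read/write
        // load across region servers, you split a hot region instead.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            admin.split(TableName.valueOf("mytable")); // hypothetical table
        }
    }
}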


>
> >
> > > Replication inside SolrCloud as an application will not only
> > > maintain the consistency of the search-level interfaces to your
> > > indexes, but also scale in the sense of the application (query
> > > throughput).
> > >
> >  Splitting one shard into two can also increase query throughput.
> >
> >
> > > Imagine a database: if you change one record, this may also result
> > > in an index change. If the record and the index are stored in
> > > different storage blocks, one will get replicated first. However,
> > > the replication target will only be consistent again when both have
> > > been replicated. So, you would have to suspend all accesses until
> > > the entire replication has completed. That's undesirable. If you
> > > replicate on the application (database management system) level, the
> > > application will employ a more fine-grained approach to replication,
> > > guaranteeing application consistency.
> > >
> > In HBase, a region is located on a single region server at any time,
> > which guarantees its consistency. Because your reads and writes always
> > land in one region, you don't have to worry about parallel writes
> > happening on multiple replicas of the same region.
> > The replication of HDFS is totally transparent to HBase. When an HDFS
> > write call returns, HBase knows the data is written and replicated, so
> > losing one copy of the data won't impact HBase at all.
> > So HDFS means consistency and reliability for HBase. However, HBase
> > doesn't use replicas (either its own or HDFS's) to scale reads. If one
> > region is too "hot" for reads or writes, you split that region into
> > two, so that the reads and writes of that region can be distributed
> > across two region servers. Hence HBase scales.
> > I think this is the simplicity and beauty of HBase. Again, I am curious
> > whether SolrCloud has a better reason to use replication on HDFS. As I
> > described, HDFS provides consistency and reliability, while scalability
> > can be achieved via sharding, even without Solr replication.
> >
> >
> That's something that has been considered and may even be on the roadmap
> for the Cloudera guys. See https://issues.apache.org/jira/browse/SOLR-6237
>
> But one problem that isn't solved by HDFS replication is near-real-time
> indexing, where you want documents to be available to searchers as fast
> as possible. SolrCloud replication supports that by replicating documents
> as they come in and indexing them in several replicas. A new index
> searcher is opened on the flushed index files as well as on the internal
> data structures of the index writer. If we switched to relying on HDFS
> replication, this would be awfully expensive. However, as Jürgen
> mentioned, HDFS can certainly help with replicating static indexes.
>
My understanding is that near-real-time indexing does not need to rely on
replication.
https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching
describes "soft commit" but doesn't mention replication. Also, Cloudera
Search, which is Solr on HDFS, claims near-real-time indexing yet doesn't
mention replication either (a minimal soft-commit sketch follows the quote
below). Quote from
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_introducing.html
:
"In a near-real-time indexing use case, Cloudera Search indexes events that
are streamed through Apache Flume on their way into storage in CDH. Fields
and events are mapped to standard Solr indexable schemas. Lucene indexes
events, and the integration through Cloudera Search allows the index to be
directly written and stored in standard Lucene index files in HDFS. Flume’s
capabilities to route events and have data stored in partitions in HDFS can
also be applied. Events can be routed and streamed through multiple Flume
agents and written to separate Lucene indexers that can write into separate
index shards, for better scale when indexing and quicker responses when
searching. The indexes are loaded from HDFS to Solr cores, exactly like
Solr would have read from local disk. The difference in the design of
Cloudera Search is the robust, distributed, and scalable storage layer of
HDFS, which helps eliminate costly downtime and allows for flexibility
across workloads without having to move data"
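
For what it's worth, a soft commit is just a flag on the commit call and
involves no replication machinery. A minimal SolrJ sketch, assuming a
hypothetical single-node core at localhost:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SoftCommitSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL, purely for illustration.
        HttpSolrClient solr =
            new HttpSolrClient("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "near-real-time example");
        solr.add(doc);

        // Soft commit: opens a new searcher so the document becomes
        // visible to queries without fsync'ing segments to stable storage.
        // Signature: commit(waitFlush, waitSearcher, softCommit)
        solr.commit(false, true, true);

        solr.close();
    }
}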


> >
> > > Consequently, HDFS will allow you to scale storage and possibly even
> > > replicate static indexes that won't change, but it won't help much with
> > > live index replication. That's where SolrCloud jumps in.
> > >
> >
> > > Cheers,
> > > --Jürgen
> > >
> > >
> > > On 18.04.2015 08:44, gengmao wrote:
> > >
> > > I wonder why we need to use SolrCloud replication on HDFS at all,
> > > given that HDFS already provides replication and availability. The
> > > way to optimize performance and scalability should be tweaking
> > > shards, just like tweaking regions on HBase - which doesn't provide
> > > "region replication" either, does it?
> > >
> > > I have had this question for a while and haven't found a clear
> > > answer to it. Could some experts please explain a bit?
> > >
> > > Best regards,
> > > Mao Geng
> > >
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
