[ https://issues.apache.org/jira/browse/SOLR-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253596#comment-17253596 ]
Kevin Risden commented on SOLR-15051:
-------------------------------------

I can't add a comment to the design doc, but I wanted to address some potentially misleading statements about the Solr HDFS integration.

{quote}Has an unfortunate search performance penalty. TODO ___ %. Some indexing penalty too: ___ %.{quote}

There will be a performance penalty here coming from remote storage, and I don't think it is completely avoidable. The biggest issue is on the indexing side, where we need to ensure that documents are reliably written, and that is not fast on remote storage.

{quote}The implementation relies on a “BlockCache”, which means running Solr with large Java heaps.{quote}

The BlockCache is typically off heap, using Java direct memory, so it shouldn't require a large Java heap.

{quote}It’s not a generalized shared storage scheme; it’s HDFS specific. It’s possible to plug in S3 and Alluxio to this but there is overhead. HDFS is rather complex to operate, whereas say S3 is provided by cloud hosting providers natively.{quote}

I'm not sure I understand this statement. There are a few parts to Hadoop. HDFS is the storage layer, and it can be complex to operate. The more interesting part is the Hadoop filesystem interface, which is a semi-generic adapter between the HDFS API and other storage backends (S3, ABFS, GCS, etc.). The two pieces are separate and don't require each other to operate. The Hadoop filesystem interface provides the abstraction needed to go from the local filesystem to a lot of other cloud provider storage mechanisms. There may be some overhead there, but a lot of work has gone into performance over the past 1-2 years, since there has been a push for cloud storage support.
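To make the filesystem-interface point concrete, here is a minimal Java sketch (not from the ticket): the same code can target different backends purely by changing the URI scheme, provided the matching connector and credentials are configured. The URIs, bucket/host names, and file name below are placeholders.

{code:java}
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsAbstractionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Any of these URIs works with the same code below, as long as the matching
    // connector (hadoop-hdfs, hadoop-aws, hadoop-azure, gcs-connector) is on the
    // classpath and credentials are configured:
    //   hdfs://namenode:8020/solr/index
    //   s3a://my-bucket/solr/index
    //   abfs://container@account.dfs.core.windows.net/solr/index
    //   gs://my-bucket/solr/index
    URI storage = URI.create(args.length > 0 ? args[0] : "hdfs://namenode:8020/solr/index");

    try (FileSystem fs = FileSystem.get(storage, conf)) {
      Path file = new Path(new Path(storage), "example.txt");

      // Write through the generic FileSystem interface.
      try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
        out.write("example payload".getBytes(StandardCharsets.UTF_8));
      }

      // Read it back the same way, regardless of which backend is behind the URI.
      try (FSDataInputStream in = fs.open(file)) {
        byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
        in.readFully(buf);
        System.out.println(new String(buf, StandardCharsets.UTF_8));
      }
    }
  }
}
{code}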
> Shared storage -- BlobDirectory (de-duping)
> -------------------------------------------
>
> Key: SOLR-15051
> URL: https://issues.apache.org/jira/browse/SOLR-15051
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
>
> This proposal is a way to accomplish shared storage in SolrCloud with a few
> key characteristics: (A) using a Directory implementation, (B) delegates to a
> backing local file Directory as a kind of read/write cache, (C) replicas have
> their own "space", (D) de-duplication across replicas via reference counting,
> (E) uses ZK but separately from SolrCloud stuff.
> The Directory abstraction is a good one, and helps isolate shared storage
> from the rest of SolrCloud, which doesn't care. Using a backing normal file
> Directory is faster for reads and is simpler than Solr's HDFSDirectory's
> BlockCache. Replicas having their own space solves the problem of multiple
> writers (e.g. of the same shard) trying to own and write to the same space,
> and it implies that any of Solr's replica types can be used along with what
> goes along with them, like peer-to-peer replication (sometimes faster/cheaper
> than pulling from shared storage). A de-duplication feature solves needless
> duplication of files across replicas and from parent shards (i.e. from shard
> splitting). The de-duplication feature requires a place to cache directory
> listings so that they can be shared across replicas and atomically updated;
> this is handled via ZooKeeper.
> Finally, some sort of Solr daemon / auto-scaling code should be added to
> implement "autoAddReplicas", especially to provide for a scenario where the
> leader is gone and can't be replicated from directly but we can access
> shared storage.
> For more about shared storage concepts, consider looking at the description
> in SOLR-13101 and the linked Google Doc.
> *[PROPOSAL DOC|https://docs.google.com/document/d/1kjQPK80sLiZJyRjek_Edhokfc5q9S3ISvFRM2_YeL8M/edit?usp=sharing]*
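To illustrate characteristic (B) from the description above, here is a rough, hypothetical sketch built on Lucene's FilterDirectory: a local FSDirectory acts as the read/write cache, files are pushed to shared storage on sync and pulled on a cache miss. The BlobStore interface and the class name are invented for illustration only; this is not the proposed BlobDirectory code.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;

import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

/** Hypothetical stand-in for whatever shared storage is plugged in (S3, HDFS, ...). */
interface BlobStore {
  void upload(String name, Path localFile) throws IOException;
  InputStream download(String name) throws IOException;
  boolean exists(String name) throws IOException;
}

/** Illustrative only: a Directory that caches locally and mirrors to a blob store. */
public class LocalCachingBlobDirectory extends FilterDirectory {
  private final Path localPath;
  private final BlobStore blobStore;

  public LocalCachingBlobDirectory(Path localPath, BlobStore blobStore) throws IOException {
    super(FSDirectory.open(localPath)); // local Directory acts as the read/write cache
    this.localPath = localPath;
    this.blobStore = blobStore;
  }

  @Override
  public void sync(Collection<String> names) throws IOException {
    in.sync(names); // make the files durable locally first
    for (String name : names) {
      blobStore.upload(name, localPath.resolve(name)); // then push them to shared storage
    }
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    Path local = localPath.resolve(name);
    if (!Files.exists(local) && blobStore.exists(name)) {
      // Cache miss: pull the file down from shared storage before delegating.
      try (InputStream stream = blobStore.download(name)) {
        Files.copy(stream, local);
      }
    }
    return in.openInput(name, context);
  }
}
{code}

The actual proposal also covers per-replica spaces and reference-counted de-duplication of directory listings via ZooKeeper, which this sketch does not attempt to show.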