[
https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15045576#comment-15045576
]
Allen Wittenauer commented on HADOOP-12620:
-------------------------------------------
Given that every update to this JIRA is going to send the body of the
description out, can we move the majority of the body out to a comment or an
attachment? Thanks.
> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>
> Key: HADOOP-12620
> URL: https://issues.apache.org/jira/browse/HADOOP-12620
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Dinesh S. Atreya
>
> h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)
> One main motivation for this JIRA is to address a comprehensive set of use
> cases with minimal enhancements to Hadoop, transitioning Hadoop from a
> Modern Data Architecture to an Advanced/Cloud Data Architecture.
> HDFS traditionally had a write-once-read-many access model for files until
> the introduction of the “Append to files in HDFS” capability. The next
> minimal enhancements to core Hadoop add the capability to perform
> “updates-in-place” in HDFS:
> • Support seeks for writes (in addition to reads).
> • After a seek, if the new byte length is the same as the old byte length,
> an in-place update is allowed.
> • A delete is an update with an appropriate delete marker.
> • If the byte length is different, the old entry is marked as deleted and
> the new one is appended as before.
> • It is at the client’s discretion to perform an update, an append, or
> both; the API changes in the different Hadoop components should provide
> these capabilities.
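The rules above can be sketched in miniature. The seek-for-write HDFS API is the proposed enhancement and does not exist yet; here java.io.RandomAccessFile on a local file stands in for an HDFS stream purely to illustrate the intended semantics:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the proposed update-in-place rules. RandomAccessFile is a
// stand-in for an HDFS file with seek-for-write; the HDFS capability
// itself is hypothetical.
public class InPlaceUpdateSketch {

    // After a seek, an in-place update is allowed only when the new bytes
    // are the same length as the bytes they replace.
    static void updateInPlace(RandomAccessFile f, long offset, byte[] newBytes,
                              int oldLength) throws IOException {
        if (newBytes.length != oldLength) {
            // Different length: the old record would instead be marked as
            // deleted and the new record appended, as HDFS allows today.
            throw new IllegalArgumentException(
                "length changed; mark-delete and append instead");
        }
        f.seek(offset);
        f.write(newBytes);
    }

    // Small end-to-end demo: overwrite "world" with "earth" (same length).
    public static String demo() {
        try {
            Path p = Files.createTempFile("aha-sketch", ".dat");
            Files.write(p, "hello world".getBytes(StandardCharsets.UTF_8));
            try (RandomAccessFile f = new RandomAccessFile(p.toFile(), "rw")) {
                updateInPlace(f, 6, "earth".getBytes(StandardCharsets.UTF_8), 5);
            }
            String out = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            Files.delete(p);
            return out;
        } catch (IOException e) {
            return null;
        }
    }
}
```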
> These minimal changes will enable laying the basis for transforming the core
> Hadoop to an interactive and real-time platform and introducing significant
> native capabilities to Hadoop. These enhancements will lay a foundation for
> all of the following processing styles to be supported natively and
> dynamically.
> • Real time
> • Mini-batch
> • Stream based data processing
> • Batch – which is the default now.
> Hadoop engines can dynamically choose processing style to use based on the
> type of data and volume of data sets and enhance/replace prevailing
> approaches.
> With this, Hadoop engines can evolve to utilize modern CPU, memory, and I/O
> resources with increasing efficiency. The Hadoop task engines can use
> vectorized/pipelined processing and make greater use of memory throughout
> the Hadoop platform.
> These will enable enhanced performance optimizations to be implemented in
> HDFS and made available to all the Hadoop components. This will enable fast
> processing of big data and enhance all three of its characteristics:
> volume, velocity, and variety.
> There are many influences for this umbrella JIRA:
> • Preserve and Accelerate Hadoop
> • Efficient Data Management of variety of Data Formats natively in Hadoop
> • Enterprise Expansion
> • Internet and Media
> • Databases offer native support for a variety of data formats such as
> JSON and XML, plus indexes, temporal data, etc. – Hadoop should do the same.
> It is likely that many sub-JIRAs will be created to address portions of
> this. This JIRA captures a variety of use-cases in one place.
> Some initial data management/platform use-cases are given hereunder:
> h2. Key-Value Store
> With the proposed enhancements, it will become very convenient to implement
> Key-Value Store natively in Hadoop.
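As one illustration of why the proposed capability makes this convenient, here is a small hypothetical sketch of a key-value store over an updatable log. A byte array stands in for an HDFS file; all names are illustrative, not an existing Hadoop API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical key-value store on an updatable log. The index maps each
// key to the offset and length of its current value record in the log.
public class KvLogSketch {
    private byte[] log = new byte[0];
    private final Map<String, int[]> index = new HashMap<>(); // key -> {offset, length}

    public void put(String key, String value) {
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        int[] slot = index.get(key);
        if (slot != null && slot[1] == v.length) {
            // Same length: update in place (the proposed HDFS capability).
            System.arraycopy(v, 0, log, slot[0], v.length);
        } else {
            // Different length (or new key): append, as HDFS allows today.
            int offset = log.length;
            log = Arrays.copyOf(log, offset + v.length);
            System.arraycopy(v, 0, log, offset, v.length);
            index.put(key, new int[]{offset, v.length});
        }
    }

    public String get(String key) {
        int[] slot = index.get(key);
        if (slot == null) return null;
        return new String(log, slot[0], slot[1], StandardCharsets.UTF_8);
    }

    public int logSize() { return log.length; }
}
```

Note how a same-length update leaves the log size unchanged, while a different-length update falls back to today's append-only behavior.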
> h2. MVCC
> A modified example of how MVCC can be implemented with the proposed
> enhancements, adapted from PostgreSQL’s MVCC, is given hereunder:
> https://wiki.postgresql.org/wiki/MVCC
> http://momjian.us/main/writings/pgsql/mvcc.pdf
> || Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments ||
> | 1 | Insert | 40 | MAX_VAL | Conventionally MAX_VAL is null; to keep the update size fixed, MAX_VAL is pre-seeded for our purposes. |
> | 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47. |
> | 2 | Update (old delete) | 64 | 78 | Mark the old data as DELETE. |
> | 2 | Update (new insert) | 78 | MAX_VAL | Insert the new data. |
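The counter columns translate directly into a visibility check. A minimal sketch, assuming the counter semantics shown in the table (everything else is illustrative):

```java
// Minimal MVCC visibility sketch: a row is visible to a transaction with
// counter c iff createCounter <= c < expiryCounter. Because the expiry
// counter is pre-seeded with MAX_VAL, a later delete is a same-size
// in-place update of that field, fitting the update-in-place rules.
public class MvccSketch {
    static final long MAX_VAL = Long.MAX_VALUE;

    static boolean isVisible(long createCounter, long expiryCounter, long c) {
        return createCounter <= c && c < expiryCounter;
    }
}
```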
> h2. Graph Stores
> Enable native storage and processing for a variety of graph stores.
> h3. Graph Store 1 (Spark GraphX)
> 1. EdgeTable(pid, src, dst, data): stores the adjacency
> structure and edge data. Each edge is represented as a
> tuple consisting of the source vertex id, destination vertex id,
> and user-defined data as well as a virtual partition identifier
> (pid). Note that the edge table contains only the vertex ids
> and not the vertex data. The edge table is partitioned by the
> pid.
> 2. VertexDataTable(id, data): stores the vertex data,
> in the form of a vertex (id, data) pairs. The vertex data table
> is indexed and partitioned by the vertex id.
> 3. VertexMap(id, pid): provides a mapping from the id
> of a vertex to the ids of the virtual partitions that contain
> adjacent edges.
> h3. Graph Store 2 (Facebook Social Graph - TAO)
> Object: (id) → (otype, (key → value)*)
> Assoc.: (id1, atype, id2) → (time, (key → value)*)
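For the GraphX-style layout, the VertexMap can be derived from the EdgeTable. A small illustrative sketch (the record and method names are mine, not GraphX APIs):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of the GraphX-style layout above: the VertexMap
// (vertex id -> partition ids of adjacent edges) is derived from the
// EdgeTable by scanning each edge's endpoints.
public class GraphStoreSketch {
    public record Edge(int pid, int src, int dst) {}

    public static Map<Integer, Set<Integer>> buildVertexMap(List<Edge> edgeTable) {
        Map<Integer, Set<Integer>> vertexMap = new HashMap<>();
        for (Edge e : edgeTable) {
            vertexMap.computeIfAbsent(e.src(), k -> new TreeSet<>()).add(e.pid());
            vertexMap.computeIfAbsent(e.dst(), k -> new TreeSet<>()).add(e.pid());
        }
        return vertexMap;
    }

    // Demo: vertex 2 touches edges in partitions 0 and 1.
    public static String demo() {
        List<Edge> edges = List.of(new Edge(0, 1, 2), new Edge(1, 2, 3));
        return buildVertexMap(edges).get(2).toString();
    }
}
```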
> h2. Web
> With the AHA enhancements, a variety of Web standards can be natively
> supported, such as updatable JSON (http://json.org/), XML, RDF, and other
> documents.
> h3. RDF
> RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
> RDF Triple: http://www.w3.org/TR/2014/REC-n-triples-20140225/
> The simplest triple statement is a sequence of (subject, predicate, object)
> terms, separated by whitespace and terminated by '.' after each triple.
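As an illustration of the triple shape, a simplified formatter (real N-Triples also requires escaping inside IRIs and literals per the spec, which is omitted here):

```java
// Simplified N-Triples sketch: a statement is subject, predicate, object
// separated by whitespace and terminated by ' .'. IRI/literal escaping
// required by the real specification is omitted for brevity.
public class NTriplesSketch {
    static String triple(String subjectIri, String predicateIri, String objectIri) {
        return "<" + subjectIri + "> <" + predicateIri + "> <" + objectIri + "> .";
    }
}
```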
> h2. Mobile Apps Data and Resources
> With the enhancements proposed, in addition to the Web, app data and
> resources can also be managed using Hadoop. Examples of such usage include
> app data and resources for Apple and other app stores.
> About Apps Resources:
> https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html
>
> On-Demand Resources Essentials:
> https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/
>
> Resource Programming Guide:
> https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf
>
> h2. Temporal Data
> https://en.wikipedia.org/wiki/Temporal_database
> https://en.wikipedia.org/wiki/Valid_time
> In a temporal database, data may be updated to reflect changes over time.
> For example, the data changes from
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
> to
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
> Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
> Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)
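The correction above is an interval split: one Bigtown row becomes three rows once the Beachy residence is learned. A sketch of the split step, with fields and dates following the example (the record and helper names are illustrative):

```java
import java.time.LocalDate;
import java.util.List;

// Sketch of the valid-time correction in the example above: learning that
// John Doe lived in Beachy from 1-Jun-1995 to 3-Sep-2000 splits the single
// Bigtown row into three rows.
public class TemporalSketch {
    public record Person(String name, String city, LocalDate from, LocalDate to) {}

    // Split an existing row around a correction that lies strictly inside it.
    public static List<Person> split(Person old, String newCity,
                                     LocalDate cFrom, LocalDate cTo) {
        return List.of(
            new Person(old.name(), old.city(), old.from(), cFrom),
            new Person(old.name(), newCity, cFrom, cTo),
            new Person(old.name(), old.city(), cTo, old.to()));
    }

    // Demo using the dates from the example.
    public static String demo() {
        Person bigtown = new Person("John Doe", "Bigtown",
            LocalDate.of(1994, 8, 26), LocalDate.of(2001, 4, 1));
        List<Person> rows = split(bigtown, "Beachy",
            LocalDate.of(1995, 6, 1), LocalDate.of(2000, 9, 3));
        return rows.get(1).city() + " " + rows.get(2).from();
    }
}
```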
> h2. Media
> Media production typically involves many changes and updates prior to
> release. The enhancements will lay a basis for the full lifecycle to be
> managed in the Hadoop ecosystem.
> h2. Indexes
> With these changes, a variety of updatable indexes can be supported
> natively in Hadoop. Search software such as Solr, Elasticsearch, etc. can
> then in turn leverage Hadoop’s enhanced native capabilities.
> h2. Natural Support for ETL and Analytics
> With native support for updates and deletes in addition to appends/inserts,
> Hadoop will have proper and natural support for ETL and Analytics.
> h2. Google References
> While Google’s research in this area is interesting (some extracts are
> listed hereunder), the evolution of Hadoop is notable in its own right. The
> proposed in-place-update enhancements to core Hadoop will enable, and make
> easier, a variety of enhancements to each of the Hadoop components.
> We propose a basis for a system that incrementally processes updates to
> large data sets, reducing the overhead of always having to run large
> batches. Hadoop engines can then dynamically choose the processing style
> based on the type and volume of the data sets, enhancing or replacing
> prevailing approaches.
> || Year || Title || Links ||
> | 2015 | Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform | http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/ |
> | 2014 | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf |
> | 2013 | F1: A Distributed SQL Database That Scales | http://research.google.com/pubs/pub41344.html |
> | 2013 | Online, Asynchronous Schema Change in F1 | http://research.google.com/pubs/pub41376.html |
> | 2013 | Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf |
> | 2012 | F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business | http://research.google.com/pubs/pub38125.html |
> | 2012 | Spanner: Google's Globally-Distributed Database | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf |
> | 2012 | Clydesdale: structured data processing on MapReduce | http://dl.acm.org/citation.cfm?doid=2247596.2247600 |
> | 2011 | Megastore: Providing Scalable, Highly Available Storage for Interactive Services | http://research.google.com/pubs/pub36971.html |
> | 2011 | Tenzing: A SQL Implementation On The MapReduce Framework | http://research.google.com/pubs/pub37200.html |
> | 2010 | Dremel: Interactive Analysis of Web-Scale Datasets | http://research.google.com/pubs/pub36632.html |
> | 2010 | FlumeJava: Easy, Efficient Data-Parallel Pipelines | http://research.google.com/pubs/pub35650.html |
> | 2010 | Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications | http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf |
> h2. Application Domains
> The enhancements will lay a path for comprehensive support of all
> application domains in Hadoop. A small collection is given hereunder:
> • Data Warehousing and Enhanced ETL processing
> • Supply Chain Planning
> • Web Sites
> • Mobile App Stores
> • Financials
> • Media
> • Machine Learning
> • Social Media
> • Enterprise Applications such as ERP, CRM
> Corresponding umbrella JIRAs can be found for each of the following Hadoop
> platform components.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)