[
https://issues.apache.org/jira/browse/HADOOP-12620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15045576#comment-15045576
]
Allen Wittenauer commented on HADOOP-12620:
-------------------------------------------
Given that every update to this JIRA is going to send the body of the
description out, can we move the majority of the body out to a comment or an
attachment? Thanks.
> Advanced Hadoop Architecture (AHA) - Common
> -------------------------------------------
>
> Key: HADOOP-12620
> URL: https://issues.apache.org/jira/browse/HADOOP-12620
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Dinesh S. Atreya
>
> h1. Advanced Hadoop Architecture (AHA) / Advanced Hadoop Adaptabilities (AHA)
> One main motivation for this JIRA is to address a comprehensive set of use
> cases with minimal enhancements to Hadoop, transitioning Hadoop from a
> Modern Data Architecture to an Advanced/Cloud Data Architecture.
> HDFS traditionally had a write-once-read-many access model for files until
> the introduction of the “Append to files in HDFS” capability. The next
> minimal enhancements to core Hadoop add the capability to perform
> “updates-in-place” in HDFS:
> • Support seeks for writes (in addition to reads).
> • After a seek, if the new byte length is the same as the old byte length,
> an in-place update is allowed.
> • A delete is an update with an appropriate delete marker.
> • If the byte length is different, the old entry is marked as deleted and
> the new one is appended as before.
> • It is at the client’s discretion to perform an update, an append, or
> both; the API changes in the different Hadoop components should provide
> these capabilities.
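The rules above can be sketched in miniature. The seek-for-write HDFS API is the proposed enhancement and does not exist yet; here java.io.RandomAccessFile on a local file stands in for an HDFS stream purely to illustrate the intended semantics:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the proposed update-in-place rules. RandomAccessFile is a
// stand-in for an HDFS file with seek-for-write; the HDFS capability
// itself is hypothetical.
public class InPlaceUpdateSketch {

    // After a seek, an in-place update is allowed only when the new bytes
    // are the same length as the bytes they replace.
    static void updateInPlace(RandomAccessFile f, long offset, byte[] newBytes,
                              int oldLength) throws IOException {
        if (newBytes.length != oldLength) {
            // Different length: the old record would instead be marked as
            // deleted and the new record appended, as HDFS allows today.
            throw new IllegalArgumentException(
                "length changed; mark-delete and append instead");
        }
        f.seek(offset);
        f.write(newBytes);
    }

    // Small end-to-end demo: overwrite "world" with "earth" (same length).
    public static String demo() {
        try {
            Path p = Files.createTempFile("aha-sketch", ".dat");
            Files.write(p, "hello world".getBytes(StandardCharsets.UTF_8));
            try (RandomAccessFile f = new RandomAccessFile(p.toFile(), "rw")) {
                updateInPlace(f, 6, "earth".getBytes(StandardCharsets.UTF_8), 5);
            }
            String out = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            Files.delete(p);
            return out;
        } catch (IOException e) {
            return null;
        }
    }
}
```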
> These minimal changes will enable laying the basis for transforming the core
> Hadoop to an interactive and real-time platform and introducing significant
> native capabilities to Hadoop. These enhancements will lay a foundation for
> all of the following processing styles to be supported natively and
> dynamically.
> • Real time
> • Mini-batch
> • Stream based data processing
> • Batch – which is the default now.
> Hadoop engines can dynamically choose processing style to use based on the
> type of data and volume of data sets and enhance/replace prevailing
> approaches.
> With this, Hadoop engines can evolve to utilize modern CPU, memory, and I/O
> resources with increasing efficiency. The Hadoop task engines can use
> vectorized/pipelined processing and make greater use of memory throughout
> the Hadoop platform.
> These will enable enhanced performance optimizations to be implemented in
> HDFS and made available to all the Hadoop components. This will enable fast
> processing of big data and enhance all three of its characteristics:
> volume, velocity, and variety.
> There are many influences for this umbrella JIRA:
> • Preserve and Accelerate Hadoop
> • Efficient Data Management of variety of Data Formats natively in Hadoop
> • Enterprise Expansion
> • Internet and Media
> • Databases offer native support for a variety of data formats such as
> JSON and XML, plus indexes, temporal data, etc. – Hadoop should do the same.
> It is likely that many sub-JIRAs will be created to address portions of
> this. This JIRA captures a variety of use-cases in one place.
> Some initial data management/platform use-cases are given hereunder:
> h2. Key-Value Store
> With the proposed enhancements, it will become very convenient to implement
> Key-Value Store natively in Hadoop.
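As one illustration of why the proposed capability makes this convenient, here is a small hypothetical sketch of a key-value store over an updatable log. A byte array stands in for an HDFS file; all names are illustrative, not an existing Hadoop API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical key-value store on an updatable log. The index maps each
// key to the offset and length of its current value record in the log.
public class KvLogSketch {
    private byte[] log = new byte[0];
    private final Map<String, int[]> index = new HashMap<>(); // key -> {offset, length}

    public void put(String key, String value) {
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        int[] slot = index.get(key);
        if (slot != null && slot[1] == v.length) {
            // Same length: update in place (the proposed HDFS capability).
            System.arraycopy(v, 0, log, slot[0], v.length);
        } else {
            // Different length (or new key): append, as HDFS allows today.
            int offset = log.length;
            log = Arrays.copyOf(log, offset + v.length);
            System.arraycopy(v, 0, log, offset, v.length);
            index.put(key, new int[]{offset, v.length});
        }
    }

    public String get(String key) {
        int[] slot = index.get(key);
        if (slot == null) return null;
        return new String(log, slot[0], slot[1], StandardCharsets.UTF_8);
    }

    public int logSize() { return log.length; }
}
```

Note how a same-length update leaves the log size unchanged, while a different-length update falls back to today's append-only behavior.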
> h2. MVCC
> A modified example of how MVCC can be implemented with the proposed
> enhancements, adapted from PostgreSQL’s MVCC, is given hereunder:
> https://wiki.postgresql.org/wiki/MVCC
> http://momjian.us/main/writings/pgsql/mvcc.pdf
> || Data ID || Activity || Data Create Counter || Data Expiry Counter || Comments ||
> | 1 | Insert | 40 | MAX_VAL | Conventionally MAX_VAL is null; to keep the update size fixed, MAX_VAL is pre-seeded for our purposes. |
> | 1 | Delete | 40 | 47 | Marked as deleted when the current counter was 47. |
> | 2 | Update (old delete) | 64 | 78 | Mark the old data as DELETE. |
> | 2 | Update (new insert) | 78 | MAX_VAL | Insert the new data. |
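The counter columns translate directly into a visibility check. A minimal sketch, assuming the counter semantics shown in the table (everything else is illustrative):

```java
// Minimal MVCC visibility sketch: a row is visible to a transaction with
// counter c iff createCounter <= c < expiryCounter. Because the expiry
// counter is pre-seeded with MAX_VAL, a later delete is a same-size
// in-place update of that field, fitting the update-in-place rules.
public class MvccSketch {
    static final long MAX_VAL = Long.MAX_VALUE;

    static boolean isVisible(long createCounter, long expiryCounter, long c) {
        return createCounter <= c && c < expiryCounter;
    }
}
```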
> h2. Graph Stores
> Enable native storage and processing for a variety of graph stores.
> h3. Graph Store 1 (Spark GraphX)
> 1. EdgeTable(pid, src, dst, data): stores the adjacency
> structure and edge data. Each edge is represented as a
> tuple consisting of the source vertex id, destination vertex id,
> and user-defined data as well as a virtual partition identifier
> (pid). Note that the edge table contains only the vertex ids
> and not the vertex data. The edge table is partitioned by the
> pid.
> 2. VertexDataTable(id, data): stores the vertex data,
> in the form of a vertex (id, data) pairs. The vertex data table
> is indexed and partitioned by the vertex id.
> 3. VertexMap(id, pid): provides a mapping from the id
> of a vertex to the ids of the virtual partitions that contain
> adjacent edges.
> h3. Graph Store 2 (Facebook Social Graph - TAO)
> Object: (id) → (otype, (key → value)*)
> Assoc.: (id1, atype, id2) → (time, (key → value)*)
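For the GraphX-style layout, the VertexMap can be derived from the EdgeTable. A small illustrative sketch (the record and method names are mine, not GraphX APIs):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of the GraphX-style layout above: the VertexMap
// (vertex id -> partition ids of adjacent edges) is derived from the
// EdgeTable by scanning each edge's endpoints.
public class GraphStoreSketch {
    public record Edge(int pid, int src, int dst) {}

    public static Map<Integer, Set<Integer>> buildVertexMap(List<Edge> edgeTable) {
        Map<Integer, Set<Integer>> vertexMap = new HashMap<>();
        for (Edge e : edgeTable) {
            vertexMap.computeIfAbsent(e.src(), k -> new TreeSet<>()).add(e.pid());
            vertexMap.computeIfAbsent(e.dst(), k -> new TreeSet<>()).add(e.pid());
        }
        return vertexMap;
    }

    // Demo: vertex 2 touches edges in partitions 0 and 1.
    public static String demo() {
        List<Edge> edges = List.of(new Edge(0, 1, 2), new Edge(1, 2, 3));
        return buildVertexMap(edges).get(2).toString();
    }
}
```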
> h2. Web
> With the AHA enhancements, a variety of Web standards can be natively
> supported, such as updatable JSON (http://json.org/), XML, RDF, and other
> documents.
> h3. RDF
> RDF Schema 1.1: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
> RDF Triple: http://www.w3.org/TR/2014/REC-n-triples-20140225/
> The simplest triple statement is a sequence of (subject, predicate, object)
> terms, separated by whitespace and terminated by '.' after each triple.
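As an illustration of the triple shape, a simplified formatter (real N-Triples also requires escaping inside IRIs and literals per the spec, which is omitted here):

```java
// Simplified N-Triples sketch: a statement is subject, predicate, object
// separated by whitespace and terminated by ' .'. IRI/literal escaping
// required by the real specification is omitted for brevity.
public class NTriplesSketch {
    static String triple(String subjectIri, String predicateIri, String objectIri) {
        return "<" + subjectIri + "> <" + predicateIri + "> <" + objectIri + "> .";
    }
}
```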
> h2. Mobile Apps Data and Resources
> With the enhancements proposed, in addition to the Web, app data and
> resources can also be managed using Hadoop. Examples of such usage include
> app data and resources for Apple and other app stores.
> About Apps Resources:
> https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/Introduction/Introduction.html
>
> On-Demand Resources Essentials:
> https://developer.apple.com/library/prerelease/ios/documentation/FileManagement/Conceptual/On_Demand_Resources_Guide/
>
> Resource Programming Guide:
> https://developer.apple.com/library/ios/documentation/Cocoa/Conceptual/LoadingResources/LoadingResources.pdf
>
> h2. Temporal Data
> https://en.wikipedia.org/wiki/Temporal_database
> https://en.wikipedia.org/wiki/Valid_time
> In a temporal database, data may be updated to reflect changes over time.
> For example, the data changes from
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Apr-2001)
> to
> Person(John Doe, Smallville, 3-Apr-1975, 26-Aug-1994)
> Person(John Doe, Bigtown, 26-Aug-1994, 1-Jun-1995)
> Person(John Doe, Beachy, 1-Jun-1995, 3-Sep-2000)
> Person(John Doe, Bigtown, 3-Sep-2000, 1-Apr-2001)
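The correction above is an interval split: one Bigtown row becomes three rows once the Beachy residence is learned. A sketch of the split step, with fields and dates following the example (the record and helper names are illustrative):

```java
import java.time.LocalDate;
import java.util.List;

// Sketch of the valid-time correction in the example above: learning that
// John Doe lived in Beachy from 1-Jun-1995 to 3-Sep-2000 splits the single
// Bigtown row into three rows.
public class TemporalSketch {
    public record Person(String name, String city, LocalDate from, LocalDate to) {}

    // Split an existing row around a correction that lies strictly inside it.
    public static List<Person> split(Person old, String newCity,
                                     LocalDate cFrom, LocalDate cTo) {
        return List.of(
            new Person(old.name(), old.city(), old.from(), cFrom),
            new Person(old.name(), newCity, cFrom, cTo),
            new Person(old.name(), old.city(), cTo, old.to()));
    }

    // Demo using the dates from the example.
    public static String demo() {
        Person bigtown = new Person("John Doe", "Bigtown",
            LocalDate.of(1994, 8, 26), LocalDate.of(2001, 4, 1));
        List<Person> rows = split(bigtown, "Beachy",
            LocalDate.of(1995, 6, 1), LocalDate.of(2000, 9, 3));
        return rows.get(1).city() + " " + rows.get(2).from();
    }
}
```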
> h2. Media
> Media production typically involves many changes and updates prior to
> release. The enhancements will lay a basis for the full lifecycle to be
> managed in the Hadoop ecosystem.
> h2. Indexes
> With these changes, a variety of updatable indexes can be supported
> natively in Hadoop. Search software such as Solr, Elasticsearch, etc. can
> then in turn leverage Hadoop’s enhanced native capabilities.
> h2. Natural Support for ETL and Analytics
> With native support for updates and deletes in addition to appends/inserts,
> Hadoop will have proper and natural support for ETL and Analytics.
> h2. Google References
> While Google’s research in this area is interesting (some extracts are
> listed hereunder), the evolution of Hadoop is notable in its own right. The
> proposed in-place-update enhancements to core Hadoop will enable, and make
> easier, a variety of enhancements to each of the Hadoop components.
> We propose a basis for a system that incrementally processes updates to
> large data sets, reducing the overhead of always having to run large
> batches. Hadoop engines can then dynamically choose the processing style
> based on the type and volume of the data sets, enhancing or replacing
> prevailing approaches.
> || Year || Title || Links ||
> | 2015 | Announcing Google Cloud Bigtable: The same database that powers Google Search, Gmail and Analytics is now available on Google Cloud Platform | http://googlecloudplatform.blogspot.co.uk/2015/05/introducing-Google-Cloud-Bigtable.html https://cloud.google.com/bigtable/ |
> | 2014 | Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/42851.pdf |
> | 2013 | F1: A Distributed SQL Database That Scales | http://research.google.com/pubs/pub41344.html |
> | 2013 | Online, Asynchronous Schema Change in F1 | http://research.google.com/pubs/pub41376.html |
> | 2013 | Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41318.pdf |
> | 2012 | F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business | http://research.google.com/pubs/pub38125.html |
> | 2012 | Spanner: Google's Globally-Distributed Database | http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/39966.pdf |
> | 2012 | Clydesdale: structured data processing on MapReduce | http://dl.acm.org/citation.cfm?doid=2247596.2247600 |
> | 2011 | Megastore: Providing Scalable, Highly Available Storage for Interactive Services | http://research.google.com/pubs/pub36971.html |
> | 2011 | Tenzing: A SQL Implementation On The MapReduce Framework | http://research.google.com/pubs/pub37200.html |
> | 2010 | Dremel: Interactive Analysis of Web-Scale Datasets | http://research.google.com/pubs/pub36632.html |
> | 2010 | FlumeJava: Easy, Efficient Data-Parallel Pipelines | http://research.google.com/pubs/pub35650.html |
> | 2010 | Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications | http://research.google.com/pubs/pub36726.html https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf |
> h2. Application Domains
> The enhancements will lay a path for comprehensive support of all
> application domains in Hadoop. A small collection is given hereunder:
> • Data Warehousing and Enhanced ETL processing
> • Supply Chain Planning
> • Web Sites
> • Mobile App Stores
> • Financials
> • Media
> • Machine Learning
> • Social Media
> • Enterprise Applications such as ERP, CRM
> Corresponding umbrella JIRAs can be found for each of the following Hadoop
> platform components.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)