Re: [VOTE] Bring Griffin to Apache Incubator

Gangumalla, Uma Thu, 01 Dec 2016 11:24:07 -0800

+1 (binding)

Regards,
Uma


On 11/30/16, 10:40 PM, "Henry Saputra" <[email protected]> wrote:

>Hi All,
>
>As the champion for Griffin, I would like to start VOTE to bring  the
>project as Apache incubator podling.
>
>Here is the direct quote from the abstract:
>
>"
>Griffin is a Data Quality Service platform built on Apache Hadoop and
>Apache Spark. It provides a framework process for defining data
>quality model, executing data quality measurement, automating data
>profiling and validation, as well as a unified data quality
>visualization across multiple data systems. It tries to address the
>data quality challenges in big data and streaming context.
>"
>
>Please cast your vote:
>
>[ ] +1, bring Griffin into Incubator
>[ ] +0, I don't care either way,
>[ ] -1, do not bring Griffin into Incubator, because...
>
>This vote will be open at least for 72 hours and only votes from the
>Incubator PMC are binding.
>
>The VOTE will end 12/5 9am PST to pass through weekend.
>
>
>Here is the link to the proposal:
>
>https://wiki.apache.org/incubator/GriffinProposal
>
>I have copied the proposal below for easy access
>
>
>Thanks,
>
>- Henry
>
>
>
>Griffin Proposal
>
>Abstract
>
>Griffin is a Data Quality Service platform built on Apache Hadoop and
>Apache Spark. It provides a framework process for defining data
>quality model, executing data quality measurement, automating data
>profiling and validation, as well as a unified data quality
>visualization across multiple data systems. It tries to address the
>data quality challenges in big data and streaming context.
>
>Proposal
>
>Griffin is a open source Data Quality solution for distributed data
>systems at any scale in both streaming or batch data context. When
>people use open source products (e.g. Apache Hadoop, Apache Spark,
>Apache Kafka, Apache Storm), they always need a data quality service
>to build his/her confidence on data quality processed by those
>platforms. Griffin creates a unified process to define and construct
>data quality measurement pipeline across multiple data systems to
>provide:
>
>Automatic quality validation of the data
>Data profiling and anomaly detection
>Data quality lineage from upstream to downstream data systems.
>Data quality health monitoring visualization
>Shared infrastructure resource management
>
>Overview of Griffin
>
>Griffin has been deployed in production at eBay serving major data
>systems, it takes a platform approach to provide generic features to
>solve common data quality validation pain points. Firstly, user can
>register the data asset which user wants to do data quality check. The
>data asset can be batch data in RDBMS (e.g.Teradata), Apache Hadoop
>system or near real-time streaming data from Apache Kafka, Apache
>Storm and other real time data platforms. Secondly, user can create
>data quality model to define the data quality rule and metadata.
>Thirdly, the model or rule will be executed automatically (by the
>model engine) to get the sample data quality validation results in a
>few seconds for streaming data. Finally, user can analyze the data
>quality results through built-in visualization tool to take actions.
>
>Griffin includes:
>
>Data Quality Model Engine
>
>Griffin is model driven solution, user can choose various data quality
>dimension to execute his/her data quality validation based on selected
>target data-set or source data-set ( as the golden reference data). It
>has a corresponding library supporting it in back-end for the
>following measurement:
>
>Accuracy - Does data reflect the real-world objects or a verifiable source
>Completeness - Is all necessary data present
>Validity - Are all data values within the data domains specified by the
>business
>Timeliness - Is the data available at the time needed
>Anomaly detection - Pre-built algorithm functions for the
>identification of items, events or observations which do not conform
>to an expected pattern or other items in a dataset
>Data Profiling - Apply statistical analysis and assessment of data
>values within a dataset for consistency, uniqueness and logic.
>
>Data Collection Layer
>
>We support two kinds of data sources, batch data and real time data.
>
>For batch mode, we can collect data source from Apache Hadoop based
>platform by various data connectors.
>
>For real time mode, we can connect with messaging system like Kafka to
>near real time analysis.
>
>Data Process and Storage Layer
>
>For batch analysis, our data quality model will compute data quality
>metrics in our spark cluster based on data source in Apache Hadoop.
>
>For near real time analysis, we consume data from messaging system,
>then our data quality model will compute our real time data quality
>metrics in our spark cluster. for data storage, we use time series
>database in our back end to fulfill front end request.
>
>Griffin Service
>
>We have RESTful web services to accomplish all the functionalities of
>Griffin, such as register data asset, create data quality model,
>publish metrics, retrieve metrics, add subscription, etc. So, the
>developers can develop their own user interface based on these web
>services.
>
>Background
>
>At eBay, when people play with big data in Apache Hadoop (or other
>streaming data), data quality often becomes one big challenge.
>Different teams have built customized data quality tools to detect and
>analyze data quality issues within their own domain. We are thinking
>to take a platform approach to provide shared Infrastructure and
>generic features to solve common data quality pain points. This would
>enable us to build trusted data assets.
>
>Currently it¹s very difficult and costly to do data quality validation
>when we have big data flow across multi-platforms at eBay (e.g.
>Oracle, Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka,
>MongoDB). Take eBay real time personalization platform as an example.
>Every day we have to validate data quality status for ~600M records (
>imagine we have 150M active users for our website). Data quality often
>becomes one big challenge both in its streaming and batch pipelines.
>
>So we conclude 3 data quality problems at eBay:
>
>Lack of end2end unified view of data quality measurement from multiple
>data sources to target applications, it usually takes a long time to
>identify and fix poor data quality.
>How to get data quality measured in streaming mode, we need to have a
>process and tool to visualize data quality insights through
>registering dataset which you want to check data quality, creating
>data quality measurement model, executing the data quality validation
>job and getting metrics insights for action taking.
>No Shared platform and API Service, have to apply and manage own
>hardware and software infrastructure.
>
>Rationale
>
>The challenge we face at eBay is that our data volume is becoming
>bigger and bigger, system processes become more complex, while we do
>not have a unified data quality solution to ensure the trusted data
>sets which provide confidences on data quality to our data consumers.
>The key challenges on data quality includes:
>
>Existing commercial data quality solution cannot address data quality
>lineage among systems, cannot scale out to support fast growing data
>at eBay
>Existing eBay's domain specific tools take a long time to identify and
>fix poor data quality when data flowed through multiple systems
>Business logic becomes complex, requires data quality system much
>flexible.
>
>Some data quality issues do have business impact on user experiences,
>revenue, efficiency & compliance.
>
>Communication overhead of data quality metrics, typically in a big
>organization, which involve different teams.
>
>The idea of Griffin is to provide Data Quality validation as a
>Service, to allow data engineers and data consumers to have:
>
>Near real-time understanding of the data quality health of your data
>pipelines with end-to-end monitoring, all in one place.
>Profiling, detecting and correlating issues and providing
>recommendations that drive rapid and focused troubleshooting
>A centralized data quality model management system including rule,
>metadata, scheduler etc.
>Native code generation to run everywhere, including Hadoop, Kafka, Spark,
>etc.
>One set of tools to build data quality pipelines across all eBay data
>platforms.
>
>Current Status
>
>Meritocracy
>
>Griffin has been deployed in production at eBay and provided the
>centralized data quality service for several eBay systems ( for
>example, real time personalization platform, eBay real time ID linking
>platform, Hadoop datasets, Site speed analytics platform). Our aim is
>to build a diverse developer and user community following the Apache
>meritocracy model. We will encourage contributions and participation
>of all types of work, and ensure that contributors are appropriately
>recognized.
>
>Community
>
>Currently the project is being developed at eBay. It's only for eBay
>internal community. Griffin seeks to develop the developer and user
>communities during incubation. We believe it will grow substantially
>by becoming an Apache project.
>
>Core Developers
>
>Griffin is currently being designed and developed by engineers from
>eBay Inc.  William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
>All of these core developers have deep expertise in Apache Hadoop and
>the Hadoop Ecosystem in general.
>
>Alignment
>
>The ASF is a natural host for Griffin given that it is already the
>home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
>emerging big data products. Those are requiring data quality solution
>by nature to ensure the data quality which they processed. When people
>use open source data technology, the big question to them is that how
>we can ensure the data quality in it. Griffin leverages lot of Apache
>open-source products. Griffin was designed to enable real time
>insights into data quality validation by shared Infrastructure and
>generic features to solve common data quality pain points.
>
>Known Risks
>
>Orphaned Products
>
>The core developers of Griffin team work full time on this project.
>There is no risk of Griffin getting orphaned since at least one large
>company (eBay) is extensively using it in their production Hadoop and
>Spark clusters for multiple data systems. For example, currently there
>are 4 data systems at eBay (real time personalization platform, eBay
>real time ID linking platform, Hadoop, Site speed analytics platform)
>are leveraging Griffin, with more than ~600M records for data quality
>status validation every day, 35 data sets being monitored, 50+ data
>quality models have been created.
>
>As Griffin is designed to connect many types of data sources, we are
>very confident that they will use Griffin as a service for ensuring
>the data quality in open source data ecosystems. We plan to extend and
>diversify this community further through Apache.
>
>Inexperience with Open Source
>
>Griffin's core engineers are all active users and followers of open
>source projects. They are already committers and contributors to the
>Griffin Github project. All have been involved with the source code
>that has been released under an open source license, and several of
>them also have experience developing code in an open source
>environment. Though the core set of Developers do not have Apache Open
>Source experience, there are plans to onboard individuals with Apache
>open source experience on to the project.
>
>Homogenous Developers
>
>The core developers are from eBay. Apache Incubation process
>encourages an open and diverse meritocratic community. Griffin intends
>to make every possible effort to build a diverse, vibrant and involved
>community. We are committed to recruiting additional committers from
>other companies based on their contribution to the project.
>
>Reliance on Salaried Developers
>
>eBay invested in Griffin as a company-wide data quality service
>platform and some of its key engineers are working full time on the
>project. they are all paid by eBay. We look forward to other Apache
>developers and researchers to contribute to the project.
>
>Relationships with Other Apache Products
>
>Griffin has a strong relationship and dependency with Apache Hadoop,
>Apache HBase, Apache Spark, Apache Kafka and Apache Storm, Apache
>Hive. In addition, since there is a growing need for data quality
>solution for open source platform (e.g. Hadoop, Kafka, Spark etc),
>being part of Apache¹s Incubation community, could help with a closer
>collaboration among these four projects and as well as others.
>
>Documentation
>
>Information about Griffin can be found at https://github.com/eBay/griffin
>
>Initial Source
>
>Griffin has been under development since early 2016 by a team of
>engineers at eBay Inc. It is currently hosted on Github.com under an
>Apache license 2.0 at https://github.com/eBay/griffin . Once in
>incubation we will be moving the code base to apache git library.
>
>External Dependencies
>
>Griffin has the following external dependencies.
>
>Basic
>
>JDK 1.7+
>Scala
>Apache Maven
>JUnit
>Log4j
>Slf4j
>Apache Commons
>
>Hadoop
>
>Apache Hadoop
>Apache HBase
>Apache Hive
>
>DB
>
>InfluxData
>
>Apache Spark
>
>Spark Core Library
>
>REST Service
>
>Jersey
>Spring MVC
>
>Web frontend
>
>AngularJS
>jQuery
>Bootstrap
>RequireJS
>eCharts
>Font Awesome
>
>Cryptography
>
>Currently there's no cryptography in Griffin.
>
>Required Resources
>
>Mailing List
>
>We currently use eBay mail box to communicate, but we'd like to move
>that to ASF maintained mailing lists.
>
>Current mailing list: [email protected]
>
>Proposed ASF maintained lists:
>
>[email protected]
>
>[email protected]
>
>[email protected]
>
>Subversion Directory
>
>Git is the preferred source control system.
>
>Issue Tracking
>
>JIRA
>
>Other Resources
>
>The existing code already has unit tests so we will make use of
>existing Apache continuous testing infrastructure. The resulting load
>should not be very large.
>
>Initial Committers
>
>William Go
>Alex Lv
>Vincent Zhao
>Shawn Sha
>John Liu
>Liang Shao
>
>Affiliations
>
>The initial committers are employees of eBay Inc.
>
>Sponsors
>
>Champion
>
>Henry Saputra ([email protected])
>
>Nominated Mentors
>
>Kasper Sørensen ([email protected])
>
>Uma Maheswara Rao Gangumalla ([email protected])
>
>Luciano Resende ([email protected])
>
>Sponsoring Entity
>
>We are requesting the Incubator to sponsor this project.
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [email protected]
>For additional commands, e-mail: [email protected]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [VOTE] Bring Griffin to Apache Incubator

Reply via email to