Re: [VOTE] Bring Griffin to Apache Incubator

Jacques Nadeau Thu, 01 Dec 2016 08:48:51 -0800

+1 (binding)

On Thu, Dec 1, 2016 at 8:47 AM, Andrew Purtell <[email protected]>
wrote:


> +1 (binding)
>
> > On Dec 1, 2016, at 8:35 AM, Felix Cheung <[email protected]> wrote:
> >
> > +1
> >
> > On Wed, Nov 30, 2016 at 10:40 PM Henry Saputra <[email protected]>
> > wrote:
> >
> >> Hi All,
> >>
> >> As the champion for Griffin, I would like to start a VOTE to bring the
> >> project in as an Apache Incubator podling.
> >>
> >> Here is the direct quote from the abstract:
> >>
> >> "
> >> Griffin is a Data Quality Service platform built on Apache Hadoop and
> >> Apache Spark. It provides a framework process for defining data
> >> quality model, executing data quality measurement, automating data
> >> profiling and validation, as well as a unified data quality
> >> visualization across multiple data systems. It tries to address the
> >> data quality challenges in big data and streaming context.
> >> "
> >>
> >> Please cast your vote:
> >>
> >> [ ] +1, bring Griffin into Incubator
> >> [ ] +0, I don't care either way,
> >> [ ] -1, do not bring Griffin into Incubator, because...
> >>
> >> This vote will be open for at least 72 hours and only votes from the
> >> Incubator PMC are binding.
> >>
> >> The VOTE will end on 12/5 at 9am PST so that it spans the weekend.
> >>
> >>
> >> Here is the link to the proposal:
> >>
> >> https://wiki.apache.org/incubator/GriffinProposal
> >>
> >> I have copied the proposal below for easy access
> >>
> >>
> >> Thanks,
> >>
> >> - Henry
> >>
> >>
> >>
> >> Griffin Proposal
> >>
> >> Abstract
> >>
> >> Griffin is a Data Quality Service platform built on Apache Hadoop and
> >> Apache Spark. It provides a framework process for defining data
> >> quality model, executing data quality measurement, automating data
> >> profiling and validation, as well as a unified data quality
> >> visualization across multiple data systems. It tries to address the
> >> data quality challenges in big data and streaming context.
> >>
> >> Proposal
> >>
> >> Griffin is an open source Data Quality solution for distributed data
> >> systems at any scale, in both streaming and batch data contexts. When
> >> people use open source products (e.g. Apache Hadoop, Apache Spark,
> >> Apache Kafka, Apache Storm), they need a data quality service to build
> >> confidence in the quality of the data processed by those platforms.
> >> Griffin creates a unified process to define and construct data quality
> >> measurement pipelines across multiple data systems to provide:
> >>
> >> Automatic quality validation of the data
> >> Data profiling and anomaly detection
> >> Data quality lineage from upstream to downstream data systems.
> >> Data quality health monitoring visualization
> >> Shared infrastructure resource management
> >>
> >> Overview of Griffin
> >>
> >> Griffin has been deployed in production at eBay, serving major data
> >> systems. It takes a platform approach to provide generic features that
> >> solve common data quality validation pain points. First, the user
> >> registers the data asset to be checked for data quality. The data
> >> asset can be batch data in an RDBMS (e.g. Teradata) or an Apache Hadoop
> >> system, or near real-time streaming data from Apache Kafka, Apache
> >> Storm and other real-time data platforms. Second, the user creates a
> >> data quality model to define the data quality rule and metadata.
> >> Third, the model or rule is executed automatically (by the model
> >> engine) to produce sample data quality validation results within a
> >> few seconds for streaming data. Finally, the user analyzes the data
> >> quality results through the built-in visualization tool and takes
> >> action.
> >>
> >> Griffin includes:
> >>
> >> Data Quality Model Engine
> >>
> >> Griffin is a model-driven solution: the user can choose various data
> >> quality dimensions to run data quality validation against a selected
> >> target data set or source data set (as the golden reference data). A
> >> corresponding back-end library supports the following measurements:
> >>
> >> Accuracy - Does the data reflect the real-world objects or a
> >> verifiable source?
> >> Completeness - Is all necessary data present?
> >> Validity - Are all data values within the data domains specified by the
> >> business?
> >> Timeliness - Is the data available at the time needed?
> >> Anomaly detection - Pre-built algorithm functions for the
> >> identification of items, events or observations which do not conform
> >> to an expected pattern or other items in a dataset.
> >> Data Profiling - Apply statistical analysis and assessment of data
> >> values within a dataset for consistency, uniqueness and logic.
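
As a rough illustration of the first three dimensions listed above, the following minimal Python sketch computes accuracy, completeness and validity as simple ratios over in-memory records. It is not Griffin's implementation (the proposal describes Spark-based computation); the function names, the record layout and the `id` key are assumptions made for this example only.

```python
from typing import Callable, Mapping, Sequence

Record = Mapping[str, object]


def accuracy(target: Sequence[Record], source: Sequence[Record], key: str) -> float:
    """Fraction of target records equal to the source record with the same key."""
    if not target:
        return 1.0
    # Index the golden reference data by key for constant-time lookup.
    source_by_key = {rec[key]: rec for rec in source}
    matched = sum(1 for rec in target if source_by_key.get(rec[key]) == rec)
    return matched / len(target)


def completeness(records: Sequence[Record], required: Sequence[str]) -> float:
    """Fraction of records in which every required field is present and not None."""
    if not records:
        return 1.0
    complete = sum(
        1 for rec in records if all(rec.get(field) is not None for field in required)
    )
    return complete / len(records)


def validity(
    records: Sequence[Record], field: str, is_valid: Callable[[object], bool]
) -> float:
    """Fraction of records whose `field` value passes the business rule `is_valid`."""
    if not records:
        return 1.0
    valid = sum(1 for rec in records if field in rec and is_valid(rec[field]))
    return valid / len(records)


# Hypothetical golden source and a target copy with two damaged records.
source = [
    {"id": 1, "country": "US", "age": 30},
    {"id": 2, "country": "DE", "age": 41},
    {"id": 3, "country": "FR", "age": 25},
    {"id": 4, "country": "JP", "age": 52},
]
target = [
    {"id": 1, "country": "US", "age": 30},
    {"id": 2, "country": "DE", "age": None},  # missing value
    {"id": 3, "country": "FR", "age": -5},    # out-of-domain value
    {"id": 4, "country": "JP", "age": 52},
]


def age_in_domain(value: object) -> bool:
    # Assumed business rule: age is an integer between 0 and 120.
    return isinstance(value, int) and 0 <= value <= 120


print(accuracy(target, source, key="id"))
print(completeness(target, ["country", "age"]))
print(validity(target, "age", age_in_domain))
```

Comparing the target with a golden source by key mirrors the "golden reference data" idea in the quoted text; timeliness, anomaly detection and profiling would need timestamps or statistics and are left out of this sketch.
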
> >>
> >> Data Collection Layer
> >>
> >> We support two kinds of data sources: batch data and real-time data.
> >>
> >> In batch mode, we collect source data from Apache Hadoop based
> >> platforms through various data connectors.
> >>
> >> In real-time mode, we connect to a messaging system such as Kafka for
> >> near real-time analysis.
> >>
> >> Data Process and Storage Layer
> >>
> >> For batch analysis, our data quality model computes data quality
> >> metrics in our Spark cluster based on data sources in Apache Hadoop.
> >>
> >> For near real-time analysis, we consume data from the messaging system,
> >> and our data quality model then computes real-time data quality
> >> metrics in our Spark cluster. For data storage, we use a time series
> >> database in our back end to serve front-end requests.
> >>
> >> Griffin Service
> >>
> >> We have RESTful web services that expose all the functionality of
> >> Griffin, such as registering a data asset, creating a data quality
> >> model, publishing metrics, retrieving metrics and adding a
> >> subscription. Developers can therefore build their own user interfaces
> >> on top of these web services.
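
To make the list of operations concrete, here is a small in-memory Python sketch of how such a service might route requests. The paths (`/assets`, `/models`, `/models/<name>/metrics`), the `QualityService` class and the payload fields are all hypothetical; the proposal does not specify Griffin's actual REST API, and subscriptions are omitted here.

```python
import json
from typing import Dict, List, Tuple


class QualityService:
    """In-memory stand-in for the service operations named in the proposal."""

    def __init__(self) -> None:
        self.assets: Dict[str, dict] = {}
        self.models: Dict[str, dict] = {}
        self.metrics: Dict[str, List[dict]] = {}

    def handle(self, method: str, path: str, body: str = "") -> Tuple[int, str]:
        """Dispatch one request and return (HTTP-style status, JSON body)."""
        parts = [p for p in path.split("/") if p]
        payload = json.loads(body) if body else {}

        # Register a data asset.
        if method == "POST" and parts == ["assets"]:
            self.assets[payload["name"]] = payload
            return 201, json.dumps(payload)

        # Create a data quality model bound to a registered asset.
        if method == "POST" and parts == ["models"]:
            if payload["asset"] not in self.assets:
                return 404, json.dumps({"error": "unknown asset"})
            self.models[payload["name"]] = payload
            return 201, json.dumps(payload)

        # Publish or retrieve metrics for a model.
        if len(parts) == 3 and parts[0] == "models" and parts[2] == "metrics":
            model = parts[1]
            if model not in self.models:
                return 404, json.dumps({"error": "unknown model"})
            if method == "POST":
                self.metrics.setdefault(model, []).append(payload)
                return 201, json.dumps(payload)
            if method == "GET":
                return 200, json.dumps(self.metrics.get(model, []))

        return 404, json.dumps({"error": "not found"})


svc = QualityService()
svc.handle("POST", "/assets", json.dumps({"name": "user_events", "type": "kafka"}))
svc.handle(
    "POST",
    "/models",
    json.dumps({"name": "events_accuracy", "asset": "user_events", "dimension": "accuracy"}),
)
svc.handle(
    "POST",
    "/models/events_accuracy/metrics",
    json.dumps({"timestamp": 1480608000, "value": 0.97}),
)
status, body = svc.handle("GET", "/models/events_accuracy/metrics")
print(status, body)
```

A front end would only need the retrieve call to draw a chart, which is the point the quoted paragraph makes about building a custom user interface on top of the services.
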
> >>
> >> Background
> >>
> >> At eBay, when people work with big data in Apache Hadoop (or other
> >> streaming data), data quality often becomes a big challenge.
> >> Different teams have built customized data quality tools to detect and
> >> analyze data quality issues within their own domains. We want to take
> >> a platform approach that provides shared infrastructure and generic
> >> features to solve common data quality pain points. This would enable
> >> us to build trusted data assets.
> >>
> >> Currently it’s very difficult and costly to do data quality validation
> >> when big data flows across multiple platforms at eBay (e.g.
> >> Oracle, Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka,
> >> MongoDB). Take eBay's real-time personalization platform as an example:
> >> every day we have to validate the data quality status of ~600M records
> >> (imagine we have 150M active users on our website). Data quality often
> >> becomes a big challenge in both its streaming and batch pipelines.
> >>
> >> So we identified 3 data quality problems at eBay:
> >>
> >> Lack of an end-to-end, unified view of data quality measurement from
> >> multiple data sources to target applications; it usually takes a long
> >> time to identify and fix poor data quality.
> >> No way to measure data quality in streaming mode; we need a process
> >> and a tool to visualize data quality insights by registering the
> >> dataset to be checked, creating a data quality measurement model,
> >> executing the data quality validation job and getting metrics
> >> insights for taking action.
> >> No shared platform and API service; each team has to apply for and
> >> manage its own hardware and software infrastructure.
> >>
> >> Rationale
> >>
> >> The challenge we face at eBay is that our data volume keeps growing
> >> and system processes become more complex, while we do not have a
> >> unified data quality solution to ensure trusted data sets that give
> >> our data consumers confidence in data quality.
> >> The key data quality challenges include:
> >>
> >> Existing commercial data quality solutions cannot address data quality
> >> lineage among systems and cannot scale out to support fast-growing
> >> data at eBay.
> >> Existing eBay domain-specific tools take a long time to identify and
> >> fix poor data quality when data flows through multiple systems.
> >> Business logic becomes complex and requires a much more flexible data
> >> quality system.
> >>
> >> Some data quality issues do have business impact on user experiences,
> >> revenue, efficiency & compliance.
> >>
> >> Communicating data quality metrics carries overhead, typically in a
> >> big organization where different teams are involved.
> >>
> >> The idea of Griffin is to provide Data Quality validation as a
> >> Service, to allow data engineers and data consumers to have:
> >>
> >> Near real-time understanding of the data quality health of your data
> >> pipelines with end-to-end monitoring, all in one place.
> >> Profiling, detecting and correlating issues and providing
> >> recommendations that drive rapid and focused troubleshooting
> >> A centralized data quality model management system including rule,
> >> metadata, scheduler etc.
> >> Native code generation to run everywhere, including Hadoop, Kafka,
> >> Spark, etc.
> >> One set of tools to build data quality pipelines across all eBay data
> >> platforms.
> >>
> >> Current Status
> >>
> >> Meritocracy
> >>
> >> Griffin has been deployed in production at eBay and provides a
> >> centralized data quality service for several eBay systems (for
> >> example, the real-time personalization platform, the eBay real-time ID
> >> linking platform, Hadoop datasets and the site speed analytics
> >> platform). Our aim is
> >> to build a diverse developer and user community following the Apache
> >> meritocracy model. We will encourage contributions and participation
> >> of all types of work, and ensure that contributors are appropriately
> >> recognized.
> >>
> >> Community
> >>
> >> Currently the project is being developed at eBay, and its community
> >> is internal to eBay. Griffin seeks to develop its developer and user
> >> communities during incubation. We believe it will grow substantially
> >> by becoming an Apache project.
> >>
> >> Core Developers
> >>
> >> Griffin is currently being designed and developed by engineers from
> >> eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
> >> All of these core developers have deep expertise in Apache Hadoop and
> >> the Hadoop Ecosystem in general.
> >>
> >> Alignment
> >>
> >> The ASF is a natural host for Griffin given that it is already the
> >> home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
> >> emerging big data products. By nature, these products need a data
> >> quality solution to ensure the quality of the data they process. When
> >> people use open source data technology, the big question for them is
> >> how to ensure the quality of the data in it. Griffin leverages a lot of
> >> Apache open-source products. Griffin was designed to enable real-time
> >> insights into data quality validation through shared infrastructure
> >> and generic features that solve common data quality pain points.
> >>
> >> Known Risks
> >>
> >> Orphaned Products
> >>
> >> The core developers of the Griffin team work full time on this project.
> >> There is no risk of Griffin getting orphaned since at least one large
> >> company (eBay) is extensively using it in its production Hadoop and
> >> Spark clusters for multiple data systems. For example, 4 data systems
> >> at eBay (the real-time personalization platform, the eBay real-time ID
> >> linking platform, Hadoop and the site speed analytics platform)
> >> currently leverage Griffin, with ~600M records validated for data
> >> quality status every day, 35 data sets being monitored and 50+ data
> >> quality models created.
> >>
> >> As Griffin is designed to connect to many types of data sources, we are
> >> very confident that others will use Griffin as a service for ensuring
> >> data quality in open source data ecosystems. We plan to extend and
> >> diversify this community further through Apache.
> >>
> >> Inexperience with Open Source
> >>
> >> Griffin's core engineers are all active users and followers of open
> >> source projects. They are already committers and contributors to the
> >> Griffin GitHub project. All have been involved with source code that
> >> has been released under an open source license, and several of them
> >> also have experience developing code in an open source environment.
> >> Though the core set of developers does not have Apache open source
> >> experience, there are plans to onboard individuals with Apache open
> >> source experience onto the project.
> >>
> >> Homogenous Developers
> >>
> >> The core developers are from eBay. The Apache Incubation process
> >> encourages an open and diverse meritocratic community. Griffin intends
> >> to make every possible effort to build a diverse, vibrant and involved
> >> community. We are committed to recruiting additional committers from
> >> other companies based on their contribution to the project.
> >>
> >> Reliance on Salaried Developers
> >>
> >> eBay invested in Griffin as a company-wide data quality service
> >> platform, and some of its key engineers are working full time on the
> >> project. They are all paid by eBay. We look forward to other Apache
> >> developers and researchers contributing to the project.
> >>
> >> Relationships with Other Apache Products
> >>
> >> Griffin has a strong relationship with and dependency on Apache Hadoop,
> >> Apache HBase, Apache Spark, Apache Kafka, Apache Storm and Apache
> >> Hive. In addition, since there is a growing need for a data quality
> >> solution for open source platforms (e.g. Hadoop, Kafka, Spark etc.),
> >> being part of Apache’s Incubation community could help with closer
> >> collaboration among these projects as well as others.
> >>
> >> Documentation
> >>
> >> Information about Griffin can be found at
> >> https://github.com/eBay/griffin
> >>
> >> Initial Source
> >>
> >> Griffin has been under development since early 2016 by a team of
> >> engineers at eBay Inc. It is currently hosted on GitHub.com under the
> >> Apache License 2.0 at https://github.com/eBay/griffin . Once in
> >> incubation we will move the code base to an Apache git repository.
> >>
> >> External Dependencies
> >>
> >> Griffin has the following external dependencies.
> >>
> >> Basic
> >>
> >> JDK 1.7+
> >> Scala
> >> Apache Maven
> >> JUnit
> >> Log4j
> >> Slf4j
> >> Apache Commons
> >>
> >> Hadoop
> >>
> >> Apache Hadoop
> >> Apache HBase
> >> Apache Hive
> >>
> >> DB
> >>
> >> InfluxData
> >>
> >> Apache Spark
> >>
> >> Spark Core Library
> >>
> >> REST Service
> >>
> >> Jersey
> >> Spring MVC
> >>
> >> Web frontend
> >>
> >> AngularJS
> >> jQuery
> >> Bootstrap
> >> RequireJS
> >> eCharts
> >> Font Awesome
> >>
> >> Cryptography
> >>
> >> Currently there's no cryptography in Griffin.
> >>
> >> Required Resources
> >>
> >> Mailing List
> >>
> >> We currently use an eBay mailbox to communicate, but we'd like to move
> >> that to ASF-maintained mailing lists.
> >>
> >> Current mailing list: [email protected]
> >>
> >> Proposed ASF maintained lists:
> >>
> >> [email protected]
> >>
> >> [email protected]
> >>
> >> [email protected]
> >>
> >> Subversion Directory
> >>
> >> Git is the preferred source control system.
> >>
> >> Issue Tracking
> >>
> >> JIRA
> >>
> >> Other Resources
> >>
> >> The existing code already has unit tests so we will make use of
> >> existing Apache continuous testing infrastructure. The resulting load
> >> should not be very large.
> >>
> >> Initial Committers
> >>
> >> William Guo
> >> Alex Lv
> >> Vincent Zhao
> >> Shawn Sha
> >> John Liu
> >> Liang Shao
> >>
> >> Affiliations
> >>
> >> The initial committers are employees of eBay Inc.
> >>
> >> Sponsors
> >>
> >> Champion
> >>
> >> Henry Saputra ([email protected])
> >>
> >> Nominated Mentors
> >>
> >> Kasper Sørensen ([email protected])
> >>
> >> Uma Maheswara Rao Gangumalla ([email protected])
> >>
> >> Luciano Resende ([email protected])
> >>
> >> Sponsoring Entity
> >>
> >> We are requesting the Incubator to sponsor this project.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
>
