+1 (binding)

On Thu, Dec 1, 2016 at 8:47 AM, Andrew Purtell <andrew.purt...@gmail.com> wrote:
> +1 (binding)
>
> > On Dec 1, 2016, at 8:35 AM, Felix Cheung <felixche...@apache.org> wrote:
> >
> > +1
> >
> > On Wed, Nov 30, 2016 at 10:40 PM Henry Saputra <henry.sapu...@gmail.com>
> > wrote:
> >
> >> Hi All,
> >>
> >> As the champion for Griffin, I would like to start a VOTE to bring the
> >> project in as an Apache Incubator podling.
> >>
> >> Here is the direct quote from the abstract:
> >>
> >> "
> >> Griffin is a Data Quality Service platform built on Apache Hadoop and
> >> Apache Spark. It provides a framework process for defining data
> >> quality model, executing data quality measurement, automating data
> >> profiling and validation, as well as a unified data quality
> >> visualization across multiple data systems. It tries to address the
> >> data quality challenges in big data and streaming context.
> >> "
> >>
> >> Please cast your vote:
> >>
> >> [ ] +1, bring Griffin into Incubator
> >> [ ] +0, I don't care either way,
> >> [ ] -1, do not bring Griffin into Incubator, because...
> >>
> >> This vote will be open for at least 72 hours, and only votes from the
> >> Incubator PMC are binding.
> >>
> >> The VOTE will end on 12/5 at 9am PST to allow for the weekend.
> >>
> >> Here is the link to the proposal:
> >>
> >> https://wiki.apache.org/incubator/GriffinProposal
> >>
> >> I have copied the proposal below for easy access.
> >>
> >> Thanks,
> >>
> >> - Henry
> >>
> >> Griffin Proposal
> >>
> >> Abstract
> >>
> >> Griffin is a Data Quality Service platform built on Apache Hadoop and
> >> Apache Spark. It provides a framework process for defining data
> >> quality model, executing data quality measurement, automating data
> >> profiling and validation, as well as a unified data quality
> >> visualization across multiple data systems. It tries to address the
> >> data quality challenges in big data and streaming context.
> >>
> >> Proposal
> >>
> >> Griffin is an open source Data Quality solution for distributed data
> >> systems at any scale, in both streaming and batch data contexts. When
> >> people use open source products (e.g. Apache Hadoop, Apache Spark,
> >> Apache Kafka, Apache Storm), they need a data quality service to
> >> build confidence in the quality of the data processed by those
> >> platforms. Griffin creates a unified process to define and construct
> >> data quality measurement pipelines across multiple data systems,
> >> providing:
> >>
> >> Automatic quality validation of the data
> >> Data profiling and anomaly detection
> >> Data quality lineage from upstream to downstream data systems
> >> Data quality health monitoring visualization
> >> Shared infrastructure resource management
> >>
> >> Overview of Griffin
> >>
> >> Griffin has been deployed in production at eBay, serving major data
> >> systems. It takes a platform approach to provide generic features
> >> that solve common data quality validation pain points. First, the
> >> user registers the data asset to be checked for data quality. The
> >> data asset can be batch data in an RDBMS (e.g. Teradata) or an Apache
> >> Hadoop system, or near real-time streaming data from Apache Kafka,
> >> Apache Storm and other real-time data platforms. Second, the user
> >> creates a data quality model to define the data quality rules and
> >> metadata. Third, the model or rule is executed automatically (by the
> >> model engine) to produce sample data quality validation results
> >> within a few seconds for streaming data. Finally, the user can
> >> analyze the data quality results through the built-in visualization
> >> tool and take action.
> >>
> >> Griffin includes:
> >>
> >> Data Quality Model Engine
> >>
> >> Griffin is a model-driven solution: the user can choose various data
> >> quality dimensions to run data quality validation against a selected
> >> target data set or source data set (as the golden reference data). A
> >> corresponding back-end library supports the following measurements:
> >>
> >> Accuracy - Does the data reflect real-world objects or a verifiable
> >> source?
> >> Completeness - Is all necessary data present?
> >> Validity - Are all data values within the data domains specified by
> >> the business?
> >> Timeliness - Is the data available at the time needed?
> >> Anomaly detection - Pre-built algorithm functions for the
> >> identification of items, events or observations which do not conform
> >> to an expected pattern or other items in a dataset.
> >> Data Profiling - Apply statistical analysis and assessment of data
> >> values within a dataset for consistency, uniqueness and logic.
> >>
> >> Data Collection Layer
> >>
> >> We support two kinds of data sources: batch data and real-time data.
> >>
> >> In batch mode, we collect source data from Apache Hadoop based
> >> platforms through various data connectors.
> >>
> >> In real-time mode, we connect to a messaging system such as Kafka for
> >> near real-time analysis.
> >>
> >> Data Process and Storage Layer
> >>
> >> For batch analysis, our data quality model computes data quality
> >> metrics in our Spark cluster based on the data source in Apache
> >> Hadoop.
> >>
> >> For near real-time analysis, we consume data from the messaging
> >> system, and our data quality model then computes real-time data
> >> quality metrics in our Spark cluster. For data storage, we use a time
> >> series database in our back end to serve front-end requests.
> >>
> >> Griffin Service
> >>
> >> We have RESTful web services that expose all the functionality of
> >> Griffin, such as registering data assets, creating data quality
> >> models, publishing metrics, retrieving metrics, adding subscriptions,
> >> etc. Developers can therefore build their own user interfaces on top
> >> of these web services.
> >>
> >> Background
> >>
> >> At eBay, when people work with big data in Apache Hadoop (or other
> >> streaming data), data quality often becomes a big challenge.
> >> Different teams have built customized data quality tools to detect and
> >> analyze data quality issues within their own domains. We want to take
> >> a platform approach, providing shared infrastructure and generic
> >> features to solve common data quality pain points. This would enable
> >> us to build trusted data assets.
> >>
> >> Currently it is very difficult and costly to validate data quality
> >> when big data flows across multiple platforms at eBay (e.g. Oracle,
> >> Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka, MongoDB).
> >> Take eBay's real-time personalization platform as an example: every
> >> day we have to validate the data quality status of ~600M records
> >> (imagine we have 150M active users on our website). Data quality
> >> often becomes a big challenge in both its streaming and batch
> >> pipelines.
> >>
> >> We have identified three data quality problems at eBay:
> >>
> >> Lack of an end-to-end unified view of data quality measurement from
> >> multiple data sources to target applications; it usually takes a long
> >> time to identify and fix poor data quality.
> >> No way to measure data quality in streaming mode. We need a process
> >> and tool to visualize data quality insights by registering the
> >> dataset to be checked, creating a data quality measurement model,
> >> executing the data quality validation job and getting metrics
> >> insights for taking action.
> >> No shared platform and API service; each team has to apply for and
> >> manage its own hardware and software infrastructure.
> >>
> >> Rationale
> >>
> >> The challenge we face at eBay is that our data volume keeps growing
> >> and system processes are becoming more complex, while we do not have
> >> a unified data quality solution to ensure trusted data sets that give
> >> our data consumers confidence in data quality. The key data quality
> >> challenges include:
> >>
> >> Existing commercial data quality solutions cannot address data
> >> quality lineage among systems and cannot scale out to support
> >> fast-growing data at eBay.
> >> Existing eBay domain-specific tools take a long time to identify and
> >> fix poor data quality when data flows through multiple systems.
> >> Business logic is becoming complex and requires a much more flexible
> >> data quality system.
> >>
> >> Some data quality issues do have business impact on user experience,
> >> revenue, efficiency and compliance.
> >>
> >> Communicating data quality metrics carries overhead, typically in a
> >> big organization where different teams are involved.
> >>
> >> The idea of Griffin is to provide Data Quality validation as a
> >> Service, to allow data engineers and data consumers to have:
> >>
> >> Near real-time understanding of the data quality health of their data
> >> pipelines with end-to-end monitoring, all in one place.
> >> Profiling, detecting and correlating issues and providing
> >> recommendations that drive rapid and focused troubleshooting.
> >> A centralized data quality model management system including rule,
> >> metadata, scheduler etc.
> >> Native code generation to run everywhere, including Hadoop, Kafka,
> >> Spark, etc.
> >> One set of tools to build data quality pipelines across all eBay data
> >> platforms.
> >>
> >> Current Status
> >>
> >> Meritocracy
> >>
> >> Griffin has been deployed in production at eBay and provides a
> >> centralized data quality service for several eBay systems (for
> >> example, the real-time personalization platform, the eBay real-time
> >> ID linking platform, Hadoop datasets, and the Site speed analytics
> >> platform). Our aim is to build a diverse developer and user community
> >> following the Apache meritocracy model. We will encourage
> >> contributions and participation of all types, and ensure that
> >> contributors are appropriately recognized.
> >>
> >> Community
> >>
> >> Currently the project is being developed at eBay and is used only by
> >> the eBay internal community. Griffin seeks to develop its developer
> >> and user communities during incubation. We believe it will grow
> >> substantially by becoming an Apache project.
> >>
> >> Core Developers
> >>
> >> Griffin is currently being designed and developed by engineers from
> >> eBay Inc.: William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
> >> All of these core developers have deep expertise in Apache Hadoop and
> >> the Hadoop Ecosystem in general.
> >>
> >> Alignment
> >>
> >> The ASF is a natural host for Griffin given that it is already the
> >> home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
> >> emerging big data products. These products by nature require a data
> >> quality solution to ensure the quality of the data they process. When
> >> people use open source data technology, the big question for them is
> >> how to ensure the quality of the data in it. Griffin leverages a lot
> >> of Apache open-source products. Griffin was designed to enable
> >> real-time insights into data quality validation through shared
> >> infrastructure and generic features that solve common data quality
> >> pain points.
> >>
> >> Known Risks
> >>
> >> Orphaned Products
> >>
> >> The core developers of the Griffin team work full time on this project.
> >> There is no risk of Griffin getting orphaned since at least one large
> >> company (eBay) is extensively using it in its production Hadoop and
> >> Spark clusters for multiple data systems. For example, four data
> >> systems at eBay (the real-time personalization platform, the eBay
> >> real-time ID linking platform, Hadoop, and the Site speed analytics
> >> platform) currently leverage Griffin, with more than ~600M records
> >> validated for data quality status every day, 35 data sets being
> >> monitored and 50+ data quality models created.
> >>
> >> As Griffin is designed to connect to many types of data sources, we
> >> are very confident that users of those data sources will adopt
> >> Griffin as a service for ensuring data quality in open source data
> >> ecosystems. We plan to extend and diversify this community further
> >> through Apache.
> >>
> >> Inexperience with Open Source
> >>
> >> Griffin's core engineers are all active users and followers of open
> >> source projects. They are already committers and contributors to the
> >> Griffin GitHub project. All have been involved with source code that
> >> has been released under an open source license, and several of them
> >> also have experience developing code in an open source environment.
> >> Though the core developers do not have Apache open source experience,
> >> there are plans to onboard individuals with Apache open source
> >> experience onto the project.
> >>
> >> Homogenous Developers
> >>
> >> The core developers are from eBay. The Apache Incubation process
> >> encourages an open and diverse meritocratic community. Griffin intends
> >> to make every possible effort to build a diverse, vibrant and involved
> >> community. We are committed to recruiting additional committers from
> >> other companies based on their contributions to the project.
> >>
> >> Reliance on Salaried Developers
> >>
> >> eBay invested in Griffin as a company-wide data quality service
> >> platform, and some of its key engineers are working full time on the
> >> project. They are all paid by eBay. We look forward to other Apache
> >> developers and researchers contributing to the project.
> >>
> >> Relationships with Other Apache Products
> >>
> >> Griffin has a strong relationship with and dependency on Apache
> >> Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and
> >> Apache Hive. In addition, since there is a growing need for data
> >> quality solutions for open source platforms (e.g. Hadoop, Kafka,
> >> Spark etc), being part of Apache's Incubation community could help
> >> with closer collaboration among these projects as well as others.
> >>
> >> Documentation
> >>
> >> Information about Griffin can be found at
> >> https://github.com/eBay/griffin
> >>
> >> Initial Source
> >>
> >> Griffin has been under development since early 2016 by a team of
> >> engineers at eBay Inc. It is currently hosted on GitHub.com under the
> >> Apache License 2.0 at https://github.com/eBay/griffin . Once in
> >> incubation we will move the code base to the Apache Git repository.
> >>
> >> External Dependencies
> >>
> >> Griffin has the following external dependencies.
> >>
> >> Basic
> >>
> >> JDK 1.7+
> >> Scala
> >> Apache Maven
> >> JUnit
> >> Log4j
> >> Slf4j
> >> Apache Commons
> >>
> >> Hadoop
> >>
> >> Apache Hadoop
> >> Apache HBase
> >> Apache Hive
> >>
> >> DB
> >>
> >> InfluxData
> >>
> >> Apache Spark
> >>
> >> Spark Core Library
> >>
> >> REST Service
> >>
> >> Jersey
> >> Spring MVC
> >>
> >> Web frontend
> >>
> >> AngularJS
> >> jQuery
> >> Bootstrap
> >> RequireJS
> >> eCharts
> >> Font Awesome
> >>
> >> Cryptography
> >>
> >> Currently there's no cryptography in Griffin.
> >>
> >> Required Resources
> >>
> >> Mailing List
> >>
> >> We currently use an eBay mailbox to communicate, but we'd like to
> >> move that to ASF maintained mailing lists.
> >>
> >> Current mailing list: ebay-griffin-d...@googlegroups.com
> >>
> >> Proposed ASF maintained lists:
> >>
> >> priv...@griffin.incubator.apache.org
> >>
> >> d...@griffin.incubator.apache.org
> >>
> >> comm...@griffin.incubator.apache.org
> >>
> >> Subversion Directory
> >>
> >> Git is the preferred source control system.
> >>
> >> Issue Tracking
> >>
> >> JIRA
> >>
> >> Other Resources
> >>
> >> The existing code already has unit tests, so we will make use of the
> >> existing Apache continuous testing infrastructure. The resulting load
> >> should not be very large.
> >>
> >> Initial Committers
> >>
> >> William Guo
> >> Alex Lv
> >> Vincent Zhao
> >> Shawn Sha
> >> John Liu
> >> Liang Shao
> >>
> >> Affiliations
> >>
> >> The initial committers are employees of eBay Inc.
> >>
> >> Sponsors
> >>
> >> Champion
> >>
> >> Henry Saputra (hsapu...@apache.org)
> >>
> >> Nominated Mentors
> >>
> >> Kasper Sørensen (kasper...@apache.org)
> >>
> >> Uma Maheswara Rao Gangumalla (umamah...@apache.org)
> >>
> >> Luciano Resende (luckbr1...@gmail.com)
> >>
> >> Sponsoring Entity
> >>
> >> We are requesting the Incubator to sponsor this project.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> >> For additional commands, e-mail: general-h...@incubator.apache.org