+1 for moving forward with a VOTE.

> On Oct 22, 2015, at 7:26 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
> 
> Looks like the discussion has calmed down, so unless there are more
> comments we will send the VOTE thread tomorrow.
> 
> Thanks all for the feedback.
> 
> - Henry
> 
>> On Mon, Oct 19, 2015 at 8:33 AM, Manoharan, Arun <armanoha...@ebay.com> 
>> wrote:
>> Hello Everyone,
>> 
>> My name is Arun Manoharan. I am currently a product manager on the 
>> Analytics Platform team at eBay Inc.
>> 
>> I would like to start a discussion on Eagle joining the ASF as an 
>> incubation project.
>> 
>> Eagle is a monitoring solution for Hadoop that instantly identifies access 
>> to sensitive data, recognizes attacks and malicious activities, and takes 
>> action in real time. Eagle supports a wide variety of policies on HDFS data 
>> and Hive. Eagle also provides machine learning models for detecting 
>> anomalous user behavior in Hadoop.
>> 
>> The proposal is available on the wiki here:
>> https://wiki.apache.org/incubator/EagleProposal
>> 
>> The text of the proposal is also available at the end of this email.
>> 
>> Thanks for your time and help.
>> 
>> Thanks,
>> Arun
>> 
>> <COPY of the proposal in text format>
>> 
>> Eagle
>> 
>> Abstract
>> Eagle is an open source monitoring solution for Hadoop that instantly 
>> identifies access to sensitive data, recognizes attacks and malicious 
>> activities in Hadoop, and takes action.
>> 
>> Proposal
>> Eagle audits access to HDFS files and to Hive and HBase tables in real 
>> time, enforces policies defined for sensitive data access, and alerts on or 
>> blocks a user's access to that sensitive data in real time. Eagle also 
>> builds user profiles based on typical access behaviour for HDFS and Hive 
>> and sends alerts when anomalous behaviour is detected. Eagle can also 
>> import sensitive data classifications from external classification engines 
>> to help define its policies.
>> 
>> Overview of Eagle
>> Eagle has three main parts:
>> 1. Data collection and storage - Eagle collects data from various Hadoop 
>> logs in real time using Kafka and the YARN API, and uses HDFS and HBase for 
>> storage.
>> 2. Data processing and policy engine - Eagle allows users to create 
>> policies based on various metadata properties of HDFS, Hive and HBase data.
>> 3. Eagle services - Eagle services include the policy manager, the query 
>> service and the visualization component. Eagle provides an intuitive user 
>> interface to administer Eagle and an alert dashboard to respond to 
>> real-time alerts.
>> 
>> Data Collection and Storage:
>> Eagle provides a programming API for integrating any data source into the 
>> Eagle policy evaluation framework. For example, Eagle HDFS audit monitoring 
>> collects data from Kafka, which is populated by a NameNode log4j appender 
>> or by a Logstash agent. Eagle Hive monitoring collects Hive query logs from 
>> running jobs through the YARN API in a scalable and fault-tolerant way. 
>> Eagle uses HBase to store metadata and metric data, and can also use a 
>> relational database through a configuration change.
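>>
>> To make the shape of the collected data concrete, below is a minimal Java 
>> sketch that parses one NameNode audit log line (the standard key=value 
>> format emitted by the log4j appender) into a simple event map. The 
>> AuditLogParserSketch class and the sample line are illustrative 
>> assumptions, not Eagle's actual collector code.
>>
>>     import java.util.HashMap;
>>     import java.util.Map;
>>
>>     // Parse one HDFS audit log entry (key=value pairs) into an event map.
>>     // Field names (ugi, cmd, src, ...) follow the stock NameNode audit
>>     // log layout; this is a sketch, not Eagle's actual parser.
>>     public class AuditLogParserSketch {
>>         public static Map<String, String> parse(String line) {
>>             Map<String, String> event = new HashMap<String, String>();
>>             for (String token : line.split("\\s+")) {
>>                 int eq = token.indexOf('=');
>>                 if (eq > 0) {
>>                     event.put(token.substring(0, eq), token.substring(eq + 1));
>>                 }
>>             }
>>             return event;
>>         }
>>
>>         public static void main(String[] args) {
>>             String sample = "allowed=true ugi=bob ip=/10.0.0.1 cmd=open "
>>                     + "src=/data/pii/users.csv dst=null perm=null";
>>             System.out.println(parse(sample));
>>         }
>>     }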
>> 
>> Data Processing and Policy Engine:
>> Processing Engine: Eagle provides a stream processing API which is an 
>> abstraction over Apache Storm and can also be extended to other streaming 
>> engines. This abstraction allows developers to assemble data 
>> transformations, filtering, external data joins, etc. without being bound 
>> to a specific streaming platform. The Eagle streaming API lets developers 
>> easily integrate business logic with the Eagle policy engine; internally, 
>> the Eagle framework compiles the business-logic execution DAG into the 
>> program primitives of the underlying streaming infrastructure, e.g. Apache 
>> Storm. For example, Eagle HDFS monitoring transforms NameNode audit log 
>> entries into objects and joins them with sensitivity metadata and security 
>> zone metadata, which are generated by external programs or configured by 
>> the user; a sketch of such a topology follows below. Eagle Hive monitoring 
>> filters running jobs to obtain Hive query strings, parses each query string 
>> into an object, and then joins it with sensitivity metadata.
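>>
>> For illustration, the sketch below wires such a DAG directly as an Apache 
>> Storm topology: a spout standing in for the Kafka audit-log source feeds a 
>> bolt that turns raw lines into typed fields, onto which further join and 
>> policy-evaluation bolts would be attached. The class and field names are 
>> hypothetical, not Eagle's actual primitives, and the package names assume 
>> Storm 0.9/0.10 (backtype.storm).
>>
>>     import java.util.Map;
>>     import backtype.storm.spout.SpoutOutputCollector;
>>     import backtype.storm.task.TopologyContext;
>>     import backtype.storm.topology.BasicOutputCollector;
>>     import backtype.storm.topology.OutputFieldsDeclarer;
>>     import backtype.storm.topology.TopologyBuilder;
>>     import backtype.storm.topology.base.BaseBasicBolt;
>>     import backtype.storm.topology.base.BaseRichSpout;
>>     import backtype.storm.tuple.Fields;
>>     import backtype.storm.tuple.Tuple;
>>     import backtype.storm.tuple.Values;
>>     import backtype.storm.utils.Utils;
>>
>>     // Hypothetical spout standing in for the Kafka audit-log source.
>>     class AuditLogSpout extends BaseRichSpout {
>>         private SpoutOutputCollector collector;
>>         public void open(Map conf, TopologyContext ctx,
>>                          SpoutOutputCollector collector) {
>>             this.collector = collector;
>>         }
>>         public void nextTuple() {
>>             Utils.sleep(1000);
>>             collector.emit(new Values(
>>                     "ugi=bob cmd=open src=/data/pii/users.csv"));
>>         }
>>         public void declareOutputFields(OutputFieldsDeclarer d) {
>>             d.declare(new Fields("line"));
>>         }
>>     }
>>
>>     // Hypothetical bolt: turns a raw line into typed fields for joins.
>>     class ParseAuditBolt extends BaseBasicBolt {
>>         public void execute(Tuple input, BasicOutputCollector collector) {
>>             String line = input.getStringByField("line");
>>             // ...parse key=value pairs here (see the collector sketch)...
>>             collector.emit(new Values("bob", "open", "/data/pii/users.csv"));
>>         }
>>         public void declareOutputFields(OutputFieldsDeclarer d) {
>>             d.declare(new Fields("user", "cmd", "src"));
>>         }
>>     }
>>
>>     // Wiring of the DAG the streaming abstraction would compile down to.
>>     public class HdfsMonitoringTopologySketch {
>>         public static void main(String[] args) {
>>             TopologyBuilder builder = new TopologyBuilder();
>>             builder.setSpout("auditLog", new AuditLogSpout());
>>             builder.setBolt("parse", new ParseAuditBolt(), 2)
>>                    .shuffleGrouping("auditLog");
>>             // Sensitivity-join and policy-evaluation bolts would attach
>>             // here, e.g. .fieldsGrouping("parse", new Fields("src")).
>>             builder.createTopology(); // submit via StormSubmitter/LocalCluster
>>         }
>>     }
>>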
>> Alerting Framework: The Eagle alert framework includes a stream metadata 
>> API, a scalable policy engine framework and an extensible policy engine 
>> framework. The stream metadata API allows developers to declare the event 
>> schema: which attributes constitute an event, the type of each attribute, 
>> and how to dynamically resolve attribute values at runtime when a user 
>> configures a policy. The scalable policy engine framework allows policies 
>> to be executed on different physical nodes in parallel, and lets developers 
>> define their own policy partitioner class. The policy engine framework, 
>> together with the stream partitioning capability provided by the underlying 
>> streaming platform, ensures that policies and events are evaluated in a 
>> fully distributed way. The extensible policy engine framework allows 
>> developers to plug in a new policy engine with a few lines of code. The 
>> WSO2 Siddhi CEP engine is the policy engine that Eagle supports as a 
>> first-class citizen; a small example follows below.
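>>
>> To make this concrete, here is a minimal, self-contained sketch (assuming 
>> the Siddhi 3.x Java API) that declares a stream schema and registers a 
>> filter-style policy which fires when a sensitive file is opened. The stream 
>> name, attributes and the 'PII' tag are illustrative assumptions, not 
>> Eagle's actual schema.
>>
>>     import org.wso2.siddhi.core.ExecutionPlanRuntime;
>>     import org.wso2.siddhi.core.SiddhiManager;
>>     import org.wso2.siddhi.core.event.Event;
>>     import org.wso2.siddhi.core.query.output.callback.QueryCallback;
>>     import org.wso2.siddhi.core.stream.input.InputHandler;
>>
>>     public class SiddhiPolicySketch {
>>         public static void main(String[] args) throws InterruptedException {
>>             SiddhiManager manager = new SiddhiManager();
>>             // Stream schema plus one filter-style policy.
>>             String plan =
>>                   "define stream hdfsAuditStream (user string, cmd string, "
>>                 + "src string, sensitivityType string); "
>>                 + "@info(name = 'sensitiveOpenPolicy') "
>>                 + "from hdfsAuditStream[cmd == 'open' and "
>>                 + "sensitivityType == 'PII'] "
>>                 + "select user, src insert into alertStream;";
>>             ExecutionPlanRuntime runtime =
>>                     manager.createExecutionPlanRuntime(plan);
>>             runtime.addCallback("sensitiveOpenPolicy", new QueryCallback() {
>>                 public void receive(long ts, Event[] inEvents,
>>                                     Event[] removeEvents) {
>>                     // In Eagle, this is where an alert would be raised.
>>                     System.out.println("ALERT: " + inEvents[0]);
>>                 }
>>             });
>>             InputHandler in = runtime.getInputHandler("hdfsAuditStream");
>>             runtime.start();
>>             in.send(new Object[]{"bob", "open", "/data/pii/users.csv", "PII"});
>>             Thread.sleep(500);
>>             runtime.shutdown();
>>         }
>>     }
>>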
>> Machine Learning module: Eagle provides capabilities to define user 
>> activity patterns, or user profiles, for Hadoop users based on their 
>> behaviour in the platform. These user profiles are modeled using machine 
>> learning algorithms and used to detect anomalous user activities. Eagle 
>> uses eigenvalue decomposition and density estimation algorithms to generate 
>> user profile models. The modeling pipeline reads HDFS audit logs, 
>> preprocesses and aggregates the data, and generates models using the Spark 
>> programming APIs. Once the models are generated, Eagle uses its stream 
>> processing engine for near-real-time anomaly detection to determine whether 
>> a user's activities are suspicious; a rough sketch of the decomposition 
>> step follows below.
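>>
>> As a rough, self-contained illustration of the eigenvalue-decomposition 
>> idea (using Apache Commons Math3, which is already in the dependency list, 
>> rather than Spark), the sketch below builds a covariance matrix over 
>> per-user feature vectors - e.g. counts of each operation type per time 
>> window - and extracts its eigenvalues and eigenvectors. The feature values 
>> are made up, and this is not Eagle's actual modeling code.
>>
>>     import java.util.Arrays;
>>     import org.apache.commons.math3.linear.EigenDecomposition;
>>     import org.apache.commons.math3.linear.RealMatrix;
>>     import org.apache.commons.math3.stat.correlation.Covariance;
>>
>>     public class UserProfileEvdSketch {
>>         public static void main(String[] args) {
>>             // Rows: observation windows for one user; columns: feature
>>             // counts (e.g. #open, #delete, #getfileinfo per window).
>>             double[][] features = {
>>                 {120, 2, 340},
>>                 {115, 1, 360},
>>                 {130, 3, 355},
>>                 {900, 40, 50},   // an unusual window
>>             };
>>             RealMatrix cov = new Covariance(features).getCovarianceMatrix();
>>             EigenDecomposition evd = new EigenDecomposition(cov);
>>             // The dominant eigenvectors capture the directions of typical
>>             // behaviour; activity that projects far outside them (i.e. has
>>             // low estimated density) would be flagged as anomalous.
>>             System.out.println(Arrays.toString(evd.getRealEigenvalues()));
>>             System.out.println(Arrays.toString(evd.getEigenvector(0).toArray()));
>>         }
>>     }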
>> 
>> Eagle Services:
>> Query Service: Eagle provides a SQL-like service API that supports 
>> comprehensive computation over huge data sets on the fly, e.g. filtering, 
>> aggregation, histograms, sorting, top-N, arithmetic expressions, 
>> pagination, etc. HBase is the data store that Eagle supports as a 
>> first-class citizen; relational databases are supported as well. For HBase 
>> storage, the Eagle query framework compiles the user-provided SQL-like 
>> query into native HBase filter objects and executes it through an HBase 
>> coprocessor on the fly, as sketched below.
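>>
>> The sketch below shows roughly how a predicate such as "user = 'bob' AND 
>> cmd = 'open'" could compile to native HBase filters. The table layout, the 
>> column family "f" and the qualifier names are hypothetical, not Eagle's 
>> actual storage schema; the aggregation part would additionally run inside 
>> an HBase coprocessor.
>>
>>     import org.apache.hadoop.hbase.client.Scan;
>>     import org.apache.hadoop.hbase.filter.CompareFilter;
>>     import org.apache.hadoop.hbase.filter.FilterList;
>>     import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
>>     import org.apache.hadoop.hbase.util.Bytes;
>>
>>     public class QueryToHBaseFilterSketch {
>>         // Roughly what "... where user = 'bob' and cmd = 'open'" becomes.
>>         public static Scan compile() {
>>             FilterList filters =
>>                     new FilterList(FilterList.Operator.MUST_PASS_ALL);
>>             filters.addFilter(new SingleColumnValueFilter(
>>                     Bytes.toBytes("f"), Bytes.toBytes("user"),
>>                     CompareFilter.CompareOp.EQUAL, Bytes.toBytes("bob")));
>>             filters.addFilter(new SingleColumnValueFilter(
>>                     Bytes.toBytes("f"), Bytes.toBytes("cmd"),
>>                     CompareFilter.CompareOp.EQUAL, Bytes.toBytes("open")));
>>             Scan scan = new Scan();
>>             scan.setFilter(filters);
>>             scan.setCaching(1000); // batch rows per RPC
>>             return scan;
>>         }
>>
>>         public static void main(String[] args) {
>>             System.out.println(compile());
>>         }
>>     }
>>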
>> Policy Manager: The Eagle policy manager provides a UI and a RESTful API 
>> for users to define policies with just a few clicks. It includes a site 
>> management UI, a policy editor, sensitivity metadata import, HDFS and Hive 
>> sensitive resource browsing, alert dashboards, etc.
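>>
>> To give a flavor of defining a policy programmatically, here is a 
>> hypothetical JAX-RS 2.0 client call that POSTs a policy definition to a 
>> REST endpoint. The URL, the JSON fields and the embedded Siddhi-style 
>> expression are assumptions for illustration, not Eagle's documented API.
>>
>>     import javax.ws.rs.client.Client;
>>     import javax.ws.rs.client.ClientBuilder;
>>     import javax.ws.rs.client.Entity;
>>     import javax.ws.rs.core.MediaType;
>>     import javax.ws.rs.core.Response;
>>
>>     public class PolicyRestClientSketch {
>>         public static void main(String[] args) {
>>             // Hypothetical endpoint and payload shape; consult the Eagle
>>             // documentation for the real policy service API.
>>             String policyJson = "{"
>>                 + "\"name\": \"sensitiveOpenPolicy\","
>>                 + "\"type\": \"siddhiCEPEngine\","
>>                 + "\"expression\": \"from hdfsAuditStream[cmd == 'open' and "
>>                 + "sensitivityType == 'PII'] select * insert into alertStream;\""
>>                 + "}";
>>             Client client = ClientBuilder.newClient();
>>             Response response = client
>>                     .target("http://eagle-service.example.com:9099/rest/policies")
>>                     .request(MediaType.APPLICATION_JSON)
>>                     .post(Entity.json(policyJson));
>>             System.out.println("HTTP " + response.getStatus());
>>             client.close();
>>         }
>>     }
>>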
>> Background
>> Data is one of the most important assets for today's businesses, which 
>> makes data security one of the top priorities for today's enterprises. In 
>> most modern enterprises, Hadoop is widely used across different verticals 
>> as the big data repository that stores this data.
>> At eBay we use the Hadoop platform extensively for our data processing 
>> needs. Our data in Hadoop keeps growing as our user base grows 
>> exponentially. Today there is a variety of data sets available in our 
>> Hadoop clusters for users to consume. eBay has around 120 PB of data stored 
>> in HDFS across 6 different clusters and around 1,800 active Hadoop users 
>> consuming data through Hive, HBase and MapReduce jobs every day to build 
>> applications using this data. With this astronomical growth of data also 
>> come challenges in securing sensitive data and monitoring access to it. 
>> Today, in large organizations, HDFS is the de facto standard for storing 
>> big data. Data sets including, but not limited to, consumer sentiment, 
>> social media data, customer segmentation, web clicks, sensor data, 
>> geo-location and transaction data are stored in Hadoop for day-to-day 
>> business needs.
>> We at eBay want to make sure that sensitive data and data platforms are 
>> completely protected from security breaches, so we partnered very closely 
>> with our Information Security team to understand the requirements for Eagle 
>> to monitor sensitive data access on Hadoop:
>> 1. Ability to identify and stop security threats in real time
>> 2. Scale for big data (support PB scale and billions of events)
>> 3. Ability to create data access policies
>> 4. Support for multiple data sources like HDFS, HBase and Hive
>> 5. Visualization of alerts in real time
>> 6. Ability to block malicious access in real time
>> We did not find any data access monitoring solution available today that 
>> can provide the features and functionality we need to monitor data access 
>> in the Hadoop ecosystem at our scale. Hence, with an excellent team of 
>> world-class developers and several users, we have been able to bring Eagle 
>> into production as well as open source it.
>> 
>> Rationale
>> In today's world, data is an important asset for any company. Businesses 
>> use data extensively to create amazing experiences for users. Data has to 
>> be protected, and access to it has to be secured against breaches. Today 
>> Hadoop is used not only to store logs but also financial data, sensitive 
>> data sets, geographical data, user click stream data sets, etc., which 
>> makes protecting it from security breaches even more important. Securing a 
>> data platform requires several things. One is a strong access control 
>> mechanism, which today is provided by Apache Ranger and Apache Sentry; 
>> these tools provide fine-grained access control to data sets on Hadoop. But 
>> there is a big gap in monitoring all data access events and activities in 
>> order to secure the Hadoop data platform. With strong access control, 
>> perimeter security and data access monitoring in place, data in Hadoop 
>> clusters can be secured against breaches. We looked around and found the 
>> following:
>> Existing data activity monitoring products are designed for traditional 
>> databases and data warehouses, and existing monitoring platforms cannot 
>> scale out to support fast-growing, petabyte-scale data. The few products in 
>> the industry that do address Hadoop are still very early in supporting 
>> HDFS, Hive and HBase data access monitoring.
>> As mentioned in the background, the business requirement and the urgency to 
>> secure data from users with malicious intent drove eBay to invest in 
>> building a real-time data access monitoring solution from scratch that 
>> offers real-time alerts and remediation features for malicious data access.
>> With the power of open source distributed systems like Hadoop, Kafka and 
>> more, we were able to develop a data activity monitoring system that can 
>> scale and can identify and stop malicious access in real time.
>> Eagle allows admins to create standard access policies and rules for 
>> monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box 
>> machine learning models that build user profiles from user access behaviour 
>> and uses those models to alert on anomalies.
>> 
>> Current Status
>> 
>> Meritocracy
>> Eagle has been deployed in production at eBay for monitoring billions of 
>> events per day from HDFS and Hive operations. From the start, the product 
>> has been built with high scalability and application extensibility in mind, 
>> and Eagle has demonstrated strong performance in responding to suspicious 
>> events instantly and great flexibility in defining policies.
>> 
>> Community
>> Eagle seeks to grow its developer and user communities during incubation.
>> 
>> Core Developers
>> Eagle is currently being designed and developed by engineers from eBay Inc. 
>> – Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen 
>> Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of these core 
>> developers have deep expertise in developing monitoring products for the 
>> Hadoop ecosystem.
>> 
>> Alignment
>> The ASF is a natural host for Eagle given that it is already the home of 
>> Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data 
>> projects. Eagle leverages many Apache open source products. Eagle was 
>> designed to offer real-time insight into sensitive data access by actively 
>> monitoring data access across various data sets in Hadoop, together with an 
>> extensible alerting framework and a powerful policy engine. Eagle 
>> complements the existing Hadoop platform area by providing a comprehensive 
>> monitoring and alerting solution for detecting sensitive data access 
>> threats based on preset policies and machine learning models for user 
>> behaviour analysis.
>> 
>> Known Risks
>> 
>> Orphaned Products
>> The core developers of the Eagle team work full time on this project. There 
>> is no risk of Eagle becoming orphaned, since eBay is using it extensively 
>> in its production Hadoop clusters and has plans to go beyond Hadoop. For 
>> example, there are currently 7 Hadoop clusters at eBay, and 2 of them are 
>> being monitored using Eagle in production. We have plans to extend it to 
>> all Hadoop clusters and eventually to other data platforms. Tens of 
>> policies are onboarded and actively monitored, with plans to onboard more 
>> use cases. We are very confident that every Hadoop cluster in the world 
>> will be monitored using Eagle to help secure the Hadoop ecosystem by 
>> actively monitoring access to sensitive data. We plan to extend and 
>> diversify this community further through Apache. We presented Eagle at 
>> Hadoop Summit in China and garnered interest from various companies that 
>> use Hadoop extensively.
>> 
>> Inexperience with Open Source
>> The core developers are all active users and followers of open source. They 
>> are already committers and contributors to the Eagle GitHub project. All 
>> have been involved with source code released under an open source license, 
>> and several of them also have experience developing code in an open source 
>> environment. Though the core set of developers does not have Apache open 
>> source experience, there are plans to bring individuals with Apache open 
>> source experience onto the project. Apache Kylin PMC members are also part 
>> of the same eBay organization. We work very closely with Apache Ranger 
>> committers and look forward to finding meaningful integrations to improve 
>> the security of the Hadoop platform.
>> 
>> Homogenous Developers
>> The core developers are from eBay. Today, monitoring data activities to 
>> find and stop threats is a universal problem faced by all businesses. The 
>> Apache incubation process encourages an open and diverse meritocratic 
>> community. Eagle intends to make every possible effort to build a diverse, 
>> vibrant and involved community, and has already received substantial 
>> interest from various organizations.
>> 
>> Reliance on Salaried Developers
>> eBay invested in Eagle as the monitoring solution for its Hadoop clusters, 
>> and some of its key engineers are working full time on the project. In 
>> addition, since there is a growing need to secure sensitive data access 
>> with a data activity monitoring solution for Hadoop, we look forward to 
>> other Apache developers and researchers contributing to the project. 
>> Additional contributors, including Apache committers, plan to join this 
>> effort shortly. Key to addressing the risk of relying on salaried 
>> developers from a single entity is increasing the diversity of contributors 
>> and actively recruiting domain experts in the security space to contribute; 
>> Eagle intends to do this.
>> 
>> Relationships with Other Apache Products
>> Eagle has a strong relationship with, and dependency on, Apache Hadoop, 
>> HBase, Spark, Kafka and Storm. Being part of Apache's incubation community 
>> could help foster closer collaboration among these projects as well as 
>> others.
>> 
>> An Excessive Fascination with the Apache Brand
>> Eagle is proposing to enter incubation at Apache in order to help diversify 
>> the committer base, not to capitalize on the Apache brand. The Eagle 
>> project is already in production use inside eBay, but is not expected to be 
>> an eBay product for external customers. As such, the Eagle project is not 
>> seeking to use the Apache brand as a marketing tool.
>> 
>> Documentation
>> Information about Eagle can be found at https://github.com/eBay/Eagle. The 
>> following link provides more information about Eagle: http://goeagle.io.
>> 
>> Initial Source
>> Eagle has been under development since 2014 by a team of engineers at eBay 
>> Inc. It is currently hosted on GitHub under the Apache License 2.0 at 
>> https://github.com/eBay/Eagle. Once in incubation, we will move the code 
>> base to the Apache Git repository.
>> 
>> External Dependencies
>> Eagle has the following external dependencies.
>> Basic
>> •JDK 1.7+
>> •Scala 2.10.4
>> •Apache Maven
>> •JUnit
>> •Log4j
>> •Slf4j
>> •Apache Commons
>> •Apache Commons Math3
>> •Jackson
>> •Siddhi CEP engine
>> 
>> Hadoop
>> •Apache Hadoop
>> •Apache HBase
>> •Apache Hive
>> •Apache Zookeeper
>> •Apache Curator
>> 
>> Apache Spark
>> •Spark Core Library
>> 
>> REST Service
>> •Jersey
>> 
>> Query
>> •Antlr
>> 
>> Stream processing
>> •Apache Storm
>> •Apache Kafka
>> 
>> Web
>> •AngularJS
>> •jQuery
>> •Bootstrap V3
>> •Moment JS
>> •Admin LTE
>> •html5shiv
>> •respond
>> •Fastclick
>> •Date Range Picker
>> •Flot JS
>> 
>> Cryptography
>> Eagle will eventually support encryption on the wire. This is not one of the 
>> initial goals, and we do not expect Eagle to be a controlled export item due 
>> to the use of encryption. Eagle supports but does not require the Kerberos 
>> authentication mechanism to access secured Hadoop services.
>> 
>> Required Resources
>> 
>> Mailing List
>> •eagle-private for private PMC discussions
>> •eagle-dev for developers
>> •eagle-commits for all commits
>> •eagle-users for all eagle users
>> 
>> Subversion Directory
>> •Git is the preferred source control system.
>> 
>> Issue Tracking
>> •JIRA Eagle (Eagle)
>> 
>> Other Resources
>> The existing code already has unit tests, so we will make use of the 
>> existing Apache continuous testing infrastructure. The resulting load 
>> should not be very large.
>> 
>> Initial Committers
>> •Seshu Adunuthula <sadunuthula at ebay dot com>
>> •Arun Manoharan <armanoharan at ebay dot com>
>> •Edward Zhang <yonzhang at ebay dot com>
>> •Hao Chen <hchen9 at ebay dot com>
>> •Chaitali Gupta <cgupta at ebay dot com>
>> •Libin Sun <libsun at ebay dot com>
>> •Jilin Jiang <jiljiang at ebay dot com>
>> •Qingwen Zhao <qingwzhao at ebay dot com>
>> •Hemanth Dendukuri <hdendukuri at ebay dot com>
>> •Senthil Kumar <senthilkumar at ebay dot com>
>> •Tan Chen <tanchen at ebay dot com>
>> 
>> Affiliations
>> The initial committers are employees of eBay Inc.
>> 
>> Sponsors
>> 
>> Champion
>> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>> 
>> Nominated Mentors
>> •Owen O’Malley <omalley at apache dot org> - Apache IPMC member, Hortonworks
>> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>> •Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
>> 
>> Sponsoring Entity
>> We are requesting the Incubator to sponsor this project.
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
