I would suggest that Owen O'Malley has not had enough time to be a viable mentor recently and should not be on the list of mentors.
Henry and Julian are good if their schedules permit. Henry, I know, has been mentoring a number of projects lately.

On Mon, Oct 19, 2015 at 8:40 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Arun,
>
> very interesting proposal. I may see some possible interaction with
> Falcon. In Falcon, we have HDFS files (and Hive/HBase) monitoring (with a
> kind of Change Data Capture), etc.
>
> So, I see a different perspective in Eagle, but Eagle could also leverage
> Falcon somehow.
>
> Regards
> JB
>
> On 10/19/2015 05:33 PM, Manoharan, Arun wrote:
>
>> Hello Everyone,
>>
>> My name is Arun Manoharan. I am currently a product manager in the
>> Analytics platform team at eBay Inc.
>>
>> I would like to start a discussion on Eagle and its joining the ASF as an
>> incubation project.
>>
>> Eagle is a monitoring solution for Hadoop to instantly identify access to
>> sensitive data, recognize attacks and malicious activities, and take
>> action in real time. Eagle supports a wide variety of policies on HDFS
>> data and Hive. Eagle also provides machine learning models for detecting
>> anomalous user behavior in Hadoop.
>>
>> The proposal is available on the wiki here:
>> https://wiki.apache.org/incubator/EagleProposal
>>
>> The text of the proposal is also available at the end of this email.
>>
>> Thanks for your time and help.
>>
>> Thanks,
>> Arun
>>
>> <COPY of the proposal in text format>
>>
>> Eagle
>>
>> Abstract
>> Eagle is an open source monitoring solution for Hadoop to instantly
>> identify access to sensitive data, recognize attacks and malicious
>> activities in Hadoop, and take action.
>>
>> Proposal
>> Eagle audits access to HDFS files, Hive and HBase tables in real time,
>> enforces policies defined on sensitive data access and alerts on or blocks
>> the user’s access to that sensitive data in real time. Eagle also creates
>> user profiles based on the typical access behaviour for HDFS and Hive and
>> sends alerts when anomalous behaviour is detected.
>> Eagle can also import sensitive data information classified by external
>> classification engines to help define its policies.
>>
>> Overview of Eagle
>> Eagle has 3 main parts.
>> 1. Data collection and storage - Eagle collects data from various Hadoop
>> logs in real time using the Kafka/YARN API and uses HDFS and HBase for
>> storage.
>> 2. Data processing and policy engine - Eagle allows users to create
>> policies based on various metadata properties on HDFS, Hive and HBase data.
>> 3. Eagle services - Eagle services include the policy manager, the query
>> service and the visualization component. Eagle provides an intuitive user
>> interface to administer Eagle and an alert dashboard to respond to
>> real-time alerts.
>>
>> Data Collection and Storage:
>> Eagle provides a programming API for extending Eagle to integrate any data
>> source into the Eagle policy evaluation framework. For example, Eagle HDFS
>> audit monitoring collects data from Kafka, which is populated from the
>> NameNode log4j appender or from a Logstash agent. Eagle Hive monitoring
>> collects Hive query logs from running jobs through the YARN API, which is
>> designed to be scalable and fault-tolerant. Eagle uses HBase as storage
>> for metadata and metrics data, and also supports relational databases
>> through a configuration change.
>>
>> Data Processing and Policy Engine:
>> Processing Engine: Eagle provides a stream processing API which is an
>> abstraction of Apache Storm. It can also be extended to other streaming
>> engines. This abstraction allows developers to assemble data
>> transformation, filtering, external data joins etc. without being
>> physically bound to a specific streaming platform. The Eagle streaming API
>> allows developers to easily integrate business logic with the Eagle policy
>> engine, and internally the Eagle framework compiles the business logic
>> execution DAG into program primitives of the underlying stream
>> infrastructure, e.g. Apache Storm.
>> For example, Eagle HDFS monitoring transforms the audit log from the
>> NameNode into an object and joins it with sensitivity metadata and
>> security zone metadata, which are generated by external programs or
>> configured by the user. Eagle Hive monitoring filters running jobs to get
>> the Hive query string, parses the query string into an object, and then
>> joins sensitivity metadata.
>> Alerting Framework: The Eagle Alert Framework includes a stream metadata
>> API, a scalable policy engine framework, and an extensible policy engine
>> framework. The stream metadata API allows developers to declare an event
>> schema, including what attributes constitute an event, what the type of
>> each attribute is, and how to dynamically resolve an attribute value at
>> runtime when the user configures a policy. The scalable policy engine
>> framework allows policies to be executed on different physical nodes in
>> parallel. It can also be used to define your own policy partitioner class.
>> The policy engine framework, together with the streaming partitioning
>> capability provided by all streaming platforms, will make sure policies
>> and events can be evaluated in a fully distributed way. The extensible
>> policy engine framework allows developers to plug in a new policy engine
>> with a few lines of code. The WSO2 Siddhi CEP engine is the policy engine
>> which Eagle supports as a first-class citizen.
>> Machine Learning module: Eagle provides capabilities to define user
>> activity patterns or user profiles for Hadoop users based on the user
>> behaviour in the platform. These user profiles are modeled using machine
>> learning algorithms and used for detection of anomalous user activities.
>> Eagle uses Eigen Value Decomposition and Density Estimation algorithms for
>> generating user profile models. The model reads data from HDFS audit logs,
>> preprocesses and aggregates data, and generates models using Spark
>> programming APIs.
>> Once models are generated, Eagle uses the stream processing engine for
>> near real-time anomaly detection to determine whether any user’s
>> activities are suspicious.
>>
>> Eagle Services:
>> Query Service: Eagle provides a SQL-like service API to support
>> comprehensive computation over huge data sets on the fly, e.g.
>> comprehensive filtering, aggregation, histogram, sorting, top, arithmetical
>> expression, pagination etc. HBase is the data storage which Eagle supports
>> as a first-class citizen; relational databases are supported as well. For
>> HBase storage, the Eagle query framework compiles the user-provided
>> SQL-like query into HBase native filter objects and executes them through
>> an HBase coprocessor on the fly.
>> Policy Manager: The Eagle policy manager provides a UI and a RESTful API
>> for users to define a policy with just a few clicks. It includes a site
>> management UI, policy editor, sensitivity metadata import, HDFS or Hive
>> sensitive resource browsing, alert dashboards etc.
>>
>> Background
>> Data is one of the most important assets for today’s businesses, which
>> makes data security one of the top priorities of today’s enterprises.
>> Hadoop is widely used across different verticals as a big data repository
>> to store this data in most modern enterprises.
>> At eBay we use the Hadoop platform extensively for our data processing
>> needs. Our data in Hadoop is becoming bigger and bigger as our user base
>> is seeing exponential growth. Today there are a variety of data sets
>> available in the Hadoop cluster for our users to consume. eBay has around
>> 120 PB of data stored in HDFS across 6 different clusters and around 1800+
>> active Hadoop users consuming data through Hive, HBase and MapReduce jobs
>> every day to build applications using this data. With this astronomical
>> growth of data there are also challenges in securing sensitive data and
>> monitoring the access to this sensitive data.
>> Today in large organizations HDFS is the de facto standard for storing big
>> data. Data sets including, but not limited to, consumer sentiment, social
>> media data, customer segmentation, web clicks, sensor data, geo-location
>> and transaction data get stored in Hadoop for day-to-day business needs.
>> We at eBay want to make sure the sensitive data and data platforms are
>> completely protected from security breaches. So we partnered very closely
>> with our Information Security team to understand the requirements for Eagle
>> to monitor sensitive data access on Hadoop:
>> 1. Ability to identify and stop security threats in real time
>> 2. Scale for big data (support PB scale and billions of events)
>> 3. Ability to create data access policies
>> 4. Support multiple data sources like HDFS, HBase, Hive
>> 5. Visualize alerts in real time
>> 6. Ability to block malicious access in real time
>> We did not find any data access monitoring solution that is available
>> today and can provide the features and functionality that we need to
>> monitor the data access in the Hadoop ecosystem at our scale. Hence, with
>> an excellent team of world-class developers and several users, we have
>> been able to bring Eagle into production as well as open source it.
>>
>> Rationale
>> In today’s world, data is an important asset for any company. Businesses
>> are using data extensively to create amazing experiences for users. Data
>> has to be protected and access to data should be secured from security
>> breaches. Today Hadoop is not only used to store logs but also stores
>> financial data, sensitive data sets, geographical data, user click stream
>> data sets etc., which makes it more important to protect it from security
>> breaches. To secure a data platform there are multiple things that need to
>> happen. One is having a strong access control mechanism, which today is
>> provided by Apache Ranger and Apache Sentry.
>> These tools provide fine-grained access control for data sets on Hadoop.
>> But there is a big gap in terms of monitoring all the data access events
>> and activities in order to secure the Hadoop data platform. With strong
>> access control, perimeter security and data access monitoring in place,
>> data in the Hadoop clusters can be secured against breaches. We looked
>> around and found the following:
>> Existing data activity monitoring products are designed for traditional
>> databases and data warehouses. Existing monitoring platforms cannot scale
>> out to support fast-growing data and petabyte scale. A few products in the
>> industry are still very early in terms of supporting HDFS, Hive and HBase
>> data access monitoring.
>> As mentioned in the background, the business requirement and urgency to
>> secure the data from users with malicious intent drove eBay to invest in
>> building a real-time data access monitoring solution from scratch to offer
>> real-time alerts and remediation features for malicious data access.
>> With the power of open source distributed systems like Hadoop, Kafka and
>> many more we were able to develop a data activity monitoring system that
>> can scale, and identify and stop malicious access in real time.
>> Eagle allows admins to create standard access policies and rules for
>> monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box
>> machine learning models for modeling user profiles based on user access
>> behaviour and uses the models to alert on anomalies.
>>
>> Current Status
>>
>> Meritocracy
>> Eagle has been deployed in production at eBay for monitoring billions of
>> events per day from HDFS and Hive operations. From the start, the product
>> has been built with a focus on high scalability and application
>> extensibility, and Eagle has demonstrated great performance in responding
>> to suspicious events instantly and great flexibility in defining policy.
>>
>> Community
>> Eagle seeks to develop the developer and user communities during
>> incubation.
>>
>> Core Developers
>> Eagle is currently being designed and developed by engineers from eBay
>> Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang,
>> Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of
>> these core developers have deep expertise in developing monitoring products
>> for the Hadoop ecosystem.
>>
>> Alignment
>> The ASF is a natural host for Eagle given that it is already the home of
>> Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
>> projects. Eagle leverages a lot of Apache open-source products. Eagle was
>> designed to offer real-time insights into sensitive data access by actively
>> monitoring the data access on various data sets in Hadoop and an extensible
>> alerting framework with a powerful policy engine. Eagle complements the
>> existing Hadoop platform area by providing a comprehensive monitoring and
>> alerting solution for detecting sensitive data access threats based on
>> preset policies and machine learning models for user behaviour analysis.
>>
>> Known Risks
>>
>> Orphaned Products
>> The core developers of the Eagle team work full time on this project.
>> There is no risk of Eagle getting orphaned since eBay is extensively using
>> it in its production Hadoop clusters and has plans to go beyond Hadoop.
>> For example, currently there are 7 Hadoop clusters and 2 of them are being
>> monitored using Hadoop Eagle in production. We have plans to extend it to
>> all Hadoop clusters and eventually other data platforms. There are tens of
>> policies onboarded and actively monitored, with plans to onboard more use
>> cases. We are very confident that every Hadoop cluster in the world will
>> be monitored using Eagle for securing the Hadoop ecosystem by actively
>> monitoring for data access on sensitive data. We plan to extend and
>> diversify this community further through Apache.
>> We presented Eagle at the Hadoop Summit in China and garnered interest
>> from different companies who use Hadoop extensively.
>>
>> Inexperience with Open Source
>> The core developers are all active users and followers of open source.
>> They are already committers and contributors to the Eagle GitHub project.
>> All have been involved with the source code that has been released under an
>> open source license, and several of them also have experience developing
>> code in an open source environment. Though the core set of developers do
>> not have Apache open source experience, there are plans to onboard
>> individuals with Apache open source experience on to the project. Apache
>> Kylin PMC members are also in the same eBay organization. We work very
>> closely with Apache Ranger committers and are looking forward to finding
>> meaningful integrations to improve the security of the Hadoop platform.
>>
>> Homogenous Developers
>> The core developers are from eBay. Today the problem of monitoring data
>> activities to find and stop threats is a universal problem faced by all
>> businesses. The Apache Incubation process encourages an open and diverse
>> meritocratic community. Eagle intends to make every possible effort to
>> build a diverse, vibrant and involved community and has already received
>> substantial interest from various organizations.
>>
>> Reliance on Salaried Developers
>> eBay invested in Eagle as the monitoring solution for Hadoop clusters and
>> some of its key engineers are working full time on the project. In
>> addition, since there is a growing need for securing sensitive data access
>> and for a data activity monitoring solution for Hadoop, we look forward to
>> other Apache developers and researchers contributing to the project.
>> Additional contributors, including Apache committers, have plans to join
>> this effort shortly.
>> Also key to addressing the risk associated with relying on salaried
>> developers from a single entity is to increase the diversity of the
>> contributors and actively lobby for domain experts in the security space
>> to contribute. Eagle intends to do this.
>>
>> Relationships with Other Apache Products
>> Eagle has a strong relationship and dependency with Apache Hadoop, HBase,
>> Spark, Kafka and Storm. Being part of Apache’s Incubation community could
>> help with closer collaboration among these projects as well as others.
>>
>> An Excessive Fascination with the Apache Brand
>> Eagle is proposing to enter incubation at Apache in order to help efforts
>> to diversify the committer base, not so much to capitalize on the Apache
>> brand. The Eagle project is in production use already inside eBay, but is
>> not expected to be an eBay product for external customers. As such, the
>> Eagle project is not seeking to use the Apache brand as a marketing tool.
>>
>> Documentation
>> Information about Eagle can be found at https://github.com/eBay/Eagle.
>> The following link provides more information about Eagle:
>> http://goeagle.io
>>
>> Initial Source
>> Eagle has been under development since 2014 by a team of engineers at
>> eBay Inc. It is currently hosted on GitHub under the Apache License 2.0
>> at https://github.com/eBay/Eagle. Once in incubation we will be moving
>> the code base to the Apache git repository.
>>
>> External Dependencies
>> Eagle has the following external dependencies.
>> Basic
>> •JDK 1.7+
>> •Scala 2.10.4
>> •Apache Maven
>> •JUnit
>> •Log4j
>> •Slf4j
>> •Apache Commons
>> •Apache Commons Math3
>> •Jackson
>> •Siddhi CEP engine
>>
>> Hadoop
>> •Apache Hadoop
>> •Apache HBase
>> •Apache Hive
>> •Apache Zookeeper
>> •Apache Curator
>>
>> Apache Spark
>> •Spark Core Library
>>
>> REST Service
>> •Jersey
>>
>> Query
>> •Antlr
>>
>> Stream processing
>> •Apache Storm
>> •Apache Kafka
>>
>> Web
>> •AngularJS
>> •jQuery
>> •Bootstrap V3
>> •Moment JS
>> •Admin LTE
>> •html5shiv
>> •respond
>> •Fastclick
>> •Date Range Picker
>> •Flot JS
>>
>> Cryptography
>> Eagle will eventually support encryption on the wire. This is not one of
>> the initial goals, and we do not expect Eagle to be a controlled export
>> item due to the use of encryption. Eagle supports but does not require the
>> Kerberos authentication mechanism to access secured Hadoop services.
>>
>> Required Resources
>>
>> Mailing List
>> •eagle-private for private PMC discussions
>> •eagle-dev for developers
>> •eagle-commits for all commits
>> •eagle-users for all eagle users
>>
>> Subversion Directory
>> •Git is the preferred source control system.
>>
>> Issue Tracking
>> •JIRA Eagle (Eagle)
>>
>> Other Resources
>> The existing code already has unit tests so we will make use of existing
>> Apache continuous testing infrastructure. The resulting load should not be
>> very large.
>>
>> Initial Committers
>> •Seshu Adunuthula <sadunuthula at ebay dot com>
>> •Arun Manoharan <armanoharan at ebay dot com>
>> •Edward Zhang <yonzhang at ebay dot com>
>> •Hao Chen <hchen9 at ebay dot com>
>> •Chaitali Gupta <cgupta at ebay dot com>
>> •Libin Sun <libsun at ebay dot com>
>> •Jilin Jiang <jiljiang at ebay dot com>
>> •Qingwen Zhao <qingwzhao at ebay dot com>
>> •Hemanth Dendukuri <hdendukuri at ebay dot com>
>> •Senthil Kumar <senthilkumar at ebay dot com>
>> •Tan Chen <tanchen at ebay dot com>
>>
>> Affiliations
>> The initial committers are employees of eBay Inc.
>>
>> Sponsors
>>
>> Champion
>> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>>
>> Nominated Mentors
>> •Owen O’Malley <omalley at apache dot org> - Apache IPMC member,
>> Hortonworks
>> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>> •Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
>> Hortonworks
>>
>> Sponsoring Entity
>> We are requesting the Incubator to sponsor this project.
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com