It makes sense. I will try to contribute to this ;)

Regards
JB

On 10/19/2015 09:46 PM, Zhang, Edward (GDI Hadoop) wrote:
Hi JB,

That is a good point. Good to know that Falcon feeds HDFS/Hive/HBase data
changes; this feature would complement Eagle, which today mainly focuses
on HDFS/Hive/HBase data access, including views, changes, deletes, etc. Eagle
would benefit if it could instantly capture data changes from Falcon.

Thanks
Edward Zhang



On 10/19/15, 8:40, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

Hi Arun,

Very interesting proposal. I see some possible interaction with
Falcon: in Falcon, we have monitoring of HDFS files (and Hive/HBase),
with a kind of Change Data Capture, etc.

So, I see a different perspective in Eagle, but Eagle could also
leverage Falcon somehow.

Regards
JB

On 10/19/2015 05:33 PM, Manoharan, Arun wrote:
Hello Everyone,

My name is Arun Manoharan. Currently a product manager in the Analytics
platform team at eBay Inc.

I would like to start a discussion on Eagle and its joining the ASF as
an incubation project.

Eagle is a monitoring solution for Hadoop that instantly identifies access
to sensitive data, recognizes attacks and malicious activities, and takes
action in real time. Eagle supports a wide variety of policies on HDFS
data and Hive. Eagle also provides machine learning models for detecting
anomalous user behavior in Hadoop.

The proposal is available on the wiki here:
https://wiki.apache.org/incubator/EagleProposal

The text of the proposal is also available at the end of this email.

Thanks for your time and help.

Thanks,
Arun

<COPY of the proposal in text format>

Eagle

Abstract
Eagle is an open-source monitoring solution for Hadoop to instantly
identify access to sensitive data, recognize attacks and malicious
activities in Hadoop, and take action.

Proposal
Eagle audits access to HDFS files and Hive and HBase tables in real time,
enforces policies defined on sensitive data access, and alerts on or blocks
a user's access to that sensitive data in real time. Eagle also builds
user profiles based on typical access behaviour for HDFS and Hive
and sends alerts when anomalous behaviour is detected. Eagle can also
import sensitive data classifications produced by external classification
engines to help define its policies.

Overview of Eagle
Eagle has 3 main parts.
1. Data collection and storage - Eagle collects data from various Hadoop
logs in real time using Kafka and the YARN API, and uses HDFS and HBase
for storage.
2. Data processing and policy engine - Eagle allows users to create
policies based on various metadata properties of HDFS, Hive and HBase
data.
3. Eagle services - Eagle services include the policy manager, query
service and the visualization component. Eagle provides an intuitive user
interface to administer Eagle and an alert dashboard to respond to
real-time alerts.

Data Collection and Storage:
Eagle provides a programming API for integrating any data source into the
Eagle policy evaluation framework. For example, Eagle HDFS audit
monitoring collects data from Kafka, which is populated by a namenode
log4j appender or a logstash agent. Eagle Hive monitoring collects Hive
query logs from running jobs through the YARN API, and is designed to be
scalable and fault-tolerant. Eagle uses HBase to store metadata and
metrics data, and also supports relational databases through a
configuration change.
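To make the collection step concrete, here is a minimal sketch (in Python, purely illustrative; Eagle itself is JVM-based) of parsing one HDFS namenode audit log line into the key=value fields that a collector would ship to Kafka. The sample line follows the standard FSNamesystem audit format; real deployments may carry extra fields.

```python
import re

def parse_audit_line(line):
    # Pull key=value pairs out of an HDFS namenode audit log line.
    # "null" values are normalized to None.
    fields = {}
    for key, value in re.findall(r'(\w+)=(\S+)', line):
        fields[key] = None if value == 'null' else value
    return fields

line = ("2015-10-19 21:46:00,000 INFO FSNamesystem.audit: "
        "allowed=true ugi=alice ip=/10.0.0.1 cmd=open "
        "src=/data/secure/file dst=null perm=null")
event = parse_audit_line(line)
```

A collector would run this per line and publish each resulting event to a Kafka topic for downstream policy evaluation.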

Data Processing and Policy Engine:
Processing Engine: Eagle provides a stream processing API that is an
abstraction over Apache Storm; it can also be extended to other streaming
engines. This abstraction allows developers to assemble data
transformation, filtering, external data joins, etc. without being
physically bound to a specific streaming platform. The Eagle streaming API
allows developers to easily integrate business logic with the Eagle policy
engine; internally, the Eagle framework compiles the business logic
execution DAG into program primitives of the underlying stream
infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring
transforms audit logs from the namenode into objects and joins them with
sensitivity metadata and security zone metadata, which are generated by
external programs or configured by users. Eagle Hive monitoring filters
running jobs to get the Hive query string, parses the query string into an
object, and then joins it with sensitivity metadata.
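As a rough illustration of that transform-and-join flow (not Eagle's actual API; all names here are made up), a sensitivity join over audit events might look like:

```python
# Sensitivity metadata as would be imported from an external classifier.
SENSITIVITY = {"/data/secure/file": "PII"}

def enrich(event):
    # Join an audit event with sensitivity metadata on its source path.
    return dict(event, sensitivityType=SENSITIVITY.get(event.get("src")))

def touches_sensitive_data(event):
    # Keep only events that access classified data.
    return event["sensitivityType"] is not None

events = [
    {"user": "alice", "cmd": "open", "src": "/data/secure/file"},
    {"user": "bob", "cmd": "open", "src": "/tmp/scratch"},
]
matches = [e for e in map(enrich, events) if touches_sensitive_data(e)]
```

In the real system each step would be compiled into the primitives of the underlying streaming platform (e.g. Storm bolts) rather than run as plain functions.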
Alerting Framework: the Eagle alert framework includes a stream metadata
API, a scalable policy engine framework, and an extensible policy engine
framework. The stream metadata API allows developers to declare the event
schema: what attributes constitute an event, the type of each attribute,
and how to dynamically resolve attribute values at runtime when a user
configures a policy. The scalable policy engine framework allows policies
to be executed on different physical nodes in parallel; developers can
also define their own policy partitioner class. The policy engine
framework, together with the stream partitioning capability provided by
all streaming platforms, makes sure policies and events can be evaluated
in a fully distributed way. The extensible policy engine framework allows
developers to plug in a new policy engine with a few lines of code. The
WSO2 Siddhi CEP engine is the policy engine that Eagle supports as a
first-class citizen.
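For a flavor of what a Siddhi-based policy looks like, a rule alerting on deletes under a sensitive path could be written roughly as follows (stream and attribute names are illustrative, not Eagle's exact schema):

```
from hdfsAuditLogEventStream[(src == '/data/secure/file') and (cmd == 'delete')]
select *
insert into alertStream;
```

The stream metadata API is what tells the engine which attributes (src, cmd, ...) exist on hdfsAuditLogEventStream and what their types are.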
Machine Learning module: Eagle provides capabilities to define user
activity patterns, or user profiles, for Hadoop users based on their
behaviour on the platform. These user profiles are modeled using machine
learning algorithms and used to detect anomalous user activities. Eagle
uses Eigenvalue Decomposition and Density Estimation algorithms for
generating user profile models. The modeling pipeline reads data from HDFS
audit logs, preprocesses and aggregates the data, and generates models
using the Spark programming APIs. Once models are generated, Eagle uses
its stream processing engine for near-real-time anomaly detection to
determine whether a user's activities are suspicious.
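As a toy example of the density-estimation idea (a deliberately simplified stand-in for the Spark-based models described above), a per-user profile could be a Gaussian fit over historical activity counts, with alerts on large deviations:

```python
import math

def fit_profile(samples):
    # Fit mean and standard deviation over a user's historical
    # per-hour operation counts.
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, math.sqrt(var)

def is_anomalous(value, profile, threshold=3.0):
    # Flag values more than `threshold` standard deviations from the mean.
    mean, std = profile
    return abs(value - mean) > threshold * std

history = [10, 12, 9, 11, 10, 13, 8, 11]  # typical hourly read counts
profile = fit_profile(history)
```

The production pipeline would instead train richer per-user models offline in Spark and evaluate them inside the streaming engine.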

Eagle Services:
Query Service: Eagle provides a SQL-like service API to support
comprehensive computation over huge data sets on the fly, e.g.
filtering, aggregation, histograms, sorting, top-N,
arithmetical expressions, pagination, etc. HBase is the data storage that
Eagle supports as a first-class citizen; relational databases are
supported as well. For HBase storage, the Eagle query framework compiles a
user-provided SQL-like query into HBase native filter objects and executes
it through an HBase coprocessor on the fly.
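The filter-then-aggregate behavior of the query service can be sketched as follows (an in-memory stand-in; the real service compiles the SQL-like query into HBase filters executed by a coprocessor):

```python
def run_query(rows, predicate, group_key):
    # Filter rows, group them by a key, and count each group --
    # the shape of a typical filtering + aggregation query.
    groups = {}
    for row in filter(predicate, rows):
        groups[row[group_key]] = groups.get(row[group_key], 0) + 1
    return groups

rows = [
    {"user": "alice", "cmd": "delete"},
    {"user": "alice", "cmd": "open"},
    {"user": "bob", "cmd": "delete"},
]
counts = run_query(rows, lambda r: r["cmd"] == "delete", "user")
```

Pushing this evaluation into a coprocessor means only the aggregated result crosses the network, which is what makes on-the-fly computation over huge data sets feasible.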
Policy Manager: the Eagle policy manager provides a UI and RESTful API for
users to define policies with just a few clicks. It includes a site
management UI, a policy editor, sensitivity metadata import, HDFS and Hive
sensitive resource browsing, alert dashboards, etc.

Background
Data is one of the most important assets for today's businesses, which
makes data security one of the top priorities of today's enterprises.
Hadoop is widely used across different verticals as a big data repository
in most modern enterprises.
At eBay we use the Hadoop platform extensively for our data processing
needs. Our data in Hadoop keeps growing as our user base grows
exponentially. Today there is a variety of data sets available in the
Hadoop cluster for our users to consume. eBay has around 120 PB of data
stored in HDFS across 6 different clusters, and around 1800+ active
Hadoop users consume data through Hive, HBase and MapReduce jobs every day
to build applications using this data. With this astronomical growth of
data there are also challenges in securing sensitive data and monitoring
access to it. Today in large organizations HDFS is the de facto standard
for storing big data. Data sets including, but not limited to, consumer
sentiment, social media data, customer segmentation, web clicks, sensor
data, geo-location and transaction data get stored in Hadoop for
day-to-day business needs.
We at eBay want to make sure sensitive data and data platforms are
completely protected from security breaches, so we partnered very closely
with our Information Security team to understand the requirements for
Eagle to monitor sensitive data access on Hadoop:
1. Ability to identify and stop security threats in real time
2. Scale for big data (support PB scale and billions of events)
3. Ability to create data access policies
4. Support multiple data sources like HDFS, HBase, Hive
5. Visualize alerts in real time
6. Ability to block malicious access in real time
We did not find any data access monitoring solution available today that
can provide the features and functionality we need to monitor data access
in the Hadoop ecosystem at our scale. Hence, with an excellent team of
world-class developers and several users, we have been able to bring Eagle
into production as well as open source it.

Rationale
In today's world, data is an important asset for any company.
Businesses are using data extensively to create amazing experiences for
users. Data has to be protected and access to data has to be secured
against security breaches. Today Hadoop is used to store not only logs but
also financial data, sensitive data sets, geographical data, user click
stream data sets, etc., which makes it even more important to protect it
from security breaches. Securing a data platform requires multiple things.
One is a strong access control mechanism, which today is provided by
Apache Ranger and Apache Sentry. These tools provide fine-grained access
control for data sets on Hadoop. But there is a big gap in terms of
monitoring all the data access events and activities in order to secure
the Hadoop data platform. Together with strong access control and
perimeter security, data access monitoring ensures that data in Hadoop
clusters can be secured against breaches. We looked around and found the
following:
Existing data activity monitoring products are designed for traditional
databases and data warehouses. Existing monitoring platforms cannot scale
out to support fast-growing data at petabyte scale. The few products in
the industry that do exist are still very early in terms of supporting
HDFS, Hive and HBase data access monitoring.
As mentioned in the background, the business requirement and urgency to
secure the data from users with malicious intent drove eBay to invest in
building a real time data access monitoring solution from scratch to
offer real time alerts and remediation features for malicious data
access.
With the power of open-source distributed systems like Hadoop, Kafka and
others, we were able to develop a data activity monitoring system that can
scale, and identify and stop malicious access in real time.
Eagle allows admins to create standard access policies and rules for
monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box
machine learning models that build user profiles from user access
behaviour, and uses these models to alert on anomalies.

Current Status

Meritocracy
Eagle has been deployed in production at eBay, monitoring billions of
events per day from HDFS and Hive operations. From the start, the product
has been built with high scalability and application extensibility in
mind, and Eagle has demonstrated great performance in responding to
suspicious events instantly and great flexibility in defining policies.

Community
Eagle seeks to develop the developer and user communities during
incubation.

Core Developers
Eagle is currently being designed and developed by engineers from eBay
Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang,
Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri and Arun Manoharan. All of
these core developers have deep expertise in developing monitoring
products for the Hadoop ecosystem.

Alignment
The ASF is a natural host for Eagle given that it is already the home of
Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
projects. Eagle leverages a lot of Apache open-source products. Eagle was
designed to offer real-time insights into sensitive data access by
actively monitoring data access across various data sets in Hadoop, with
an extensible alerting framework and a powerful policy engine. Eagle
complements the existing Hadoop platform area by providing a comprehensive
monitoring and alerting solution for detecting sensitive data access
threats based on preset policies and machine learning models for user
behaviour analysis.

Known Risks

Orphaned Products
The core developers of the Eagle team work full time on this project.
There is no risk of Eagle being orphaned, since eBay uses it extensively
in its production Hadoop clusters and has plans to go beyond Hadoop. For
example, currently there are 7 Hadoop clusters, 2 of which are being
monitored using Eagle in production. We have plans to extend it to all
Hadoop clusters and eventually other data platforms. Tens of policies are
onboarded and actively monitored, with plans to onboard more use cases. We
are very confident that every Hadoop cluster in the world will be
monitored using Eagle, securing the Hadoop ecosystem by actively
monitoring access to sensitive data. We plan to extend and diversify this
community further through Apache. We presented Eagle at the Hadoop Summit
in China and garnered interest from different companies who use Hadoop
extensively.

Inexperience with Open Source
The core developers are all active users and followers of open source.
They are already committers and contributors to the Eagle GitHub project.
All have been involved with source code that has been released under an
open-source license, and several of them also have experience developing
code in an open-source environment. Though the core set of developers does
not have Apache open-source experience, there are plans to onboard
individuals with Apache open-source experience onto the project. Apache
Kylin PMC members are also in the same eBay organization. We work very
closely with Apache Ranger committers and look forward to finding
meaningful integrations to improve the security of the Hadoop platform.

Homogenous Developers
The core developers are from eBay. Today the problem of monitoring data
activities to find and stop threats is a universal problem faced by all
businesses. The Apache incubation process encourages an open and diverse
meritocratic community. Eagle intends to make every possible effort to
build a diverse, vibrant and involved community, and has already received
substantial interest from various organizations.

Reliance on Salaried Developers
eBay invested in Eagle as the monitoring solution for its Hadoop clusters,
and some of its key engineers are working full time on the project. In
addition, since there is a growing need to secure sensitive data access in
Hadoop with a data activity monitoring solution, we look forward to other
Apache developers and researchers contributing to the project. Additional
contributors, including Apache committers, have plans to join this effort
shortly. Also key to addressing the risk associated with relying on
salaried developers from a single entity is increasing the diversity of
the contributors and actively lobbying for domain experts in the security
space to contribute. Eagle intends to do this.

Relationships with Other Apache Products
Eagle has a strong relationship with, and dependency on, Apache Hadoop,
HBase, Spark, Kafka and Storm. Being part of Apache's incubation community
could help with closer collaboration among these projects as well as
others.

An Excessive Fascination with the Apache Brand
Eagle is proposing to enter incubation at Apache in order to help efforts
to diversify the committer base, not so much to capitalize on the Apache
brand. The Eagle project is already in production use inside eBay, but is
not expected to be an eBay product for external customers. As such, the
Eagle project is not seeking to use the Apache brand as a marketing tool.

Documentation
Information about Eagle can be found at https://github.com/eBay/Eagle.
The following link provides more information about Eagle:
http://goeagle.io.

Initial Source
Eagle has been under development since 2014 by a team of engineers at
eBay Inc. It is currently hosted on GitHub under the Apache License 2.0 at
https://github.com/eBay/Eagle. Once in incubation, we will move the code
base to the Apache Git repository.

External Dependencies
Eagle has the following external dependencies.
Basic
* JDK 1.7+
* Scala 2.10.4
* Apache Maven
* JUnit
* Log4j
* Slf4j
* Apache Commons
* Apache Commons Math3
* Jackson
* Siddhi CEP engine

Hadoop
* Apache Hadoop
* Apache HBase
* Apache Hive
* Apache ZooKeeper
* Apache Curator

Apache Spark
* Spark Core Library

REST Service
* Jersey

Query
* Antlr

Stream processing
* Apache Storm
* Apache Kafka

Web
* AngularJS
* jQuery
* Bootstrap v3
* Moment JS
* Admin LTE
* html5shiv
* respond
* Fastclick
* Date Range Picker
* Flot JS

Cryptography
Eagle will eventually support encryption on the wire. This is not one
of the initial goals, and we do not expect Eagle to be a controlled
export item due to the use of encryption. Eagle supports but does not
require the Kerberos authentication mechanism to access secured Hadoop
services.

Required Resources

Mailing List
* eagle-private for private PMC discussions
* eagle-dev for developers
* eagle-commits for all commits
* eagle-users for all Eagle users

Subversion Directory
* Git is the preferred source control system.

Issue Tracking
* JIRA Eagle (Eagle)

Other Resources
The existing code already has unit tests so we will make use of
existing Apache continuous testing infrastructure. The resulting load
should not be very large.

Initial Committers
* Seshu Adunuthula <sadunuthula at ebay dot com>
* Arun Manoharan <armanoharan at ebay dot com>
* Edward Zhang <yonzhang at ebay dot com>
* Hao Chen <hchen9 at ebay dot com>
* Chaitali Gupta <cgupta at ebay dot com>
* Libin Sun <libsun at ebay dot com>
* Jilin Jiang <jiljiang at ebay dot com>
* Qingwen Zhao <qingwzhao at ebay dot com>
* Hemanth Dendukuri <hdendukuri at ebay dot com>
* Senthil Kumar <senthilkumar at ebay dot com>
* Tan Chen <tanchen at ebay dot com>

Affiliations
The initial committers are employees of eBay Inc.

Sponsors

Champion
* Henry Saputra <hsaputra at apache dot org> - Apache IPMC member

Nominated Mentors
* Owen O'Malley <omalley at apache dot org> - Apache IPMC member,
Hortonworks
* Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
* Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
Hortonworks

Sponsoring Entity
We are requesting the Incubator to sponsor this project.




--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org


