Re: [DISCUSS] Accept Science Data Analytics Platform (SDAP) into Apache Incubator

Raphael Bircher Thu, 12 Oct 2017 13:20:48 -0700

Hi all

I offer my mentorship


Regards, Raphael
On .10.2017, at 21:31, Tom Barber <t...@spicule.co.uk> wrote:

Interesting stuff!

On Thu, Oct 12, 2017 at 8:00 PM, Huang, Thomas (398G) <thomas.hu...@jpl.nasa.gov> wrote:

Thanks Chris.  We are learning from the master ;-)

Big thanks to Lewis for getting us here!


Thomas.

--
Thomas Huang
Jet Propulsion Laboratory
4800 Oak Grove Drive, Mail Stop 158-242, Pasadena, CA 91109
Phone: 818.354.2747, Email: thomas.hu...@jpl.nasa.gov

DISCLAIMER: All personal and professional opinions presented herein are my
own and do not, in any way, represent the opinion or policy of JPL, NASA or
Caltech.


On 10/12/17, 10:29 AM, "Chris Mattmann" <mattm...@apache.org> wrote:

Very interesting project Lewis and Thomas and glad to see this coming
to the Incubator!



    On 10/11/17, 11:22 AM, "lewis john mcgibbney" <lewi...@apache.org> wrote:

        Hi Folks,
        I would like to open a DISCUSS thread on the topic of accepting the
        Science Data Analytics Platform (SDAP) <
        https://wiki.apache.org/incubator/SDAPProposal> Project into the
        Incubator.
        I am CC'ing Thomas Huang from NASA JPL who I have been working with to
        build community around a kick-ass set of software projects under the
        SDAP umbrella.
        At this stage we would very much appreciate critical feedback from the
        general@ community. We are also open to mentors who may have an
        interest in the project proposal.
        The proposal is pasted below.
        Thanks in advance,
        Lewis

        = Abstract =
        The Science Data Analytics Platform (SDAP) establishes an integrated
        data analytic center for Big Science problems. It focuses on technology
        integration, advancement and maturity.

        = Proposal =
        SDAP currently represents a collaboration between NASA Jet Propulsion
        Laboratory (JPL), Florida State University (FSU), the National Center
        for Atmospheric Research (NCAR), and George Mason University (GMU).
        SDAP brings together a number of NASA-funded big data technologies
        under a single umbrella: OceanXtremes (anomaly detection and ocean
        science), NEXUS (deep data analytic platform), DOMS (distributed in-situ
        to satellite matchup), MUDROD (search relevancy and discovery) and VQSS
        (Virtualized Quality Screening Service). VQSS will not be included in
        the original Incubator proposal; however, it is anticipated that a
        future source code donation will cover VQSS.

        = Background and Rationale =

        SDAP is a software solution currently geared to better enable
        scientists involved in advancing the study of the Earth's physical
        oceanography. With increasing global temperature, warming of the ocean,
        and melting ice sheets and glaciers, the impacts can be observed from
        changes in anomalous ocean temperature and circulation patterns, to
        increasing extreme weather events and stronger/more frequent
        hurricanes, to sea level rise and storm surges affecting coastlines,
        and may involve drastic changes and shifts in marine ecosystems. Ocean
        science communities are relying on data distributed through data
        centers such as JPL's Physical Oceanographic Data Active Archive
        Center (PO.DAAC) to conduct their research. In typical investigations,
        oceanographers follow a traditional workflow for using datasets:
        search, evaluate, download, and apply tools and algorithms to look for
        trends. While this workflow has worked very well historically for the
        oceanographic community, it cannot scale if the research involves
        massive amounts of data. NASA's Surface Water and Ocean Topography
        (SWOT) mission, scheduled to launch in April of 2021, is expected to
        generate over 20 PB of data for a nominal 3-year mission. This will
        challenge all existing NASA Earth Science data archival/distribution
        paradigms. It will no longer be feasible for Earth scientists to
        download and analyze such volumes of data. SDAP was therefore developed
        primarily as a Web-service platform for big ocean data science at the
        PO.DAAC, with open source solutions used to enable fast analysis of
        oceanographic data. SDAP has been developed collaboratively between
        JPL, FSU, NCAR, and GMU and is rapidly maturing to become the generic
        platform for the next generation of big science data solutions. The
        platform is an orchestration of several previously funded NASA big
        ocean data solutions using cloud technology, which include data
        analysis (NEXUS), anomaly detection (OceanXtremes), matchup (DOMS),
        subsetting, discovery (MUDROD), and visualization (VQSS). SDAP will
        enable web-accessible, fast data analysis directly on huge scientific
        data archives to minimize data movement and provide access, including
        subset, only to the relevant data.

        = Science Data Analytics Platform Project Overview =

        SDAP consists of several loosely coupled, independently functioning
        sub-projects. The graphic below displays an overview of how these
        sub-projects fuse together. N.B., although the graphic uses terminology
        relating to OceanWorks, essentially the SDAP architecture is identical.

        {{attachment:sdap.png}}

        == OceanXtremes ==
        Oceanographic Data-Intensive Anomaly Detection and Analysis Portal. An
        application that allows you to view imagery and perform analysis on sea
        level rise data.

        '''Objective'''

        Develop an anomaly detection system which identifies items, events or
        observations which do not conform to an expected pattern.
         * Mature and test domain-specific, multi-scale anomaly and feature
        detection algorithms.
         * Identify unexpected correlations between key measured variables.

        Demonstrate value of technologies in this service:
         * Adapted Map-Reduce data mining.
         * Algorithm profiling service.
         * Shared discovery and exploration search tools.
         * Automatic notification of events of interest.
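
To make the stated objective concrete, here is a minimal sketch of flagging observations that do not conform to an expected pattern. It is not the OceanXtremes algorithm: it assumes the "expected pattern" is simply the mean of a short series and uses a plain z-score test. The function name, the threshold, and the sample temperatures are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return (index, value, z_score) for points far from the series mean.

    Assumption: the expected pattern is the sample mean, and a point is
    anomalous when it lies more than `threshold` standard deviations away.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # A constant series has no outliers under this definition.
        return []
    return [
        (i, v, (v - mu) / sigma)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical sea surface temperatures (degrees C) with one warm spike.
sst = [18.1, 18.3, 18.0, 18.2, 18.4, 18.1, 22.9, 18.2, 18.3, 18.0]
anomalies = find_anomalies(sst)
for index, value, z in anomalies:
    print(f"index={index} value={value} z={z:.2f}")
```

A production system would more likely compare against a climatology per location and season rather than a single global mean; the z-score is used here only because it is the simplest test that fits the description.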

        == NEXUS ==
        NEXUS is an emerging technology developed at JPL
         * A Cloud-based/Cluster-based data platform that performs scalable
        handling of observational parameters analysis, designed to scale
        horizontally
         * Leverages a high-performance indexed, temporal, and geospatial
        search solution
         * Breaks data products into small chunks and stores them in a
        Cloud-based data store

        ''Data Volumes Exploding''
         * SWOT mission is coming
         * File I/O is slow

        ''Scalable Store & Compute is Available''
         * NoSQL cluster databases
         * Parallel compute, in-memory map-reduce
         * Bring Compute to Highly-Accessible Data (using Hybrid Cloud)

        ''Pre-Chunk and Summarize Key Variables''
         * Easy statistics instantly (milliseconds)
         * Harder statistics on-demand (in seconds)
         * Visualize original data (layers) on a map quickly
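
The "pre-chunk and summarize" idea above can be sketched as follows. This is an illustration under stated assumptions, not NEXUS code: the grid is a small in-memory 2D list, tiles are square, and the summary fields (count, sum, min, max) and all function names are hypothetical. The point it shows is that an area mean can be answered from per-tile summaries without re-reading the original values.

```python
def tile_grid(grid, tile_size):
    """Yield (row0, col0, values) for each square tile of a 2D list."""
    n_rows, n_cols = len(grid), len(grid[0])
    for r0 in range(0, n_rows, tile_size):
        for c0 in range(0, n_cols, tile_size):
            values = [
                grid[r][c]
                for r in range(r0, min(r0 + tile_size, n_rows))
                for c in range(c0, min(c0 + tile_size, n_cols))
            ]
            yield r0, c0, values


def summarize(values):
    """Precompute the statistics that make later queries cheap."""
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }


def build_index(grid, tile_size):
    """Map each tile's upper-left corner to its summary."""
    return {(r0, c0): summarize(v) for r0, c0, v in tile_grid(grid, tile_size)}


def mean_from_summaries(index):
    """Compute the overall mean from tile summaries only."""
    total = sum(s["sum"] for s in index.values())
    count = sum(s["count"] for s in index.values())
    return total / count


# Hypothetical 4x4 grid holding the values 0.0 .. 15.0.
grid = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
index = build_index(grid, tile_size=2)
overall_mean = mean_from_summaries(index)
print(len(index), overall_mean)
```

In a real deployment the index would live in a distributed store and the tiles would carry time and bounding-box metadata; the dictionary here stands in for both.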

        == DOMS ==
        The Distributed Oceanographic Match-Up Service
        DOMS is designed to reconcile satellite and in situ datasets in support
        of NASA's Earth Science mission. The service will provide a mechanism
        for users to input a series of geospatial references for satellite
        observations and receive the in situ observations that are matched to
        the satellite data within a selectable temporal and spatial domain.
        DOMS includes several characteristic in situ and satellite observation
        datasets, with an initial focus on salinity, sea temperature, and
        winds. DOMS will be used by the marine and satellite research
        communities to support a range of activities and several use cases will
        be described. The service is designed to provide a community-accessible
        tool that dynamically delivers matched data and allows the scientist to
        only work with the subset of data where the matches exist.
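
As a rough illustration of matching in situ observations to satellite observations "within a selectable temporal and spatial domain", the sketch below pairs records that fall within a distance tolerance and a time tolerance. It is not the DOMS implementation: the record layout, the brute-force pairwise loop, and the names `match_up`, `max_km` and `max_seconds` are assumptions made for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (
        sin(dlat / 2) ** 2
        + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def match_up(satellite_obs, in_situ_obs, max_km, max_seconds):
    """Return (satellite_id, in_situ_id) pairs inside both tolerances."""
    matches = []
    for sat in satellite_obs:
        for obs in in_situ_obs:
            close_in_time = abs(sat["time"] - obs["time"]) <= max_seconds
            close_in_space = (
                haversine_km(sat["lat"], sat["lon"], obs["lat"], obs["lon"])
                <= max_km
            )
            if close_in_time and close_in_space:
                matches.append((sat["id"], obs["id"]))
    return matches


# Hypothetical records; "time" is seconds since an arbitrary epoch.
satellite = [{"id": "sat-1", "lat": 30.0, "lon": -80.0, "time": 0}]
in_situ = [
    {"id": "buoy-a", "lat": 30.1, "lon": -80.1, "time": 1800},   # near, recent
    {"id": "buoy-b", "lat": 35.0, "lon": -80.0, "time": 600},    # too far away
    {"id": "buoy-c", "lat": 30.0, "lon": -80.0, "time": 90000},  # too late
]
pairs = match_up(satellite, in_situ, max_km=25.0, max_seconds=3600)
print(pairs)
```

The pairwise loop is quadratic and only suitable for small inputs; a service at scale would presumably rely on a spatial and temporal index to narrow candidates first.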

        == MUDROD ==

        Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to
        Improve Data Discovery and Access
        Data discovery accuracy is a challenging topic for both Earth science
        and other domains. It is especially true for scientific data sets that
        are not as popular as Amazon or Google data. MUDROD is focused on
        mining oceanic knowledge from the PO.DAAC user log files to improve the
        end user data discovery experience at PO.DAAC. There are three steps in
        the research: a) extracting the oceanographic semantics from three
        resources: SWEET, the GCMD ontology, and the keywords used by end users
        for searching PO.DAAC datasets, b) mining the linkage among different
        vocabularies based on user data discovery sessions, and c) building the
        linkage among vocabularies based on a comprehensive approach that
        considers domain de facto standards, e.g., SWEET and GCMD, and the
        knowledge mined from the log files. The semantics are used to improve
        data discovery by ranking results, navigating among vocabularies, and
        recommending data based on user searches.
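
Step b), mining the linkage among vocabulary terms from user sessions, can be illustrated with a simple co-occurrence measure. This sketch substitutes Jaccard similarity over sessions for whatever method MUDROD actually uses, which the proposal does not specify. The session data and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def term_linkage(sessions):
    """Score term pairs by Jaccard similarity of the sessions they appear in."""
    sessions_with = defaultdict(set)
    for session_id, terms in enumerate(sessions):
        for term in set(terms):
            sessions_with[term].add(session_id)

    linkage = {}
    for a, b in combinations(sorted(sessions_with), 2):
        shared = sessions_with[a] & sessions_with[b]
        if shared:
            union = sessions_with[a] | sessions_with[b]
            linkage[(a, b)] = len(shared) / len(union)
    return linkage


def related_terms(term, linkage):
    """Return (other_term, score) pairs, strongest link first."""
    scored = []
    for (a, b), score in linkage.items():
        if term == a:
            scored.append((b, score))
        elif term == b:
            scored.append((a, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical search sessions: each list holds the terms one user queried.
sessions = [
    ["sea surface temperature", "sst"],
    ["sst", "ghrsst"],
    ["sea surface temperature", "sst", "ghrsst"],
    ["ocean wind", "ascat"],
]
linkage = term_linkage(sessions)
suggestions = related_terms("sst", linkage)
print(suggestions)
```

Scores like these could feed the ranking and recommendation uses described above, for example by expanding a query with its most strongly linked terms.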

        = Current Status =
        All components of SDAP were originally designed and developed under
        grants from the NASA-funded Advanced Information Systems and
        Technologies (AIST) program. The initiative to bring the components
        together under the SDAP umbrella was granted through an AIST-funded
        follow-on grant which will run for roughly another 18 months.
        Currently no projects have made official releases so, outside of
        community building, this will be our primary Incubating goal. All SDAP
        source code is currently publicly available and licensed under the
        ALv2.0.

        = Meritocracy =

        The current developers are familiar with meritocratic open source
        development at Apache. The SDAP team consumes Apache products heavily
        with members being part of several Apache user communities. SDAP itself
        has critical dependencies upon Apache products. Lewis McGibbney (JPL
        employee), a Member of the ASF and V.P. of Apache Any23, Gora PMC Nutch,
        Tika, OODT, OCW, etc., is championing the effort to bring SDAP into and
        through the Apache Incubator and has been evangelizing the Apache Way
        to the current SDAP contributors such that the meritocratic process is
        well understood and followed. Apache was chosen specifically because we
        want to encourage this style of community development for the project
        and for it to carry SDAP forward to become the generic platform for the
        next generation of big science data solutions.

        = Community =
        The SDAP project is a fairly new effort and our community is not yet
        fully/firmly established. Initial committers comprising the SDAP roster
        have only recently fully come together as a unified team; however,
        there is a large degree of synergy between constituent members at JPL,
        FSU, NCAR, and GMU. Therefore, community building and publicity
        continue to be a major thrust. With the activity and exposure regularly
        attained by several community members, we hope to grow the SDAP
        presence in and across several (scientific) forums. The SDAP technology
        is generating interest within communities such as the Earth Science
        Information Partnership (ESIP), the American Geophysical Union (AGU)
        and a plethora of science meetings around the globe. This in effect, we
        hope, will further contribute towards the possibility of SDAP being
        used across Government Agencies such as NASA, NOAA, USGS, EPA, DOI,
        etc. as well as by researchers and students in academic institutions
        around the globe.
        During incubation, we will explicitly seek to increase our adoption,
        with SDAP already being featured on the agenda for several high
        profile, globally significant scientific conferences and meetings.

        = Core Developers =

        The current set of core developers is relatively small, including
        full-time staff and students from across JPL, FSU, NCAR, and GMU.
        Initial community management and participation will be distributed
        across the entire team, most of whom have been involved with the
        constituent projects for <2 years.

        = Alignment =
        All SDAP code is licensed under Apache v2.0.

        = Known Risks =

        == Orphaned products ==
        There are currently no orphaned products. Each component of SDAP has
        dedicated personnel leading and participating in its ongoing
        development. Additionally, there is substantial collaboration between
        projects, facilitated by regular project meetings which are specific to
        the initial member entities and focused on advancing physical
        oceanographic science.

        == Inexperience with Open Source ==
        JPL (in particular Lewis McGibbney) has been part of several efforts to
        transition to and grow project communities at Apache e.g. Apache OODT,
        Apache Open Climate Workbench, Apache Joshua (Incubating), Apache
        SensSoft (Incubating), Apache DRAT (Incubating). Most of the code
        developed under the SDAP umbrella was already open source prior to the
        Incubator effort so we are well familiarized with the nuances of open
        source software.


        = Relationships with Other Apache Products =
        SDAP has a strong dependency upon a number of high profile and smaller
        profile Apache products. Examples can be seen in the breakdown of
        External Dependencies. As we continue to grow SDAP within the
        Incubator, we will make efforts to share community stories, software
        advancements and possible improvements in our use of our Apache
        dependencies back to those project communities.

        = Developers =

        The SDAP project, and hence its developers, is currently funded through
        a NASA AIST follow-on grant with funding secured for the next ~18
        months. There are currently no 100% time dedicated developers; however,
        the same core team that does the work currently will continue to work
        on the project throughout the current funding period and after. There
        is currently no business strategy aligned with SDAP; however, it is
        perceived that future, as yet unsecured funding may be directed to
        further feature advancement and project evangelism.

        = Documentation =

        Documentation is currently available in a number of locations e.g.
        Github wiki, Github pages, etc. with each repository under the
        oceanworks-aist Github Org maintaining documentation available through
        wikis attached to the repositories. Additionally, most of the SDAP
        sub-projects have been extensively documented within a plethora of
        formal academic publications across several academic communities. It
        would be our intention, certainly at least, to unify the Github wiki
        and Github pages documentation, most likely to make up the
        sdap.apache.org Website content.

        = Initial Source =
        Current source resides in several locations on Github and Bitbucket:
         * https://github.com/dataplumber/nexus (NEXUS, OceanXtremes, DOMS)
         * https://github.com/dataplumber/edge (EDGE)
         * https://github.com/aist-oceanworks/mudrod (MUDROD)
         * https://bitbucket.org/coaps_mdc/doms/src (DOMS)

        = External Dependencies =

        Each component of the Science Data Analytics Platform has its own
        dependencies. Documentation will be available for integrating them.


        == MUDROD ==
        '''Core'''
        groupId                artifactId                   version  scope    type  optional
        com.google.code.gson   gson                         2.5      compile  jar   false
        org.jdom               jdom                         2.0.2    compile  jar   false
        org.elasticsearch      elasticsearch                5.2.0    compile  jar   false
        org.elasticsearch      elasticsearch-spark-20_2.11  5.2.0    compile  jar   false
        joda-time              joda-time                    2.9.4    compile  jar   false
        com.carrotsearch       hppc                         0.7.1    compile  jar   false
        org.apache.spark       spark-core_2.11              2.1.0    compile  jar   false
        org.apache.spark       spark-sql_2.11               2.1.0    compile  jar   false
        org.apache.spark       spark-mllib_2.11             2.1.0    compile  jar   false
        org.scala-lang         scala-library                2.11.8   compile  jar   false
        org.codehaus.jettison  jettison                     1.3.8    compile  jar   false
        commons-cli            commons-cli                  1.2      compile  jar   false
        net.sf.opencsv         opencsv                      2.3      compile  jar   false
        org.apache.jena        jena-core                    3.3.0    compile  jar   false
        junit                  junit                        4.12     test     jar   false

        '''Service'''
        groupId               artifactId         version         scope     type  optional
        gov.nasa.jpl.mudrod   mudrod-core        0.0.1-SNAPSHOT  compile   jar   false
        javax.servlet         javax.servlet-api  3.1.0           provided  jar   false
        com.google.code.gson  gson               2.5             compile   jar   false

        '''Web'''
         * AngularJS - MIT License
         * BootstrapJS - MIT License
         * jQueryJS - MIT License
         * Underscore JS - MIT License

        == DOMS ==
         * Apache Solr version 5.5.1 http://lucene.apache.org/solr/
         * EDGE https://github.com/dataplumber/edge
         * NetCDF4 http://unidata.github.io/netcdf4-python/
         * Python 3.5 (NOTE: only partial support for py2.7)

        Non stdlib Python dependencies:
         * Jinja2==2.9.5
         * python-dateutil==2.6.0
         * cython==0.25.2
         * numpy==1.12.0
         * scipy==0.18.1
         * netCDF4==1.2.7
         * solrpy3
         * siphon==0.4.0
         * neo4j-driver==1.1.0
         * matplotlib==2.0.0
         * requests==2.13.0
         * shapely==1.5.17
         * flask==0.12
         * networkx==1.11
         * pyproj==1.9.5.1
         * blist==1.3.6

        == NEXUS ==
        '''Analysis'''
         * https://github.com/dataplumber/nexus/blob/master/analysis/package-list.txt
         * https://github.com/dataplumber/nexus/blob/master/analysis/requirements.txt

        '''Client'''
         * https://github.com/dataplumber/nexus/blob/master/client/requirements.txt

        '''Climatology'''
         * matplotlib
         * numpy
         * netCDF4
         * pathos (https://pypi.python.org/pypi/pathos)

        '''Data-access'''
         * https://github.com/dataplumber/nexus/blob/master/data-access/requirements.txt

        '''Nexus-ingest'''
        ''Dataset-tiler''
         * https://github.com/dataplumber/nexus/tree/master/nexus-ingest/dataset-tiler/build/reports

        ''developer-box''
         * Just a collection of scripts/vagrant file used to stand up a
        developer instance of nexus ingestion. No dependencies to report

        ''Groovy-scripts''
         * Collection of Groovy scripts that can be used as part of data
        ingestion. They only rely on the standard Groovy library and the
        ‘nexus-messages’ project

        ''Nexus-messages''
         * https://github.com/dataplumber/nexus/tree/master/nexus-ingest/nexus-messages/build/reports

        ''nexus-sink''
         * https://github.com/dataplumber/nexus/tree/master/nexus-ingest/nexus-sink/build/reports

        ''nexus-xd-python-modules''
         * https://github.com/dataplumber/nexus/blob/master/nexus-ingest/nexus-xd-python-modules/package-list.txt
         * https://github.com/dataplumber/nexus/blob/master/nexus-ingest/nexus-xd-python-modules/requirements.txt

        ''spring-xd-python''
         * only python standard libraries are used

        ''tcp-shell''
         * https://github.com/dataplumber/nexus/tree/master/nexus-ingest/tcp-shell/build/reports

        '''tools/deletebyquery'''
         * https://github.com/dataplumber/nexus/blob/master/tools/deletebyquery/requirements.txt

        = Required Resources =
        Mailing Lists
         * priv...@sdap.incubator.apache.org
         * d...@sdap.incubator.apache.org
         * comm...@sdap.incubator.apache.org

        Git Repos
         * https://git-wip-us.apache.org/repos/asf/incubator-nexus.git
         * https://git-wip-us.apache.org/repos/asf/incubator-doms.git
         * https://git-wip-us.apache.org/repos/asf/incubator-mudrod.git

        Issue Tracking
         * JIRA Science Data Analytics Platform (SDAP)

        Continuous Integration
         * Jenkins builds on https://builds.apache.org/

        Web
         * http://sdap.incubator.apache.org/
         * wiki at http://cwiki.apache.org

        = Initial Committers =
        The following is a list of the planned initial Apache committers (the
        active subset of the committers for the current repository on Github).
         * Lewis John McGibbney (lewi...@apache.org)
         * Vardis M. Tsontos (vardis.m.tson...@jpl.nasa.gov)
         * Joseph C. Jacob (joseph.c.ja...@jpl.nasa.gov)
         * Ed Armstrong (edward.m.armstr...@jpl.nasa.gov)
         * Frank Greguska (gregu...@jpl.nasa.gov)
         * Brian Wilson (brian.wil...@jpl.nasa.gov)
         * Chaowe Phil Yang (cya...@gmu.edu)
         * Yongyao Jiang (yjia...@gmu.edu)
         * Yun Li (yl...@gmu.edu)
         * Shawn R. Smith (sm...@coaps.fsu.edu)
         * Jocelyn Elya (je...@coaps.fsu.edu)
         * Mark Bourassa (boura...@coaps.fsu.edu)
         * Thomas Cram (tc...@ucar.edu)
         * Thomas Huang (thomas.hu...@jpl.nasa.gov)
         * Steven Worley (wor...@ucar.edu)
         * Zaihua Ji (z...@ucar.edu)

        = Affiliations =
        NASA JPL
         * Lewis John McGibbney (lewi...@apache.org)
         * Vardis M. Tsontos (vardis.m.tson...@jpl.nasa.gov)
         * Joseph C. Jacob (joseph.c.ja...@jpl.nasa.gov)
         * Ed Armstrong (edward.m.armstr...@jpl.nasa.gov)
         * Frank Greguska (gregu...@jpl.nasa.gov)
         * Thomas Huang (thomas.hu...@jpl.nasa.gov)
         * Brian Wilson (brian.wil...@jpl.nasa.gov)

        George Mason University
         * Chaowe Phil Yang (cya...@gmu.edu)
         * Yongyao Jiang (yjia...@gmu.edu)
         * Yun Li (yl...@gmu.edu)

        Center for Ocean-Atmospheric Prediction Studies, Florida State University
         * Shawn R. Smith (sm...@coaps.fsu.edu)
         * Jocelyn Elya (je...@coaps.fsu.edu)
         * Mark Bourassa (boura...@coaps.fsu.edu)

        Computational Information Systems Laboratory (CISL) / National Center for
        Atmospheric Research (NCAR)
         * Thomas Cram (tc...@ucar.edu)
         * Zaihua Ji (z...@ucar.edu)
         * Steven Worley (wor...@ucar.edu)

        = Sponsors =

        = Champion =
        * Lewis McGibbney (NASA/JPL)

        = Nominated Mentors =
         * TBD
         * TBD
         * TBD

        = Sponsoring Entity =
        The Apache Incubator


        --
        http://home.apache.org/~lewismc/
        @hectorMcSpector
        http://www.linkedin.com/in/lmcgibbney






---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



--
My introduction https://youtu.be/Ln4vly5sxYU
