Interesting stuff! On Thu, Oct 12, 2017 at 8:00 PM, Huang, Thomas (398G) < thomas.hu...@jpl.nasa.gov> wrote:
> Thanks Chris. We are learning from the master ;-) > > Big thanks to Lewis for getting us here! > > > Thomas. > > -- > Thomas Huang > Jet Propulsion Laboratory > 4800 Oak Grove Drive, Mail Stop 158-242, Pasadena, CA 91109 > Phone: 818.354.2747, Email: thomas.hu...@jpl.nasa.gov > > DISCLAIMER: All personal and professional opinions presented herein are my > own and do not, in any way, represent the opinion or policy of JPL, NASA or > Caltech. > > > On 10/12/17, 10:29 AM, "Chris Mattmann" <mattm...@apache.org> wrote: > > Very interesting project Lewis and Thomas and glad to see this coming > to the Incubator! > > > > On 10/11/17, 11:22 AM, "lewis john mcgibbney" <lewi...@apache.org> > wrote: > > Hi Folks, > I would like to open a DISCUSS thread on the topic of accepting > the Science > Data Analytics Platform (SDAP) < > https://wiki.apache.org/incubator/SDAPProposal> Project into the > Incubator. > I am CC'ing Thomas Huang from NASA JPL who I have been working > with to > build community around a kick-ass set of software projects under > the SDAP > umbrella. > At this stage we would very much appreciate critical feedback from > general@ > community. We are also open to mentors who may have an interest in > the > project proposal. > The proposal is pasted below. > Thanks in advance, > Lewis > > = Abstract = > The Science Data Analytics Platform (SDAP) establishes an > integrated data > analytic center for Big Science problems. It focuses on technology > integration, advancement and maturity. > > = Proposal = > SDAP currently represents a collaboration between NASA Jet > Propulsion > Laboratory (JPL), Florida State University (FSU), the National > Center for > Atmospheric Research (NCAR), and George Mason University (GMU). > SDAP brings > together a number of big data technologies including a NASA funded > OceanXtremes (Anomaly detection and ocean science), NEXUS (Deep > data > analytic platform), DOMS (Distributed in-situ to satellite > matchup), MUDROD > (Search relevancy and discovery) and VQSS (Virtualized Quality > Screening > Service) under a single umbrella. Within the original Incubator > proposal, > VQSS will not be included however it is anticipated that a future > source > code donation will cover VQSS. > > = Background and Rationale = > SDAP is a technology software solution currently geared to better > enable > scientists involved in advancing the study of the Earth's physical > oceanography. With increasing global temperature, warming of the > ocean, and > melting ice sheets and glaciers, the impacts can be observed from > changes > in anomalous ocean temperature and circulation patterns, to > increasing > extreme weather events and stronger/more frequent hurricanes, sea > level > rise and storm surges affecting coastlines, and may involve > drastic changes > and shifts in marine ecosystems. Ocean science communities are > relying on > data distributed through data centers such as the JPL's Physical > Oceanographic Data Active Archive Center (PO.DAAC) to conduct their > research. In typical investigations, oceanographers follow a > traditional > workflow for using datasets: search, evaluate, download, and apply > tools > and algorithms to look for trends. While this workflow has been > working > very well historically for the oceanographic community, it cannot > scale if > the research involves massive amount of data. NASA's Surface Water > and > Ocean Topography (SWOT) mission, scheduled to launch in April of > 2021, is > expected to generate over 20PB data for a nominal 3-year mission. > This will > challenge all existing NASA Earth Science data > archival/distribution > paradigms. It will no longer be feasible for Earth scientists to > download > and analyze such volumes of data. SDAP was therefore developed > primarily as > a Web-service platform for big ocean data science at the PO.DAAC > with open > source solutions used to enable fast analysis of oceanographic > data. SDAP > has been developed collaboratively between JPL, FSU, NCAR, and GMU > and is > rapidly maturing to become the generic platform for the next > generation of > big science data solutions. The platform is an orchestration of > several > previously funded NASA big ocean data solutions using cloud > technology, > which include data analysis (NEXUS), anomaly detection > (OceanXtremes), > matchup (DOMS), subsetting, discovery (MUDROD), and visualization > (VQSS). > SDAP will enable web-accessible, fast data analysis directly on > huge > scientific data archives to minimize data movement and provide > access, > including subset, only to the relevant data. > > = Science Data Analytics Platform Project Overview = > SDAP consists of several loosely coupled, independently functioning > sub-projects. The graphic below displays an overview of how these > sub-projects fuse together. N.B., although the graphic uses > terminology > relating to OceanWorks, essentially the SDAP architecture is > identical. > > {{attachment:sdap.png}} > > == OceanXtremes == > Oceanographic Data-Intensive Anomaly Detection and Analysis > Portal. An > application that allows you to view imagery and perform analysis > on sea > level rise data. > > '''Objective''' > Develop an anomaly detection system which identifies items, events > or > observations which do not conform to an expected pattern. > * Mature and test domain-specific, multi-scale anomaly and feature > detection algorithms. > * Identify unexpected correlations between key measured variables. > > Demonstrate value of technologies in this service: > * Adapted Map-Reduce data mining. > * Algorithm profiling service. > * Shared discovery and exploration search tools. > * Automatic notification of events of interest. > > == NEXUS == > NEXUS is an emerging technology developed at JPL > * A Cloud-based/Cluster-based data platform that performs scalable > handling of observational parameters analysis designed to scale > horizontally > * Leveraging high-performance indexed, temporal, and geospatial > search > solution > * Breaks data products into small chunks and stores them in a > Cloud-based > data store > > ''Data Volumes Exploding'' > * SWOT mission is coming > * File I/O is slow > > ''Scalable Store & Compute is Available'' > * NoSQL cluster databases > * Parallel compute, in-memory map-reduce > * Bring Compute to Highly-Accessible Data (using Hybrid Cloud) > > ''Pre-Chunk and Summarize Key Variables'' > * Easy statistics instantly (milliseconds) > * Harder statistics on-demand (in seconds) > * Visualize original data (layers) on a map quickly > > == DOMS == > The Distributed Oceanographic Match-Up Service > DOMS is designed to reconcile satellite and in situ datasets in > support of > NASA's Earth Science mission. The service will provide a mechanism > for > users to input a series of geospatial references for satellite > observations > and receive the in situ observations that are matched to the > satellite data > within a selectable temporal and spatial domain. DOMS includes > several > characteristic in situ and satellite observation datasets - with > an initial > focus on salinity, sea temperature, and winds. DOMS will be used > by the > marine and satellite research communities to support a range of > activities > and several use cases will be described. The service is designed > to provide > a community-accessible tool that dynamically delivers matched data > and > allows the scientist to only work with the subset of data where > the matches > exist. > > == MUDROD == > Mining and Utilizing Dataset Relevancy from Oceanographic Datasets > to > Improve Data Discovery and Access > Data discovery accuracy is a challenging topic for both Earth > science and > other domains. It is especially true for scientific data sets that > are not > as popular as Amazon or Google data. MUDROD is focused on mining > oceanic > knowledge from the PO.DAAC user log files to improve the end user > data > discovery experience at PO.DAAC. There are three steps in the > research: a) > the oceanographic semantics were extracted from three resources of > SWEET, > GCMD ontology, and the keywords used by end users for searching > PO.DAAC > datasets, b) mining the linkage among different vocabularies based > on user > data discvoery sessions, and c) build the linkage among > vocabularies based > on a comprehensive approach by considering domain de facto > standard, e.g., > SWEET and GCMD, and the knowledge mined from the log files. The > semantics > is used to improve data discovery for ranking results, navigating > among > vocabularies, and recommending data based on user searchers. > > = Current Status = > All components of SDAP were originally designed and developed > under grants > from the NASA-funded Advanced Information Systems and Technologies > (AIST) > program. The initiative to bring them the components together > under the > SDAP umbrella was granted through an AIST-funded follow-on grant > which will > run for another ~18 or so months. > Currently no projects have made official releases so outside of > community > building, this will be our primary Incubating goal. All SDAP > source code is > currently publicly available and licensed under the ALv2.0. > > = Meritocracy = > The current developers are familiar with meritocratic open source > development at Apache. The SDAP team consumes Apache products > heavily with > members being part of several Apache user communities. SDAP itself > has > critical dependencies upon Apache products. Lewis McGibbney (JPL > employee), > a Member of the ASF and V.P. of Apache Any23, Gora PMC Nutch, > Tika, OODT, > OCW, etc., is championing the effort to bring SDAP into and > through the > Apache Incubator and has been evangelizing the Apache Way to the > current > SDAP contributors such that the meritocratic process is well > understood and > followed. Apache was chosen specifically because we want to > encourage this > style of community development for the project and for it to > sustain SDAP > forward to become the generic platform for the next generation of > big > science data solutions > > = Community = > The SDAP project is a fairly new effort and our community is not > yet > fully/firmly established. Initial committers comprising the SDAP > roster > have only recently fully come together as a unified team however > there is a > large degree of synergy between constituent members at JPL, FSU, > NCAR, and > GMU. Therefore, community building and publicity continues to be a > major > thrust. With the activity and exposure regularly attained by > several > community members, we hope to grow the SDAP presence in and across > several > (scientific) forums. The SDAP technology is generating interest > within > communities such as the Earth Science Information Partnership > (ESIP), > American Geophysical Union (AGU) and plethora or science meetings > around > the globe. This in effect, we hope, will further contribute > towards the > possibility of SDAP being used across Government Agencies such as > NASA, > NOAA, USGS, EPA, DOI, etc. as well as by researchers and students > in > academic institutions around the globe. > During incubation, we will explicitly seek to increase our > adoption, with > SDAP already being featured on the agenda for several high profile > globally > significant scientific conferences and meetings. > > = Core Developers = > The current set of core developers is relatively small, including > full-time > and students from across JPL, FSU, NCAR, and GMU. Initial community > management and participation will be distributed across the entire > team, > most of which have been involved with the constituent projects for > <2 > years. > > = Alignment = > All SDAP code is licensed under Apache v2.0. > > = Known Risks = > > == Orphaned products == > There are currently no orphaned products. Each component of SDAP > has > dedicated personnel leading and participating in its ongoing > development. > Additionally, there is substantial collaboration between projects > facilitated by regular project meetings which are specific the the > initial > member entities and focused on advancing physical oceanographic > science. > > == Inexperience with Open Source == > JPL (in particular Lewis McGibbney) has been part of several > efforts to > transition to and grow projects communities at Apache e.g. Apache > OODT, > Apache Open Climate Workbench, Apache Joshua (Incubating), Apache > SensSoft > (Incubating), Apache DRAT (Incubating). Most of the code developed > under > the SDAP umbrella was and is open source prior to the Incubator > effort so > we are well familiarized with the nuances of open source software. > > = Relationships with Other Apache Products = > SDAP has strong dependency upon a number of high profile and > smaller > profile Apache products. Examples can be seen in the breakdown of > External > Dependencies. As we continue to grow SDAP within the Incubator, we > will > make efforts to share community stories, software advancements and > possible > improvements in our use of our Apache dependencies back to those > project > communities. > > = Developers = > The SDAP project and hence developers is currently funded through > a NASA > AIST follow-on grant with funding secured for the next ~18 months. > There > are currently no 100% time dedicated developers, however, the same > core > team that does work currently will continue to work on the project > throughout the next current funding period and after. There is > currently no > business strategy aligned with SDAP however it is perceived that > future, > yet unsecured funding may by directed to further feature > advancement and > project evangelism. > > = Documentation = > Documentation is currently available in a number of locations e.g. > Github > wiki, Github pages, etc. with each repository under the > oceanworks-aist > Github Org maintaining documentation available through wiki’s > attached to > the repositories. Additionally, most of the SDAP sub-projects have > been > extensively documented within plethora of formal academic > publications > across several academic communities. It would be our intention, > certainly > atleast to unify the Github wiki ad Github pages documentation > most likely > to make up the sdap.apache.org Website content. > > = Initial Source = > Current source resides in several locations Github: > * https://github.com/dataplumber/nexus (NEXUS, OceanXtremes, > DOMS) > * https://github.com/dataplumber/edge (EDGE) > * https://github.com/aist-oceanworks/mudrod (MUDROD) > * https://bitbucket.org/coaps_mdc/doms/src (DOMS) > > = External Dependencies = > Each component of the Science Data Analytics Platform has its own > dependencies. Documentation will be available for integrating them. > > == MUDROD == > '''Core''' > com.google.code.gson gson 2.5 compile > jar false > org.jdom jdom 2.0.2 compile > jar false > org.elasticsearch elasticsearch 5.2.0 compile > jar false > org.elasticsearch elasticsearch-spark-20_2.11 5.2.0 compile > jar false > joda-time joda-time 2.9.4 compile > jar false > com.carrotsearch hppc 0.7.1 compile > jar false > org.apache.spark spark-core_2.11 2.1.0 compile > jar false > org.apache.spark spark-sql_2.11 2.1.0 compile > jar false > org.apache.spark spark-mllib_2.11 2.1.0 compile > jar false > org.scala-lang scala-library 2.11.8 compile > jar false > org.codehaus.jettison jettison 1.3.8 compile > jar false > commons-cli commons-cli 1.2 compile > jar false > net.sf.opencsv opencsv 2.3 compile > jar false > org.apache.jena jena-core 3.3.0 compile > jar false > junit junit 4.12 test > jar false > > '''Service''' > gov.nasa.jpl.mudrod mudrod-core 0.0.1-SNAPSHOT compile > jar false > javax.servlet javax.servlet-api 3.1.0 provided > jar false > com.google.code.gson gson 2.5 compile > jar false > > '''Web''' > * AngularJS - MIT License > * BootstrapJS - MIT License > * jQueryJS - MIT License > * Underscore JS - MIT License > > == DOMS == > * Apache Solr version 5.5.1http://lucene.apache.org/solr/ > * EDGE https://github.com/dataplumber/edge > * NetCDF4 http://unidata.github.io/netcdf4-python/ > * Python 3.5 (NOTE: only partial support for py2.7) > > Non stdlib Python dependencies: > * Jinja2==2.9.5 > * python-dateutil==2.6.0 > * cython==0.25.2 > * numpy==1.12.0 > * scipy==0.18.1 > * netCDF4==1.2.7 > * solrpy3 > * siphon==0.4.0 > * neo4j-driver==1.1.0 > * matplotlib==2.0.0 > * requests==2.13.0 > * shapely==1.5.17 > * flask==0.12 > * networkx==1.11 > * pyproj==1.9.5.1 > * blist==1.3.6 > > == NEXUS == > '''Analysis''' > * > https://github.com/dataplumber/nexus/blob/master/ > analysis/package-list.txt > * > https://github.com/dataplumber/nexus/blob/master/ > analysis/requirements.txt > > '''Client''' > * https://github.com/dataplumber/nexus/blob/master/ > client/requirements.txt > > '''Climatology''' > * matplotlib > * numpy > * netCDF4 > * pathos (https://pypi.python.org/pypi/pathos) > > '''Data-access''' > * > https://github.com/dataplumber/nexus/blob/master/ > data-access/requirements.txt > > '''Nexus-ingest''' > ''Dataset-tiler'' > * > https://github.com/dataplumber/nexus/tree/master/ > nexus-ingest/dataset-tiler/build/reports > > ''developer-box'' > * Just a collection of scripts/vagrant file used to stand up a > developer > instance of nexus ingestion. No dependencies to report > > ''Groovy-scripts'' > * Collection of Groovy scripts that can be used as part of data > ingestion. > They only rely on the standard Groovy library and the > ‘nexus-messages’ > project > > ''Nexus-messages'' > * > https://github.com/dataplumber/nexus/tree/master/ > nexus-ingest/nexus-messages/build/reports > > ''nexus-sink'' > * > https://github.com/dataplumber/nexus/tree/master/ > nexus-ingest/nexus-sink/build/reports > > ''nexus-xd-python-modules'' > * > https://github.com/dataplumber/nexus/blob/master/ > nexus-ingest/nexus-xd-python-modules/package-list.txt > * > https://github.com/dataplumber/nexus/blob/master/ > nexus-ingest/nexus-xd-python-modules/requirements.txt > > ''spring-xd-python'' > * only python standard libraries are used > > ''tcp-shell'' > * > https://github.com/dataplumber/nexus/tree/master/ > nexus-ingest/tcp-shell/build/reports > > '''tools/deletebyquery''' > * > https://github.com/dataplumber/nexus/blob/master/ > tools/deletebyquery/requirements.txt > > = Required Resources = > Mailing Lists > * priv...@sdap.incubator.apache.org > * d...@sdap.incubator.apache.org > * comm...@sdap.incubator.apache.org > > Git Repos > * https://git-wip-us.apache.org/repos/asf/incubator-nexus.git > * https://git-wip-us.apache.org/repos/asf/incubator-doms.git > * https://git-wip-us.apache.org/repos/asf/incubator-mudrod.git > > Issue Tracking > * JIRA Science Data Analytics Platform (SDAP) > > Continuous Integration > * Jenkins builds on https://builds.apache.org/ > > Web > * http://sdap.incubator.apache.org/ > * wiki at http://cwiki.apache.org > > = Initial Committers = > The following is a list of the planned initial Apache committers > (the > active subset of the committers for the current repository on > Github). > * Lewis John McGibbney (lewi...@apache.org) > * Vardis M. Tsontos (vardis.m.tson...@jpl.nasa.gov) > * Joseph C. Jacob (joseph.c.ja...@jpl.nasa.gov) > * Ed Armstrong (edward.m.armstr...@jpl.nasa.gov) > * Frank Greguska (gregu...@jpl.nasa.gov) > * Brian Wilson (brian.wil...@jpl.nasa.gov) > * Chaowe Phil Yang (cya...@gmu.edu) > * Yongyao Jiang (yjia...@gmu.edu) > * Yun Li (yl...@gmu.edu) > * Shawn R. Smith (sm...@coaps.fsu.edu) > * Jocelyn Elya (je...@coaps.fsu.edu) > * Mark Bourassa (boura...@coaps.fsu.edu) > * Thomas Cram (tc...@ucar.edu) > * Thomas Huang (thomas.hu...@jpl.nasa.gov) > * Steven Worley (wor...@ucar.edu) > * Zaihua Ji (z...@ucar.edu) > > = Affiliations = > NASA JPL > * Lewis John McGibbney (lewi...@apache.org) > * Vardis M. Tsontos (vardis.m.tson...@jpl.nasa.gov) > * Joseph C. Jacob (joseph.c.ja...@jpl.nasa.gov) > * Ed Armstrong (edward.m.armstr...@jpl.nasa.gov) > * Frank Greguska (gregu...@jpl.nasa.gov) > * Thomas Huang (thomas.hu...@jpl.nasa.gov) > * Brian Wilson (brian.wil...@jpl.nasa.gov) > > George Mason University > * Chaowe Phil Yang (cya...@gmu.edu) > * Yongyao Jiang (yjia...@gmu.edu) > * Yun Li (yl...@gmu.edu) > > Center for Ocean-Atmospheric Prediction Studies, Florida State > University > * Shawn R. Smith (sm...@coaps.fsu.edu) > * Jocelyn Elya (je...@coaps.fsu.edu) > * Mark Bourassa (boura...@coaps.fsu.edu) > > Computational Information Systems Laboratory (CISL) / National > Center for > Atmospheric Research (NCAR) > * Thomas Cram (tc...@ucar.edu) > * Zaihua Ji (z...@ucar.edu) > * Steven Worley (wor...@ucar.edu) > > = Sponsors = > > = Champion = > * Lewis McGibbney (NASA/JPL) > > = Nominated Mentors = > * TBD > * TBD > * TBD > > = Sponsoring Entity = > The Apache Incubator > > > -- > http://home.apache.org/~lewismc/ > @hectorMcSpector > http://www.linkedin.com/in/lmcgibbney > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org > -- Tom Barber CTO Spicule LTD t...@spicule.co.uk http://spicule.co.uk @spiculeim <http://twitter.com/spiculeim> Schedule a meeting with me <http://meetme.so/spicule> GB: +44(0)5603641316 US: +18448141689 <https://leanpub.com/juju-cookbook>