Re: [PROPOSAL] Apache DataSketches

leerho Fri, 22 Feb 2019 21:46:42 -0800

I didn't realize that this mail list does not accept PDF files, apparently
only text.  So let me try one more time ... :)  Please let me know if
this works!

= Apache DataSketches Proposal[1] =

== Abstract ==

DataSketches.GitHub.io is an open source, high-performance library of
stochastic streaming algorithms commonly called "sketches" in the data
sciences. Sketches are small, stateful programs that process massive data
as a stream and can provide approximate answers, with mathematical
guarantees, to computationally difficult queries orders-of-magnitude faster
than traditional, exact methods.

This proposal is to move DataSketches to the Apache Software
Foundation(ASF) transferring ownership of its copyright intellectual
property to the ASF. Thereafter, DataSketches would be officially known as
Apache DataSketches and its evolution and governance would come under the
rules and guidance of the ASF.

== Introduction ==

The DataSketches library contains carefully crafted implementations of
sketch algorithms that meet rigorous standards of quality and performance
and provide capabilities required for large-scale production systems that
must process and analyze massive data. The DataSketches core repository is
written in Java with a parallel core repository written in C++ that
includes Python wrappers. The DataSketches library also includes special
repositories for extending the core library for Apache Hive and Apache Pig.
The sketches developed in the different languages share a common binary
storage format so that sketches created and stored in Java, for example,
can be fully used in C++, and visa versa. Because the stored sketch
"images" are just a "blob" of bytes (similar to picture images), they can
be shared across many different systems, languages and platforms.

The DataSketches documentation website, https://datasketches.github.io ,
includes general tutorials, a comprehensive research section with
references to relevant academic papers, extensive examples for using the
core library directly as well as examples for accessing the library in
Hive, Pig, and Apache Spark.

The DataSketches library also includes a characterization repository for
long running test programs that are used for studying accuracy and
performance of these sketches over wide ranges of input variables. The data
produced by these programs is used for generating the many performance
plots contained in the documentation website and for academic
publications.

The code repositories used for production are versioned and published to
Maven Central on periodic intervals as the library evolves.

The DataSketches library also includes several experimental repositories
for use-cases outside the large-scale systems environments, such as
sketches for mobile, IoT devices (Android), command-line access of the
sketch library, and an experimental repository for vector-based sketches
that performs approximate Singular Value Decomposition (SVD) analysis that
could potentially be used in Machine Learning (ML) applications.

== Background ==

The DataSketches library was started in 2012 as internal Yahoo project to
dramatically reduce time and resources required for distinct (unique)
counting. An extensive search on the Internet at the time yielded a number
of theoretical papers on stochastic streaming algorithms with pseudocode
examples, but we did not find any usable open-source code of the quality we
felt we needed for our internal production systems. So we started a small
project (one person) to develop our own sketches working directly from
published theoretical papers.

The DataSketches library was designed from the start with the objective of
making these algorithms, usually only described in theoretical papers,
easily accessible to systems developers for use in our internal production
systems. By necessity, the code had to be of the highest quality and
thoroughly tested. The wide variety of our internal production systems
drove the requirement that the sketch implementations had to have an
absolute minimum of external, run-time dependencies in order to simplify
integration and troubleshooting.

Our internal experiments demonstrated dramatic positive impact on the
performance of our systems. As a result, the DataSketches library quickly
evolved to include different types of sketches for different types of
queries, such as frequent-items (a.k.a, heavy-hitters) algorithms,
quantile/histogram algorithms, and weighted and unweighted sampling
algorithms.

We quickly discovered that developing these sketch algorithms to be truly
robust in production environments is quite difficult and requires deep
understanding of the underlying mathematics and statistics as well as
extensive experience in developing high quality code for 24/7 production
systems. This is a difficult combination of skills for any one organization
to collect and maintain over time. It became clear that this technology
needed a community larger than Yahoo to evolve. In November, 2015, this
factor, along with Yahoo’s strong experience and support of open source,
led to the decision to open source this technology under an Apache 2.0
license on GitHub. Since that time our community has expanded considerably
and the key contributors to this effort includes leading research
scientists from a number of universities as well as practitioners and
researchers from a number of major corporations. The core of this group is
very active as we meet weekly to discuss research directions and
engineering priorities.

It is important to note that our internal systems at Yahoo use the current
public GitHub open source DataSketches library and not an internal version
of the code.

The close collaboration of scientific research and engineering development
experience with actual massive-data processing systems has also produced
new research publications in the field of stochastic streaming algorithms,
for example:

* Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes, and
Justin Thaler. A high-performance algorithm for identifying frequent items
in data streams. In ACM IMC 2017.

* Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
framework for estimating stream expression cardinalities. In *EDBT/ICDT
Proceedings ‘16 *, pages 6:1–6:17, 2016.

* Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings ‘16,
pages 845-854, 2016.

* Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
approximation in streams. In IEEE FOCS Proceedings ‘16, pages 71–78, 2016.

* Kevin J Lang. Back to the future: an even more nearly optimal cardinality
estimation algorithm. arXiv preprint https://arxiv.org/abs/1708.06839, 2017.

* Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD
Proceedings ‘13, pages 581– 588, 2013.

* Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman.
Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings
‘16, pages 441–454, 2016.

* Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical
heavy hitters with the space saving algorithm. In SIAM ALENEX Proceedings
‘12, pages 160–174, 2012.

== The Rationale for Sketches ==

In the analysis of big data there are often problem queries that don’t
scale because they require huge compute resources and time to generate
exact results. Examples include count distinct, quantiles, most frequent
items, joins, matrix computations, and graph analysis.

If we can loosen the requirement of “exact” results from our queries and be
satisfied with approximate results, within some well understood bounds of
error, there is an entire branch of mathematics and data science that has
evolved around developing algorithms that can produce approximate results
with mathematically well-defined error properties.

With the additional requirements that these algorithms must be small
(compared to the size of the input data), sublinear (the size of the sketch
must grow at a slower rate than the size of the input stream), streaming
(they can only touch each data item once), and mergeable (suitable for
distributed processing), defines a class of algorithms that can be
described as small, stochastic, streaming, sublinear mergeable algorithms,
commonly called sketches (they also have other names, but we will use the
term sketches from here on).

To be truly streaming and be able to process data in a single pass,
sketches must make absolute minimum assumptions about the input stream.
This is critically important, as there is no “second chance” to process the
data.

For example, sketches should not make assumptions about the order of stream
items, the stream length, the dynamic range of values, or the distribution
of item occurrence frequencies. Sketches should be tolerant of NaNs, Nulls
and empty objects. About the only thing that the sketch needs to know about
the stream is how to extract items from it and what type the item is, e.g.,
is it a numeric value or a string.

As far as the sketch is concerned, the input stream is a sequence of items
in some unknown random order with unknown random values.

The sketch is essentially a complex state machine and combined with the
random input stream defines a stochastic process. We then apply
probabilistic methods to interpret the states of the stochastic process in
order to extract useful information about the input stream itself. The
resulting information will be approximate, but we also use additional
probabilistic methods to extract an estimate of the likely probability
distribution of error.

There is a significant scientific contribution here that is defining the
state machine, understanding the resulting stochastic process, developing
the probabilistic methods, and proving mathematically, that it all works!
This is why the scientific contributors to this project are a critical and
strategic component to our success. The development engineers translate
the concepts of the proposed state machine and probabilistic methods into
production-quality code. Even more important, they work closely with the
scientists, feeding back system and user requirements, which leads not only
to superior product design, but to new science as well. A number of
scientific papers our members have published (see above) is a direct result
of this close collaboration.

Because sketches are small they can be processed extremely fast, often many
orders-of-magnitude faster than traditional exact computations. For
interactive queries there may not be other viable alternatives, and in the
case of real-time analysis, sketches are the only known solution.

For any system that needs to extract useful information from massive data
sketches are essential tools that should be tightly integrated into the
system’s analysis capabilities. This technology has helped Yahoo
successfully reduce data processing times from days to hours or minutes on
a number of its internal platforms and has enabled subsecond queries on
real-time platforms that would have been infeasible without sketches.
The Rationale for Apache DataSketches
Other open source implementations of sketch algorithms can be found on the
Internet. However, we have not yet found any open source implementations
that are as comprehensive, engineered with the quality required for
production systems, and with usable and guaranteed error properties. Large
Internet companies, such as Google and Facebook, have published papers on
sketching, however, their implementations of their published algorithms are
proprietary and not available as open source.

The DataSketches library already provides integrations with a number of
major Apache data processing platforms such as Apache Hive, Apache Pig,
Apache Spark and Apache Druid, and is also integrated with a number of
other open source data processing platforms such as Splice Machine, GCHQ
Gaffer and PostgreSQL.

We believe that having DataSketches as an Apache project will provide an
immediate, worthwhile, and substantial contribution to the open source
community, will have a better opportunity to provide a meaningful
contribution to both the science and engineering of sketching algorithms,
and integrate with other Apache projects. In addition, this is a
significant opportunity for Apache to be the "go-to" destination for users
that want to leverage this exciting technology.

== Initial Goals ==

We are breaking our initial goals into short-term (2-6 months) and
intermediate to long-term ( 6 months to 2 years):

Our short-term goals include:

* Understanding and adapting to the Apache development process and
structures.

* Start refactoring codebase and move various DataSketches repositories
code to Apache Git repository.

* Continue development of new features, functions, and fixes.

* Specific sub-projects (e.g., C++ and Python) will continue to be
developed and expanded.

The intermediate to long term goals include:

* Completing the design and implementation of the C++ sketches to
complement what is already available in Java, and the Python wrappers of
those C++ sketches.

* Expanding the C++ build framework to include Windows and the popular
Linux variants.

* Continued engagement with the scientific research community on the
development of new algorithms for computationally difficult problems that
heretofore have not had a sketching solution.

== Current Status ==

The DataSketches GitHub project has been quite successful. As of this
writing (Feb, 2019) the number of downloads measured by the Nexus
Repository Manager at https://oss.sonatype.org has grown by nearly a factor
of 10 over the past year to about 55 thousand per month. The
DataSketches/sketches-core repository has about 560 stars and 141 forks,
which is pretty good for a highly specialized library.

=== Development Practices ===

==== Source Control ====

All of our developers have extensive experience with Git version control
and follow accepted practices for use of Pull Requests (PRs), code reviews
and commits to master, for example.

==== Testing ====

Sketches, by their nature are probabilistic programs and don’t necessarily
behave deterministically. For some of the sketches we intentionally insert
random noise into the code as this gives us the mathematical properties
that we need to guarantee accuracy. This can make the behavior of these
algorithms quite unintuitive and provides significant challenges to the
developer who wishes to test these algorithms for correctness. As a result,
our testing strategy includes two major components: unit tests, and
characterization tests.

===== Unit Testing =====

Our unit tests are primarily quick tests to make sure that we exercise all
critical paths in the code and that key branches are executed correctly. It
is important that they execute relatively fast as they are generally run on
every code build. The sketches-core repository alone has about 22 thousand
statements, over 1300 unit tests and code coverage of about 98.2% as
measured by Atlassian/Clover. It is our goal for all of our code
repositories that are used in production that they have code coverage
greater than 90%.

===== Characterization Testing =====

In order to test the probabilistic methods that are used to interpret the
stochastic behaviors of our sketches we have a separate characterization
repository that is dedicated to this. To measure accuracy, for example,
requires running thousands of trials at each of many different points along
the domain axis. Each trial compares its estimated results against a known
exact result producing an error for that trial. These error measurements
are then fed into our Quantiles sketch to capture the actual distribution
of error at that point along the axis. We then select quantile contours
across all the distributions at points along the axis. These contours can
then be plotted to reveal the shape of the actual error distribution. These
distributions are not at all Gaussian, in fact they can be quite complex.
Nonetheless, these distributions are then checked against our statistical
guarantees inherent to the specific sketch algorithm and its parameters.
There are many examples of these characterization error distributions on
our website. The runtimes of these tests can be very long and can range
from many minutes to hours, and some can run for days. Currently, we have
separate characterization repositories for Java and C++ / Python.

It is our goal that we perform this characterization analysis for all of
our sketches. By definition, the code that runs these characterization
tests is open-source so others can run these tests as well. We do not have
formal releases of this code (because it is not production code) and it is
not published to Maven Central.

=== Meritocracy ===

DataSketches was initially developed based on requirements within Yahoo. As
a project on GitHub, DataSketches has received contributions from numerous
individual developers from around the world, dedicated research work from
senior scientists at Amazon and Visa, and academic researchers from
Georgetown University, Princeton, and MIT.

As a project under incubation, we are committed to expanding our effort to
build an environment which supports a meritocracy. We are focused on
engaging the community and other related projects for support and
contributions. Moreover, we are committed to ensure contributors and
committers to DataSketches come from a broad mix of organizations through a
merit-based decision process during incubation. We believe strongly in the
DataSketches premise that fulfills the concept of a well engineered and
scientifically rigorous library that implements these powerful algorithms
and are committed to growing an inclusive community of DataSketches
contributors and users.

=== Community ===

Yahoo has a long history and active engagement in the Open Source
community. Major projects include: Vespa.ai, Bullet, Moloch, Panoptes,
Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark, gifshot,
fluxible, as well as the creation, contribution and incubation of many
Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie, Zookeeper,
Omid, Pulsar, Traffic Server, Storm, Druid, and many more.

Every day, DataSketches is actively used by a organizations and
institutions around the world for batch and stream processing of data. We
believe acceptance will allow us to consolidate existing
DataSketches-related work, grow the DataSketches community, and deepen
connections between DataSketches and other open source projects.

=== Introduction to the Core Developers & Contributors ===

The core developers and contributors for DataSketches are from diverse
backgrounds, but primarily are scientists that love engineering and
engineers that love science. A large part of the value we bring comes from
this synthesis. These individuals have already contributed substantially
to the code, algorithms, and/or mathematical proofs that form the basis of
the library.

This core group also form the Initial Committers with write permissions to
the repository. Those marked with (*) Meet weekly to plan the research and
engineering direction of the project.

==== Scientists That Love Engineering ====

* Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests:
distributed systems, scalable systems and platforms for big data
processing, concurrent algorithms and data structures,

* Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs, Sunnyvale,
California. Interests: algorithms, theoretical and applied mathematics,
encoding and compression theory, theoretical and applied performance
optimization.

* Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo Alto,
California. Manages the algorithms group at Amazon AI. We build scalable
machine learning systems and algorithms which are used both internally and
externally by customers of SageMaker, AWS's flagship machine learning
platform.

* Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests:
Computational advertising, machine learning, speech recognition,
data-driven analysis, large scale experimentation, big data, stream/complex
event processing

* Justin Thaler: (*) Assistant Professor, Department of Computer Science,
Georgetown University, Washington D.C. Interests: algorithms and
computational complexity, complexity theory, quantum algorithms, private
data analysis, and learning theory, developing efficient streaming and
sketching algorithms

==== Engineers That Love Science ====

* Roman Leventov: Senior Software Engineer, Metamarkets / Snap. Interests:
design and implementation of data storing and data processing (distributed)
systems, performance optimization, CPU performance, mechanical sympathy,
JVM performance, API design, databases, (concurrent) data structures,
memory management, garbage collection algorithms, language design and
runtimes (their tradeoffs), distributed systems (cloud) efficiency, Linux,
code quality, code transformation, pure functional programming models,
Haskell.

* Lee Rhodes: (*) Distinguished Architect, lead developer and founder of
the DataSketches project, Yahoo, Sunnyvale, California. Interests:
streaming algorithms, mathematics, computer science, high quality and high
performance code for the analysis of massive data, bridging the divide
between theory and practice.

* Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale,
California. Interests: applied mathematics, computer science, big data,
distributed systems.

=== Introduction to Additional Interested Contributors ===

These folks have been intermittently involved and contributed, but are
strong supporters of this project.

* Frank Grimes: GitHub ID: frankgrimes97

* Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer Science,
Univ of Utah. Interests: Machine Learning, Data Mining, matrix
approximation, streaming algorithms, randomized linear algebra.

* Christopher Musco: [christopher.musco at gmail dot com] Ph.D. Computer
Science, Research Instructor, Princeton University. Interests: algorithmic
foundations of data science and machine learning, efficient methods for
processing and understanding large datasets, often working at the
intersection of theoretical computer science, numerical linear algebra, and
optimization.

* Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer Science,
Professor, Warwick University, Warwick, England. Interests: all aspects of
the "data lifecycle", from data collection and cleaning, through mining and
analytics. (Professor Cormode is one of the world’s leading scientists in
sketching algorithms)

=== Alignment ===

The DataSketches library already provides integrations and example code for
Apache Hive, Apache Pig, Apache Spark and is deeply integrated into Apache
Druid.

== Known Risks ==

The following subsections are specific risks that have been identified by
the ASF that need to be addressed.

=== Risk: Orphaned Products ===

The DataSketches library is presently used by a number of organizations,
from small startups to Fortune 100 companies, to construct production
pipelines that must process and analyze massive data. Yahoo has a long-term
commitment to continue to advance the DataSketches library; moreover,
DataSketches is seeing increasing interest, development, and adoption from
many diverse organizations from around the world. Due to its growing
adoption, we feel it is quite unlikely that this project would become
orphaned.

=== Risk: Inexperience with Open Source ===

Yahoo believes strongly in open source and the exchange of information to
advance new ideas and work. Examples of this commitment are active open
source projects such as those mentioned above. With DataSketches, we have
been increasingly open and forward-looking; we have published a number of
papers about breakthrough developments in the science of streaming
algorithms (mentioned above) that also reference the DataSketches library.
Our submission to the Apache Software Foundation is a logical extension of
our commitment to open source software.

Key committers at Yahoo with strong open source backgrounds include Aaron
Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky, Andrews
Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call, Daryn
Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar Hillel,
Ethan Li, Fei Deng, Francis Christopher Liu, Francisco Perez-Sorrosal, Gil
Yehuda. Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James Penick,
Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles, Kihwal
Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski,
Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L. Natkovich,
Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo, Ryan
Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan, Sri
Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.

All of our core developers are committed to learn about the Apache process
and to give back to the community.

=== Risk: Homogeneous Developers ===

The majority of committers in this proposal belong to Yahoo due to the fact
that DataSketches has emerged from an internal Yahoo project. This proposal
also includes developers and contributors from other companies, and who are
actively involved with other Apache projects, such as Druid. We expect our
entry into incubation will allow us to expand the number of individuals and
organizations participating in DataSketches development.

=== Risk: Reliance on Salaried Developers ===

Because the DataSketches library originated within Yahoo, it has been
developed primarily by salaried Yahoo developers and we expect that to
continue to be the case near term. However, since we placed this library
into open-source we have had a number of significant contributions from
engineers and scientists from outside of Yahoo. We expect our reliance on
Yahoo salaried developers will decrease over time. Nonetheless, Yahoo is
committed to continue its strong support of this important project.

=== Risk: Lack of Relationship to other Apache Products ===

DataSketches already directly interoperates with or utilizes several
existing Apache projects.

* Build
* Apache Maven

* Integrations and adaptors for the following projects naturally have them
as dependencies
* Apache Hive
* Apache Pig
* Apache Druid
* Apache Spark

* Additional dependencies for the above integrations and adaptors include
* Apache Hadoop
* Apache Commons (Math)

There is no other Apache project that we are aware of that duplicates the
functionality of the DataSketches library.

=== Risk: An Excessive Fascination with the Apache Brand ===

With this proposal we are not seeking attention or publicity. Rather, we
firmly believe in the DataSketches library and concept and the ability to
make the DataSketches library a powerful, yet simple-to-use toolkit for
data processing. While the DataSketches library has been open source, we
believe putting code on GitHub can only go so far. We see the Apache
community, processes, and mission as critical for ensuring the DataSketches
library is truly community-driven, positively impactful, and innovative
open source software. While Yahoo has taken a number of steps to advance
its various open source projects, we believe the DataSketches library
project is a great fit for the Apache Software Foundation due to its focus
on data processing and its relationships to existing ASF projects.

=== Risk: Cryptography ===

DataSketches does not contain any cryptographic code and is not a
cryptographic product.

== Documentation ==

The following documentation is relevant to this proposal. Relevant portions
of the documentation will be contributed to the Apache DataSketches project.

* DataSketches website: https://datasketches.github.io.

* DataSketches website repository:
https://github.com/DataSketches/DataSketches.github.io

We will need an apache website for this documentation similar to

* https://datasketches.apache.org

== Initial Source ==

The initial source for DataSketches which we will submit to the Apache
Foundation will include a number of repositories which are currently hosted
under the GitHub.com/datasketches organization:

All github.com/datasketches repositories including:

* Java
* sketches-core: This repository has the core sketching classes, which
are leveraged by some of the other repositories. This repository has no
external dependencies outside of the DataSketches/memory repository, Java
and TestNG for unit tests. This code is versioned and the latest release
can be obtained from Maven Central.
* memory: Low level, high-performance memory data-structure management
primarily for off-heap.
* sketches-android: This is a new repository dedicated to sketches
designed to be run in a mobile client, such as a cell phone. It is still in
development and should be considered experimental.
* sketches-hive: This repository contains Hive UDFs and UDAFs for use
within Hadoop grid environments. This code has dependencies on
sketches-core as well as Hadoop and Hive. Users of this code are advised to
use Maven to bring in all the required dependencies. This code is versioned
and the latest release can be obtained from Maven Central.
* sketches-pig: This repository contains Pig User Defined Functions
(UDF) for use within Hadoop grid environments. This code has dependencies
on sketches-core as well as Hadoop and Pig. Users of this code are advised
to use Maven to bring in all the required dependencies. This code is
versioned and the latest release can be obtained from Maven Central.
* sketches-vector: This is a new repository dedicated to sketches for
vector and matrix operations. It is still somewhat experimental.
* characterization: This relatively new repository is for code that we
use to characterize the accuracy and speed performance of the sketches in
the library and is constantly being updated. Examples of the job command
files used for various tests can be found in the src/main/resources
directory. Some of these tests can run for hours depending on its
configuration.
* experimental: This repository is an experimental staging area for code
that will eventually end up in another repository. This code is not
versioned and not registered with Maven Central.
* sketches-misc: Demos and other code not related to production
deployment

* C++ and Python
* sketches-core-cpp: This is the C++/Python companion to the Java
sketches-core. These implementations are binary compatible with their
counterparts in Java. In other words, a sketch created and stored in C++
can be opened and read in Java and visa-versa. This site also has our
Python adaptors that basically wrap the C++ implementations, making the
high performance C++ implementations available from Python.
* sketches-postgres: This site provides the postgres-specific adaptors
that wrap the C++ implementations making them available to the Postgres
database users.
* characterization-cpp: This is the C++/Python companion to the Java
characterization repository.
* experimental-cpp: This repository is an experimental staging area for
C++ code that will eventually end up in another repository.

* Command-Line Tools
* sketches-cmd
* homebrew-sketches
* homebrew-sketches-cmd

These projects have always been Apache 2.0 licensed. We intend to bundle
all of these repositories since they are all complementary and should be
maintained in one project. Prior to our submission, we will combine all of
these projects into a new git repository.

== Source and Intellectual Property Submission Plan ==

Contributors to the DataSketches project have also signed the Yahoo
Individual Contributor License Agreement (https://yahoocla.herokuapp.com/
in order to contribute to the project.

With respect to trademark rights, Yahoo does not hold a trademark on the
phrase “DataSketches.” Based on feedback and guidance we receive during the
incubation process, we are open to renaming the project if necessary for
trademark or other concerns, but we would prefer not to have to do that.

== External Dependencies ==

All external dependencies are licensed under an Apache 2.0 or
Apache-compatible license. As we grow the DataSketches community we will
configure our build process to require and validate all contributions and
dependencies are licensed under the Apache 2.0 license or are under an
Apache-compatible license.

== Required Resources ==

=== Mailing Lists ===

We currently use a mix of mailing lists. We will migrate our existing
mailing lists to the following:

* d...@datasketches.incubator.apache.org

* u...@datasketches.incubator.apache.org

* priv...@datasketches.incubator.apache.org

* comm...@datasketches.incubator.apache.org

=== Source Control ===

The DataSketches team currently uses Git and would like to continue to do
so. We request a Git repository for DataSketches with mirroring to GitHub
enabled similar the following:

* https://github.com/apache/incubator-datasketches.git

=== Issue Tracking ===

We request the creation of an Apache-hosted JIRA. The DataSketches project
is currently using the public GitHub issue tracker and the public Google
Groups forum/sketches-user for issue tracking and discussions. We will
migrate and combine from these two sources to the Apache JIRA.

Proposed Jira ID: DATASKETCHES

== Initial Committers ==

The following list of individuals have been extremely active in our
community and should have write (commit) permissions to the repository.

* Eshcar Hillel [eshcar at verizonmedia dot com]

* Kevin Lang [langk at verizonmedia dot com]

* Roman Leventov [roman.leventov at c.metamarkets dot com]

* Edo Liberty [libertye at amazon dot com]

* Jon Malkin [jmalkin at verizonmedia dot com]

* Lee Rhodes [lrhodes at verizonmedia dot com] & [leerho
at gmail dot com]

* Alexander Saydakov [saydakov at verizonmedia dot com]

* Justin Thaler [justin.thaler at georgetown dot edu]

== Affiliations ==

The initial committers are from four organizations: Yahoo, Amazon,
Georgetown University, and Metamarkets/Snap.

=== Champion ===
(Recommended to me: )

Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
dot org]
Jean-Baptiste Onofré,[[jb at nanthrax dot net]

=== Nominated Mentors ===
(Recommended to me: )

Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
dot org]
Jean-Baptiste Onofré, jb at nanthrax dot net
Gil Yehuda, gyehuda at verizonmedia dot com

=== Sponsoring Entity ===

* The Apache Incubator **** This is our 1st choice ****

* Apache Druid. The incubating Apache Druid project might also be a logical
sponsor. However, DataSketches has applications in many areas of computing
outside of Druid so our preference and recommendation is that DataSketches
would ultimately be a top-level Apache project.

________________
[1] In 2017 Verizon acquired Yahoo and merged it with previously acquired
AOL. The merged entity was originally called Oath, Inc., but has recently
been renamed Verizon Media, Inc., a wholly-owned subsidiary of Verizon,
Inc. Since Yahoo is the more recognized name, references in this document
to Yahoo, are also a reference to Verizon Media, Inc.

On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <k...@apache.org> wrote:

> The subject line has me interested already. Follow examples like this
> maybe?
>
> 1.
>
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> 2.
>
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
>
> Kenn
>
> On Fri, Feb 22, 2019 at 8:05 PM leerho <lee...@gmail.com> wrote:
>
> > I'll try again ... :)
> >
> > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunn...@gmail.com>
> wrote:
> >
> >> It didn't make it again
> >>
> >> On Fri, Feb 22, 2019, 8:35 PM leerho <lee...@gmail.com> wrote:
> >>
> >> > I'm not sure the attached document made it through.
> >> >
> >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <lee...@gmail.com> wrote:
> >> >
> >> > >
> >> > >
> >> >
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org
>

Re: [PROPOSAL] Apache DataSketches

Reply via email to