Re: DataSketches Proposal - Google Docs Link

leerho Mon, 25 Feb 2019 21:44:40 -0800

Thank you!


On Mon, Feb 25, 2019 at 9:37 PM Kenneth Knowles <k...@apache.org> wrote:

> It isn't too much work, so I've done it:
> https://s.apache.org/datasketches-proposal-draft
>
> Kenn
>
> On Mon, Feb 25, 2019 at 9:31 PM leerho <lee...@gmail.com> wrote:
>
> > Yes, I thought of that.  But it’s not like I’m being overwhelmed with
> > requests to comment ... so far it has been only 3 or 4, and the requested
> > changes have been minor.  I’m assuming that if there are no more
> > substantive changes after this week that the document would be moved to
> the
> > wiki archive, where, I presume, changes could still be made.
> >
> > I want to do the right thing here, so if you feel that the document would
> > get much better feedback on an unrestricted gDoc site, I will set it up.
> >
> >
> >
> > On Mon, Feb 25, 2019 at 8:32 PM Jim Apple <jbap...@cloudera.com.invalid>
> > wrote:
> >
> > > You could use a Google account that is not under Yahoo’s control, then
> > let
> > > anyone in the world add a comment, maybe.
> > >
> > > On Mon, Feb 25, 2019 at 3:26 PM leerho <lee...@gmail.com> wrote:
> > >
> > > > Ken,
> > > > Yahoo does not allow me to create a shared link outside our company,
> > > except
> > > > to individual email addresses.  So attempting to share it to the
> email
> > > > general@incubator.apache.org may not work.  Nonetheless, several
> > > > individuals were able to request access using their individual email
> > > > accounts and I was able to add them.  I will try to add you using
> > > > k...@apache.org, but if that doesn't work, I may need a gmail or
> > > > equivalent
> > > > account for you.
> > > >
> > > > Lee.
> > > >
> > > >
> > > > On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <k...@apache.org>
> > wrote:
> > > >
> > > > > I could not access that document. I suggest you need to turn on
> link
> > > > > sharing.
> > > > >
> > > > > Kenn
> > > > >
> > > > > On Mon, Feb 25, 2019 at 12:00 PM lee...@gmail.com <
> lee...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Try this link:
> > > > > > >
> > > > > > > https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > > > > >
> > > > > >
> > > > > > On 2019/02/25 05:55:50, leerho <lee...@gmail.com> wrote:
> > > > > > > Yes I will try that tomorrow.
> > > > > > >
> > > > > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <
> k...@apache.org
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Can you share the Google doc with the proposal? Per Ted's
> > advice,
> > > > we
> > > > > > can
> > > > > > > > iterate quickly there and move it to the wiki when it
> becomes a
> > > bit
> > > > > > more
> > > > > > > > stable.
> > > > > > > >
> > > > > > > > Kenn
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 10:21 PM lee...@gmail.com <
> > > > lee...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > > Thanks for the offer.  I am a neophyte at this process and
> > > email
> > > > > > app!   I
> > > > > > > > > could use a lot of help getting this off the ground!  Also,
> > I'm
> > > > not
> > > > > > sure
> > > > > > > > > that Mr. Chen and Mr. Onofré have fully accepted taking
> this
> > on
> > > > :)
> > > > > > > > >
> > > > > > > > > Lee.
> > > > > > > > >
> > > > > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <k...@apache.org>
> > > wrote:
> > > > > > > > > > Nice.
> > > > > > > > > >
> > > > > > > > > > I would very much like to help mentor this project,
> though
> > > you
> > > > > > already
> > > > > > > > > have
> > > > > > > > > > a couple good ones.
> > > > > > > > > >
> > > > > > > > > > I concur with incubator as sponsoring entity.
> > > > > > > > > >
> > > > > > > > > > Kenn (VP Apache Beam)
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <lee...@gmail.com
> >
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I didn't realize that this mail list does not accept
> PDF
> > > > files,
> > > > > > > > > apparently
> > > > > > > > > > > only text.  So let me try one more time ... :)  Please
> > let
> > > me
> > > > > > know if
> > > > > > > > > > > this works!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > > > > >
> > > > > > > > > > > == Abstract ==
> > > > > > > > > > >
> > > > > > > > > > > DataSketches.GitHub.io is an open source,
> > high-performance
> > > > > > library
> > > > > > > > of
> > > > > > > > > > > stochastic streaming algorithms commonly called
> > "sketches"
> > > in
> > > > > the
> > > > > > > > data
> > > > > > > > > > > sciences. Sketches are small, stateful programs that
> > > process
> > > > > > massive
> > > > > > > > > data
> > > > > > > > > > > as a stream and can provide approximate answers, with
> > > > > > mathematical
> > > > > > > > > > > guarantees, to computationally difficult queries
> > > > > > orders-of-magnitude
> > > > > > > > > faster
> > > > > > > > > > > than traditional, exact methods.
> > > > > > > > > > >
> > > > > > > > > > > This proposal is to move DataSketches to the Apache
> > > Software
> > > > > > > > > > > > Foundation (ASF), transferring ownership of its copyright
> > > > > > intellectual
> > > > > > > > > > > property to the ASF.  Thereafter, DataSketches would be
> > > > > > officially
> > > > > > > > > known as
> > > > > > > > > > > Apache DataSketches and its evolution and governance
> > would
> > > > come
> > > > > > under
> > > > > > > > > the
> > > > > > > > > > > rules and guidance of the ASF.
> > > > > > > > > > >
> > > > > > > > > > > == Introduction ==
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library contains carefully crafted
> > > > > > implementations
> > > > > > > > of
> > > > > > > > > > > sketch algorithms that meet rigorous standards of
> quality
> > > and
> > > > > > > > > performance
> > > > > > > > > > > and provide capabilities required for large-scale
> > > production
> > > > > > systems
> > > > > > > > > that
> > > > > > > > > > > must process and analyze massive data. The DataSketches
> > > core
> > > > > > > > > repository is
> > > > > > > > > > > written in Java with a parallel core repository written
> > in
> > > > C++
> > > > > > that
> > > > > > > > > > > includes Python wrappers. The DataSketches library also
> > > > > includes
> > > > > > > > > special
> > > > > > > > > > > repositories for extending the core library for Apache
> > Hive
> > > > and
> > > > > > > > Apache
> > > > > > > > > Pig.
> > > > > > > > > > > The sketches developed in the different languages
> share a
> > > > > common
> > > > > > > > binary
> > > > > > > > > > > storage format so that sketches created and stored in
> > Java,
> > > > for
> > > > > > > > > example,
> > > > > > > > > > > > can be fully used in C++, and vice versa.  Because the
> > > stored
> > > > > > sketch
> > > > > > > > > > > "images" are just a "blob" of bytes (similar to picture
> > > > > images),
> > > > > > they
> > > > > > > > > can
> > > > > > > > > > > be shared across many different systems, languages and
> > > > > platforms.
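
To illustrate the "blob of bytes" idea above, here is a minimal sketch of a language-neutral binary image. The layout (two little-endian 32-bit integers followed by 64-bit floats) and the names to_bytes / from_bytes are hypothetical and chosen only for illustration; this is not the actual DataSketches storage format.

```python
import struct


def to_bytes(k, hashes):
    """Pack a toy sketch image: uint32 k, uint32 count, then count float64 values.

    All fields are little-endian with no padding, so any language that can
    read fixed-width integers and IEEE-754 doubles can decode the image.
    """
    values = sorted(hashes)
    header = struct.pack("<II", k, len(values))
    return header + struct.pack(f"<{len(values)}d", *values)


def from_bytes(blob):
    """Decode an image produced by to_bytes back into (k, values)."""
    k, count = struct.unpack_from("<II", blob, 0)
    values = struct.unpack_from(f"<{count}d", blob, 8)  # header is 8 bytes
    return k, list(values)
```

Because the byte order and field widths are fixed explicitly, a reader written in another language only needs the same layout description to reconstruct the same state.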
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches documentation website,
> > > > > > > > https://datasketches.github.io
> > > > > > > > > ,
> > > > > > > > > > > includes general tutorials, a comprehensive research
> > > section
> > > > > with
> > > > > > > > > > > references to relevant academic papers, extensive
> > examples
> > > > for
> > > > > > using
> > > > > > > > > the
> > > > > > > > > > > core library directly as well as examples for accessing
> > the
> > > > > > library
> > > > > > > > in
> > > > > > > > > > > Hive, Pig, and Apache Spark.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library also includes a
> characterization
> > > > > > repository
> > > > > > > > > for
> > > > > > > > > > > long running test programs that are used for studying
> > > > accuracy
> > > > > > and
> > > > > > > > > > > performance of these sketches over wide ranges of input
> > > > > > variables.
> > > > > > > > The
> > > > > > > > > data
> > > > > > > > > > > produced by these programs is used for generating the
> > many
> > > > > > > > performance
> > > > > > > > > > > plots contained in the documentation website and for
> > > academic
> > > > > > > > > > > publications.
> > > > > > > > > > >
> > > > > > > > > > > The code repositories used for production are versioned
> > and
> > > > > > published
> > > > > > > > > to
> > > > > > > > > > > Maven Central on periodic intervals as the library
> > evolves.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library also includes several
> > experimental
> > > > > > > > > repositories
> > > > > > > > > > > for use-cases outside the large-scale systems
> > environments,
> > > > > such
> > > > > > as
> > > > > > > > > > > sketches for mobile, IoT devices (Android),
> command-line
> > > > access
> > > > > > of
> > > > > > > > the
> > > > > > > > > > > sketch library, and an experimental repository for
> > > > vector-based
> > > > > > > > > sketches
> > > > > > > > > > > that performs approximate Singular Value Decomposition
> > > (SVD)
> > > > > > analysis
> > > > > > > > > that
> > > > > > > > > > > could potentially be used in Machine Learning (ML)
> > > > > applications.
> > > > > > > > > > >
> > > > > > > > > > > == Background ==
> > > > > > > > > > >
> > > > > > > > > > > > The DataSketches library was started in 2012 as an internal Yahoo
> > > > > > > > > > > > project to
> > > > > > > > > > > dramatically reduce time and resources required for
> > > distinct
> > > > > > (unique)
> > > > > > > > > > > counting.  An extensive search on the Internet at the
> > time
> > > > > > yielded a
> > > > > > > > > number
> > > > > > > > > > > of theoretical papers on stochastic streaming
> algorithms
> > > with
> > > > > > > > > pseudocode
> > > > > > > > > > > examples, but we did not find any usable open-source
> code
> > > of
> > > > > the
> > > > > > > > > quality we
> > > > > > > > > > > felt we needed for our internal production systems.  So
> > we
> > > > > > started a
> > > > > > > > > small
> > > > > > > > > > > project (one person) to develop our own sketches
> working
> > > > > directly
> > > > > > > > from
> > > > > > > > > > > published theoretical papers.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library was designed from the start
> with
> > > the
> > > > > > > > > objective of
> > > > > > > > > > > making these algorithms, usually only described in
> > > > theoretical
> > > > > > > > papers,
> > > > > > > > > > > easily accessible to systems developers for use in our
> > > > internal
> > > > > > > > > production
> > > > > > > > > > > systems. By necessity, the code had to be of the
> highest
> > > > > quality
> > > > > > and
> > > > > > > > > > > thoroughly tested. The wide variety of our internal
> > > > production
> > > > > > > > systems
> > > > > > > > > > > drove the requirement that the sketch implementations
> had
> > > to
> > > > > > have an
> > > > > > > > > > > absolute minimum of external, run-time dependencies in
> > > order
> > > > to
> > > > > > > > > simplify
> > > > > > > > > > > integration and troubleshooting.
> > > > > > > > > > >
> > > > > > > > > > > Our internal experiments demonstrated dramatic positive
> > > > impact
> > > > > > on the
> > > > > > > > > > > performance of our systems.  As a result, the
> > DataSketches
> > > > > > library
> > > > > > > > > quickly
> > > > > > > > > > > evolved to include different types of sketches for
> > > different
> > > > > > types of
> > > > > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
> > > > > > algorithms,
> > > > > > > > > > > quantile/histogram algorithms, and weighted and
> > unweighted
> > > > > > sampling
> > > > > > > > > > > algorithms.
> > > > > > > > > > >
> > > > > > > > > > > We quickly discovered that developing these sketch
> > > algorithms
> > > > > to
> > > > > > be
> > > > > > > > > truly
> > > > > > > > > > > robust in production environments is quite difficult
> and
> > > > > requires
> > > > > > > > deep
> > > > > > > > > > > understanding of the underlying mathematics and
> > statistics
> > > as
> > > > > > well as
> > > > > > > > > > > extensive experience in developing high quality code
> for
> > > 24/7
> > > > > > > > > production
> > > > > > > > > > > systems. This is a difficult combination of skills for
> > any
> > > > one
> > > > > > > > > organization
> > > > > > > > > > > to collect and maintain over time. It became clear that
> > > this
> > > > > > > > technology
> > > > > > > > > > > needed a community larger than Yahoo to evolve.  In
> > > November,
> > > > > > 2015,
> > > > > > > > > this
> > > > > > > > > > > factor, along with Yahoo’s strong experience and
> support
> > of
> > > > > open
> > > > > > > > > source,
> > > > > > > > > > > led to the decision to open source this technology
> under
> > an
> > > > > > Apache
> > > > > > > > 2.0
> > > > > > > > > > > license on GitHub. Since that time our community has
> > > expanded
> > > > > > > > > considerably
> > > > > > > > > > > > and the key contributors to this effort include
> leading
> > > > > research
> > > > > > > > > > > scientists from a number of universities as well as
> > > > > > practitioners and
> > > > > > > > > > > researchers from a number of major corporations. The
> core
> > > of
> > > > > this
> > > > > > > > > group is
> > > > > > > > > > > very active as we meet weekly to discuss research
> > > directions
> > > > > and
> > > > > > > > > > > engineering priorities.
> > > > > > > > > > >
> > > > > > > > > > > It is important to note that our internal systems at
> > Yahoo
> > > > use
> > > > > > the
> > > > > > > > > current
> > > > > > > > > > > public GitHub open source DataSketches library and not
> an
> > > > > > internal
> > > > > > > > > version
> > > > > > > > > > > of the code.
> > > > > > > > > > >
> > > > > > > > > > > The close collaboration of scientific research and
> > > > engineering
> > > > > > > > > development
> > > > > > > > > > > experience with actual massive-data processing systems
> > has
> > > > also
> > > > > > > > > produced
> > > > > > > > > > > new research publications in the field of stochastic
> > > > streaming
> > > > > > > > > algorithms,
> > > > > > > > > > > for example:
> > > > > > > > > > >
> > > > > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo
> > Liberty,
> > > > Lee
> > > > > > > > > Rhodes, and
> > > > > > > > > > > Justin Thaler. A high-performance algorithm for
> > identifying
> > > > > > frequent
> > > > > > > > > items
> > > > > > > > > > > in data streams. In ACM IMC 2017.
> > > > > > > > > > >
> > > > > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and
> Justin
> > > > > > Thaler. A
> > > > > > > > > > > framework for estimating stream expression
> cardinalities.
> > > In
> > > > > > > > *EDBT/ICDT
> > > > > > > > > > > Proceedings ‘16 *, pages 6:1–6:17, 2016.
> > > > > > > > > > >
> > > > > > > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips.
> Efficient
> > > > > > Frequent
> > > > > > > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
> > > > > > Proceedings
> > > > > > > > > ‘16,
> > > > > > > > > > > pages 845-854, 2016.
> > > > > > > > > > >
> > > > > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty.
> > Optimal
> > > > > > quantile
> > > > > > > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16,
> > > pages
> > > > > > 71–78,
> > > > > > > > > 2016.
> > > > > > > > > > >
> > > > > > > > > > > * Kevin J Lang. Back to the future: an even more nearly
> > > > optimal
> > > > > > > > > cardinality
> > > > > > > > > > > estimation algorithm. arXiv preprint
> > > > > > > > https://arxiv.org/abs/1708.06839,
> > > > > > > > > > > 2017.
> > > > > > > > > > >
> > > > > > > > > > > * Edo Liberty. Simple and deterministic matrix
> sketching.
> > > In
> > > > > ACM
> > > > > > KDD
> > > > > > > > > > > Proceedings ‘13, pages 581– 588, 2013.
> > > > > > > > > > >
> > > > > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
> > > > > Jonathan
> > > > > > > > > Ullman.
> > > > > > > > > > > Space lower bounds for itemset frequency sketches. In
> ACM
> > > > PODS
> > > > > > > > > Proceedings
> > > > > > > > > > > ‘16, pages 441–454, 2016.
> > > > > > > > > > >
> > > > > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin
> > Thaler.
> > > > > > > > Hierarchical
> > > > > > > > > > > heavy hitters with the space saving algorithm. In SIAM
> > > ALENEX
> > > > > > > > > Proceedings
> > > > > > > > > > > ‘12, pages 160–174, 2012.
> > > > > > > > > > >
> > > > > > > > > > > == The Rationale for Sketches ==
> > > > > > > > > > >
> > > > > > > > > > > In the analysis of big data there are often problem
> > queries
> > > > > that
> > > > > > > > don’t
> > > > > > > > > > > scale because they require huge compute resources and
> > time
> > > to
> > > > > > > > generate
> > > > > > > > > > > exact results. Examples include count distinct,
> > quantiles,
> > > > most
> > > > > > > > > frequent
> > > > > > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > > > > > >
> > > > > > > > > > > If we can loosen the requirement of “exact” results
> from
> > > our
> > > > > > queries
> > > > > > > > > and be
> > > > > > > > > > > satisfied with approximate results, within some well
> > > > understood
> > > > > > > > bounds
> > > > > > > > > of
> > > > > > > > > > > error, there is an entire branch of mathematics and
> data
> > > > > science
> > > > > > that
> > > > > > > > > has
> > > > > > > > > > > evolved around developing algorithms that can produce
> > > > > approximate
> > > > > > > > > results
> > > > > > > > > > > with mathematically well-defined error properties.
> > > > > > > > > > >
> > > > > > > > > > > > Adding the requirements that these algorithms
> > must
> > > > be
> > > > > > small
> > > > > > > > > > > (compared to the size of the input data), sublinear
> (the
> > > size
> > > > > of
> > > > > > the
> > > > > > > > > sketch
> > > > > > > > > > > must grow at a slower rate than the size of the input
> > > > stream),
> > > > > > > > > streaming
> > > > > > > > > > > (they can only touch each data item once), and
> mergeable
> > > > > > (suitable
> > > > > > > > for
> > > > > > > > > > > distributed processing), defines a class of algorithms
> > that
> > > > can
> > > > > > be
> > > > > > > > > > > described as small, stochastic, streaming, sublinear
> > > > mergeable
> > > > > > > > > algorithms,
> > > > > > > > > > > commonly called sketches (they also have other names,
> but
> > > we
> > > > > > will use
> > > > > > > > > the
> > > > > > > > > > > term sketches from here on).
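
The four properties above (small, sublinear, streaming, mergeable) can be made concrete with a toy example. The following is a minimal sketch of a K-Minimum-Values (KMV) distinct-count estimator, a textbook technique chosen here purely for illustration; the class name KmvSketch and its methods are hypothetical and are not the DataSketches API or its implementation. It assumes both sketches being merged use the same k.

```python
import hashlib


class KmvSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k=256):
        self.k = k
        self.mins = set()  # at most k smallest hash values, each in [0, 1)

    @staticmethod
    def _hash(item):
        # Map any item to a pseudo-uniform value in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def update(self, item):
        """Streaming: each item is touched once and then discarded."""
        h = self._hash(item)
        if h in self.mins:
            return  # duplicates never change the state
        if len(self.mins) < self.k:
            self.mins.add(h)
            return
        largest = max(self.mins)
        if h < largest:
            self.mins.remove(largest)
            self.mins.add(h)

    def merge(self, other):
        """Mergeable: the k smallest of the union summarize both streams."""
        merged = KmvSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # exact while under capacity
        return (self.k - 1) / max(self.mins)
```

The state never exceeds k values regardless of stream length, and merging two sketches yields the same state as a single sketch fed the combined stream, which is what makes this style of algorithm suitable for distributed processing.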
> > > > > > > > > > >
> > > > > > > > > > > To be truly streaming and be able to process data in a
> > > single
> > > > > > pass,
> > > > > > > > > > > sketches must make absolute minimum assumptions about
> the
> > > > input
> > > > > > > > stream.
> > > > > > > > > > > This is critically important, as there is no “second
> > > chance”
> > > > to
> > > > > > > > > process the
> > > > > > > > > > > data.
> > > > > > > > > > >
> > > > > > > > > > > For example, sketches should not make assumptions about
> > the
> > > > > > order of
> > > > > > > > > stream
> > > > > > > > > > > items, the stream length, the dynamic range of values,
> or
> > > the
> > > > > > > > > distribution
> > > > > > > > > > > of item occurrence frequencies. Sketches should be
> > tolerant
> > > > of
> > > > > > NaNs,
> > > > > > > > > Nulls
> > > > > > > > > > > and empty objects. About the only thing that the sketch
> > > needs
> > > > > to
> > > > > > know
> > > > > > > > > about
> > > > > > > > > > > the stream is how to extract items from it and what
> type
> > > the
> > > > > > item is,
> > > > > > > > > e.g.,
> > > > > > > > > > > is it a numeric value or a string.
> > > > > > > > > > >
> > > > > > > > > > > As far as the sketch is concerned, the input stream is
> a
> > > > > > sequence of
> > > > > > > > > items
> > > > > > > > > > > in some unknown random order with unknown random
> values.
> > > > > > > > > > >
> > > > > > > > > > > The sketch is essentially a complex state machine and
> > > > combined
> > > > > > with
> > > > > > > > the
> > > > > > > > > > > random input stream defines a stochastic process. We
> then
> > > > apply
> > > > > > > > > > > probabilistic methods to interpret the states of the
> > > > stochastic
> > > > > > > > > process in
> > > > > > > > > > > order to extract useful information about the input
> > stream
> > > > > > itself.
> > > > > > > > The
> > > > > > > > > > > resulting information will be approximate, but we also
> > use
> > > > > > additional
> > > > > > > > > > > probabilistic methods to extract an estimate of the
> > likely
> > > > > > > > probability
> > > > > > > > > > > distribution of error.
> > > > > > > > > > >
> > > > > > > > > > > There is a significant scientific contribution here
> that
> > is
> > > > > > defining
> > > > > > > > > the
> > > > > > > > > > > state machine, understanding the resulting stochastic
> > > > process,
> > > > > > > > > developing
> > > > > > > > > > > the probabilistic methods, and proving mathematically,
> > that
> > > > it
> > > > > > all
> > > > > > > > > works!
> > > > > > > > > > > This is why the scientific contributors to this project
> > > are a
> > > > > > > > critical
> > > > > > > > > and
> > > > > > > > > > > strategic component to our success.  The development
> > > > engineers
> > > > > > > > > translate
> > > > > > > > > > > the concepts of the proposed state machine and
> > > probabilistic
> > > > > > methods
> > > > > > > > > into
> > > > > > > > > > > production-quality code. Even more important, they work
> > > > closely
> > > > > > with
> > > > > > > > > the
> > > > > > > > > > > scientists, feeding back system and user requirements,
> > > which
> > > > > > leads
> > > > > > > > not
> > > > > > > > > only
> > > > > > > > > > > to superior product design, but to new science as well.
> > A
> > > > > > number of
> > > > > > > > > > > scientific papers our members have published (see
> above)
> > > are a
> > > > > > direct
> > > > > > > > > result
> > > > > > > > > > > of this close collaboration.
> > > > > > > > > > >
> > > > > > > > > > > Because sketches are small they can be processed
> > extremely
> > > > > fast,
> > > > > > > > often
> > > > > > > > > many
> > > > > > > > > > > orders-of-magnitude faster than traditional exact
> > > > computations.
> > > > > > For
> > > > > > > > > > > interactive queries there may not be other viable
> > > > alternatives,
> > > > > > and
> > > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > case of real-time analysis, sketches are the only known
> > > > > solution.
> > > > > > > > > > >
> > > > > > > > > > > For any system that needs to extract useful information
> > > from
> > > > > > massive
> > > > > > > > > > data,
> > > > > > > > > > > sketches are essential tools that should be tightly
> > > > integrated
> > > > > > into
> > > > > > > > the
> > > > > > > > > > > system’s analysis capabilities. This technology has
> > helped
> > > > > Yahoo
> > > > > > > > > > > successfully reduce data processing times from days to
> > > hours
> > > > or
> > > > > > > > > minutes on
> > > > > > > > > > > a number of its internal platforms and has enabled
> > > subsecond
> > > > > > queries
> > > > > > > > on
> > > > > > > > > > > real-time platforms that would have been infeasible
> > without
> > > > > > sketches.
> > > > > > > > > > > >
> > > > > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > > > > >
> > > > > > > > > > > Other open source implementations of sketch algorithms
> > can
> > > be
> > > > > > found
> > > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > Internet. However, we have not yet found any open
> source
> > > > > > > > > implementations
> > > > > > > > > > > that are as comprehensive, engineered with the quality
> > > > required
> > > > > > for
> > > > > > > > > > > production systems, and with usable and guaranteed
> error
> > > > > > properties.
> > > > > > > > > Large
> > > > > > > > > > > Internet companies, such as Google and Facebook, have
> > > > published
> > > > > > > > papers
> > > > > > > > > on
> > > > > > > > > > > > sketching; however, their implementations of their
> > > published
> > > > > > > > > algorithms are
> > > > > > > > > > > proprietary and not available as open source.
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library already provides integrations
> > > with a
> > > > > > number
> > > > > > > > of
> > > > > > > > > > > major Apache data processing platforms such as Apache
> > Hive,
> > > > > > Apache
> > > > > > > > Pig,
> > > > > > > > > > > Apache Spark and Apache Druid, and is also integrated
> > with
> > > a
> > > > > > number
> > > > > > > > of
> > > > > > > > > > > other open source data processing platforms such as
> > Splice
> > > > > > Machine,
> > > > > > > > > GCHQ
> > > > > > > > > > > Gaffer and PostgreSQL.
> > > > > > > > > > >
> > > > > > > > > > > We believe that having DataSketches as an Apache
> project
> > > will
> > > > > > provide
> > > > > > > > > an
> > > > > > > > > > > immediate, worthwhile, and substantial contribution to
> > the
> > > > open
> > > > > > > > source
> > > > > > > > > > > community, will have a better opportunity to provide a
> > > > > meaningful
> > > > > > > > > > > contribution to both the science and engineering of
> > > sketching
> > > > > > > > > algorithms,
> > > > > > > > > > > and integrate with other Apache projects.  In addition,
> > > this
> > > > > is a
> > > > > > > > > > > significant opportunity for Apache to be the "go-to"
> > > > > destination
> > > > > > for
> > > > > > > > > users
> > > > > > > > > > > that want to leverage this exciting technology.
> > > > > > > > > > >
> > > > > > > > > > > == Initial Goals ==
> > > > > > > > > > >
> > > > > > > > > > > We are breaking our initial goals into short-term (2-6
> > > > months)
> > > > > > and
> > > > > > > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > > > > > > >
> > > > > > > > > > > Our short-term goals include:
> > > > > > > > > > >
> > > > > > > > > > > * Understanding and adapting to the Apache development
> > > > process
> > > > > > and
> > > > > > > > > > > structures.
> > > > > > > > > > >
> > > > > > > > > > > * Start refactoring codebase and move various
> > DataSketches
> > > > > > > > repositories
> > > > > > > > > > > code to Apache Git repository.
> > > > > > > > > > >
> > > > > > > > > > > * Continue development of new features, functions, and
> > > fixes.
> > > > > > > > > > >
> > > > > > > > > > > * Specific sub-projects (e.g., C++ and Python) will
> > > continue
> > > > to
> > > > > > be
> > > > > > > > > > > developed and expanded.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The intermediate to long term goals include:
> > > > > > > > > > >
> > > > > > > > > > > * Completing the design and implementation of the C++
> > > > sketches
> > > > > to
> > > > > > > > > > > complement what is already available in Java, and the
> > > Python
> > > > > > wrappers
> > > > > > > > > of
> > > > > > > > > > > those C++ sketches.
> > > > > > > > > > >
> > > > > > > > > > > * Expanding the C++ build framework to include Windows
> > and
> > > > the
> > > > > > > > popular
> > > > > > > > > > > Linux variants.
> > > > > > > > > > >
> > > > > > > > > > > * Continued engagement with the scientific research
> > > community
> > > > > on
> > > > > > the
> > > > > > > > > > > development of new algorithms for computationally
> > difficult
> > > > > > problems
> > > > > > > > > that
> > > > > > > > > > > heretofore have not had a sketching solution.
> > > > > > > > > > >
> > > > > > > > > > > == Current Status ==
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches GitHub project has been quite
> > successful.
> > > > As
> > > > > of
> > > > > > > > this
> > > > > > > > > > > writing (Feb, 2019) the number of downloads measured by
> > the
> > > > > Nexus
> > > > > > > > > > > Repository Manager at https://oss.sonatype.org has
> grown
> > > by
> > > > > > nearly a
> > > > > > > > > > > factor
> > > > > > > > > > > of 10 over the past year to about 55 thousand per
> month.
> > > The
> > > > > > > > > > > DataSketches/sketches-core repository has about 560
> stars
> > > and
> > > > > 141
> > > > > > > > > forks,
> > > > > > > > > > > which is pretty good for a highly specialized library.
> > > > > > > > > > >
> > > > > > > > > > > === Development Practices ===
> > > > > > > > > > >
> > > > > > > > > > > ==== Source Control ====
> > > > > > > > > > >
> > > > > > > > > > > All of our developers have extensive experience with
> Git
> > > > > version
> > > > > > > > > control
> > > > > > > > > > > and follow accepted practices for use of Pull Requests
> > > (PRs),
> > > > > > code
> > > > > > > > > reviews
> > > > > > > > > > > and commits to master, for example.
> > > > > > > > > > >
> > > > > > > > > > > ==== Testing ====
> > > > > > > > > > >
> > > > > > > > > > > > Sketches, by their nature, are probabilistic programs
> and
> > > > don’t
> > > > > > > > > necessarily
> > > > > > > > > > > behave deterministically.  For some of the sketches we
> > > > > > intentionally
> > > > > > > > > insert
> > > > > > > > > > > random noise into the code as this gives us the
> > > mathematical
> > > > > > > > properties
> > > > > > > > > > > that we need to guarantee accuracy.  This can make the
> > > > behavior
> > > > > > of
> > > > > > > > > these
> > > > > > > > > > > algorithms quite unintuitive and provides significant
> > > > > challenges
> > > > > > to
> > > > > > > > the
> > > > > > > > > > > developer who wishes to test these algorithms for
> > > > correctness.
> > > > > > As a
> > > > > > > > > result,
> > > > > > > > > > > our testing strategy includes two major components:
> unit
> > > > tests,
> > > > > > and
> > > > > > > > > > > characterization tests.
> > > > > > > > > > >
> > > > > > > > > > > ===== Unit Testing =====
> > > > > > > > > > >
> > > > > > > > > > > Our unit tests are primarily quick tests to make sure
> > that
> > > we
> > > > > > > > exercise
> > > > > > > > > all
> > > > > > > > > > > critical paths in the code and that key branches are
> > > executed
> > > > > > > > > correctly. It
> > > > > > > > > > > is important that they execute relatively fast as they
> > are
> > > > > > generally
> > > > > > > > > run on
> > > > > > > > > > > every code build. The sketches-core repository alone
> has
> > > > about
> > > > > 22
> > > > > > > > > thousand
> > > > > > > > > > > statements, over 1300 unit tests and code coverage of
> > about
> > > > > > 98.2% as
> > > > > > > > > > > measured by Atlassian/Clover.  It is our goal for all
> of
> > > our
> > > > > code
> > > > > > > > > > > repositories that are used in production that they have
> > > code
> > > > > > coverage
> > > > > > > > > > > greater than 90%.
> > > > > > > > > > >
> > > > > > > > > > > ===== Characterization Testing =====
> > > > > > > > > > >
> > > > > > > > > > > In order to test the probabilistic methods that are
> used
> > to
> > > > > > interpret
> > > > > > > > > the
> > > > > > > > > > > stochastic behaviors of our sketches we have a separate
> > > > > > > > > characterization
> > > > > > > > > > > repository that is dedicated to this.  To measure
> > accuracy,
> > > > for
> > > > > > > > > example,
> > > > > > > > > > > requires running thousands of trials at each of many
> > > > different
> > > > > > points
> > > > > > > > > along
> > > > > > > > > > > the domain axis. Each trial compares its estimated
> > results
> > > > > > against a
> > > > > > > > > known
> > > > > > > > > > > exact result producing an error for that trial.  These
> > > error
> > > > > > > > > measurements
> > > > > > > > > > > are then fed into our Quantiles sketch to capture the
> > > actual
> > > > > > > > > distribution
> > > > > > > > > > > of error at that point along the axis. We then select
> > > > quantile
> > > > > > > > contours
> > > > > > > > > > > across all the distributions at points along the axis.
> > > These
> > > > > > > > contours
> > > > > > > > > can
> > > > > > > > > > > then be plotted to reveal the shape of the actual error
> > > > > > distribution.
> > > > > > > > > These
> > > > > > > > > > > > distributions are not at all Gaussian; in fact they can
> > be
> > > > > quite
> > > > > > > > > complex.
> > > > > > > > > > > Nonetheless, these distributions are then checked
> against
> > > our
> > > > > > > > > statistical
> > > > > > > > > > > guarantees inherent to the specific sketch algorithm
> and
> > > its
> > > > > > > > > parameters.
> > > > > > > > > > > There are many examples of these characterization error
> > > > > > distributions
> > > > > > > > > on
> > > > > > > > > > > our website. The runtimes of these tests can be very
> long
> > > and
> > > > > can
> > > > > > > > range
> > > > > > > > > > > from many minutes to hours, and some can run for days.
> > > > > > Currently, we
> > > > > > > > > have
> > > > > > > > > > > separate characterization repositories for Java and
> C++ /
> > > > > Python.
> > > > > > > > > > >
> > > > > > > > > > > It is our goal that we perform this characterization
> > > analysis
> > > > > > for all
> > > > > > > > > of
> > > > > > > > > > > our sketches.  By definition, the code that runs these
> > > > > > > > characterization
> > > > > > > > > > > tests is open-source so others can run these tests as
> > well.
> > > > We
> > > > > > do
> > > > > > > > not
> > > > > > > > > have
> > > > > > > > > > > formal releases of this code (because it is not
> > production
> > > > > code)
> > > > > > and
> > > > > > > > > it is
> > > > > > > > > > > not published to Maven Central.
> > > > > > > > > > >
> > > > > > > > > > > === Meritocracy ===
> > > > > > > > > > >
> > > > > > > > > > > DataSketches was initially developed based on
> > requirements
> > > > > within
> > > > > > > > > Yahoo. As
> > > > > > > > > > > a project on GitHub, DataSketches has received
> > > contributions
> > > > > from
> > > > > > > > > numerous
> > > > > > > > > > > individual developers from around the world, dedicated
> > > > research
> > > > > > work
> > > > > > > > > from
> > > > > > > > > > > senior scientists at Amazon and Visa, and academic
> > > > researchers
> > > > > > from
> > > > > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > > > > >
> > > > > > > > > > > As a project under incubation, we are committed to
> > > expanding
> > > > > our
> > > > > > > > > effort to
> > > > > > > > > > > build an environment which supports a meritocracy. We
> are
> > > > > > focused on
> > > > > > > > > > > engaging the community and other related projects for
> > > support
> > > > > and
> > > > > > > > > > > > contributions. Moreover, we are committed to ensuring
> > > > > contributors
> > > > > > and
> > > > > > > > > > > committers to DataSketches come from a broad mix of
> > > > > organizations
> > > > > > > > > through a
> > > > > > > > > > > merit-based decision process during incubation. We
> > believe
> > > > > > strongly
> > > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > DataSketches premise that fulfills the concept of a
> well
> > > > > > engineered
> > > > > > > > and
> > > > > > > > > > > scientifically rigorous library that implements these
> > > > powerful
> > > > > > > > > algorithms
> > > > > > > > > > > and are committed to growing an inclusive community of
> > > > > > DataSketches
> > > > > > > > > > > contributors and users.
> > > > > > > > > > >
> > > > > > > > > > > === Community ===
> > > > > > > > > > >
> > > > > > > > > > > Yahoo has a long history and active engagement in the
> > Open
> > > > > Source
> > > > > > > > > > > community. Major projects include: Vespa.ai, Bullet,
> > > Moloch,
> > > > > > > > Panoptes,
> > > > > > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > > > > TensorFlowOnSpark,
> > > > > > > > > gifshot,
> > > > > > > > > > > fluxible, as well as the creation, contribution and
> > > > incubation
> > > > > of
> > > > > > > > many
> > > > > > > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper,
> > > > Oozie,
> > > > > > > > > Zookeeper,
> > > > > > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many
> > more.
> > > > > > > > > > >
> > > > > > > > > > > > Every day, DataSketches is actively used by
> > organizations
> > > > and
> > > > > > > > > > > institutions around the world for batch and stream
> > > processing
> > > > > of
> > > > > > > > data.
> > > > > > > > > We
> > > > > > > > > > > believe acceptance will allow us to consolidate
> existing
> > > > > > > > > > > DataSketches-related work, grow the DataSketches
> > community,
> > > > and
> > > > > > > > deepen
> > > > > > > > > > > connections between DataSketches and other open source
> > > > > projects.
> > > > > > > > > > >
> > > > > > > > > > > === Introduction to the Core Developers & Contributors
> > ===
> > > > > > > > > > >
> > > > > > > > > > > The core developers and contributors for DataSketches
> are
> > > > from
> > > > > > > > diverse
> > > > > > > > > > > backgrounds, but primarily are scientists that love
> > > > engineering
> > > > > > and
> > > > > > > > > > > engineers that love science. A large part of the value
> we
> > > > bring
> > > > > > comes
> > > > > > > > > from
> > > > > > > > > > > this synthesis.  These individuals have already
> > contributed
> > > > > > > > > substantially
> > > > > > > > > > > to the code, algorithms, and/or mathematical proofs
> that
> > > form
> > > > > the
> > > > > > > > > basis of
> > > > > > > > > > > the library.
> > > > > > > > > > >
> > > > > > > > > > > > This core group also forms the Initial Committers with
> > write
> > > > > > > > > permissions to
> > > > > > > > > > > > the repository. Those marked with (*) meet weekly to
> plan
> > > the
> > > > > > > > research
> > > > > > > > > and
> > > > > > > > > > > engineering direction of the project.
> > > > > > > > > > >
> > > > > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > > > > >
> > > > > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs,
> > > > Israel.
> > > > > > > > > Interests:
> > > > > > > > > > > distributed systems, scalable systems and platforms for
> > big
> > > > > data
> > > > > > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > > > > > >
> > > > > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist,
> Yahoo
> > > > Labs,
> > > > > > > > > Sunnyvale,
> > > > > > > > > > > California. Interests: algorithms, theoretical and
> > applied
> > > > > > > > mathematics,
> > > > > > > > > > > encoding and compression theory, theoretical and
> applied
> > > > > > performance
> > > > > > > > > > > optimization.
> > > > > > > > > > >
> > > > > > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon
> > AI
> > > > > Labs,
> > > > > > Palo
> > > > > > > > > Alto,
> > > > > > > > > > > California. Manages the algorithms group at Amazon AI.
> We
> > > > build
> > > > > > > > > scalable
> > > > > > > > > > > machine learning systems and algorithms which are used
> > both
> > > > > > > > internally
> > > > > > > > > and
> > > > > > > > > > > externally by customers of SageMaker, AWS's flagship
> > > machine
> > > > > > learning
> > > > > > > > > > > platform.
> > > > > > > > > > >
> > > > > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs,
> > Sunnyvale.
> > > > > > Interests:
> > > > > > > > > > > Computational advertising, machine learning, speech
> > > > > recognition,
> > > > > > > > > > > data-driven analysis, large scale experimentation, big
> > > data,
> > > > > > > > > stream/complex
> > > > > > > > > > > > event processing.
> > > > > > > > > > >
> > > > > > > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> > > > > Computer
> > > > > > > > > Science,
> > > > > > > > > > > Georgetown University, Washington D.C. Interests:
> > > algorithms
> > > > > and
> > > > > > > > > > > computational complexity, complexity theory, quantum
> > > > > algorithms,
> > > > > > > > > private
> > > > > > > > > > > data analysis, and learning theory, developing
> efficient
> > > > > > streaming
> > > > > > > > and
> > > > > > > > > > > > sketching algorithms.
> > > > > > > > > > >
> > > > > > > > > > > ==== Engineers That Love Science ====
> > > > > > > > > > >
> > > > > > > > > > > * Roman Leventov: Senior Software Engineer,
> Metamarkets
> > /
> > > > > Snap.
> > > > > > > > > Interests:
> > > > > > > > > > > design and implementation of data storing and data
> > > processing
> > > > > > > > > (distributed)
> > > > > > > > > > > systems, performance optimization, CPU performance,
> > > > mechanical
> > > > > > > > > sympathy,
> > > > > > > > > > > JVM performance, API design, databases, (concurrent)
> data
> > > > > > structures,
> > > > > > > > > > > memory management, garbage collection algorithms,
> > language
> > > > > > design and
> > > > > > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > > > > > efficiency,
> > > > > > > > > Linux,
> > > > > > > > > > > code quality, code transformation, pure functional
> > > > programming
> > > > > > > > models,
> > > > > > > > > > > Haskell.
> > > > > > > > > > >
> > > > > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead
> developer
> > > and
> > > > > > founder
> > > > > > > > > of
> > > > > > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > > > > > Interests:
> > > > > > > > > > > streaming algorithms, mathematics, computer science,
> high
> > > > > > quality and
> > > > > > > > > high
> > > > > > > > > > > performance code for the analysis of massive data,
> > bridging
> > > > the
> > > > > > > > divide
> > > > > > > > > > > between theory and practice.
> > > > > > > > > > >
> > > > > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer,
> > Yahoo,
> > > > > > Sunnyvale,
> > > > > > > > > > > California. Interests: applied mathematics, computer
> > > science,
> > > > > big
> > > > > > > > data,
> > > > > > > > > > > distributed systems.
> > > > > > > > > > >
> > > > > > > > > > > === Introduction to Additional Interested Contributors
> > ===
> > > > > > > > > > >
> > > > > > > > > > > These folks have been intermittently involved and
> > > > contributed,
> > > > > > but
> > > > > > > > are
> > > > > > > > > > > strong supporters of this project.
> > > > > > > > > > >
> > > > > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > > > > >
> > > > > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> > > > > Computer
> > > > > > > > > Science,
> > > > > > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining,
> > > > matrix
> > > > > > > > > > > approximation, streaming algorithms, randomized linear
> > > > algebra.
> > > > > > > > > > >
> > > > > > > > > > > * Christopher Musco: [christopher.musco at gmail dot
> com]
> > > > Ph.D.
> > > > > > > > > Computer
> > > > > > > > > > > Science, Research Instructor, Princeton University.
> > > > Interests:
> > > > > > > > > algorithmic
> > > > > > > > > > > foundations of data science and machine learning,
> > efficient
> > > > > > methods
> > > > > > > > for
> > > > > > > > > > > processing and understanding large datasets, often
> > working
> > > at
> > > > > the
> > > > > > > > > > > intersection of theoretical computer science, numerical
> > > > linear
> > > > > > > > > algebra, and
> > > > > > > > > > > optimization.
> > > > > > > > > > >
> > > > > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk]
> Ph.D.
> > > > > > Computer
> > > > > > > > > Science,
> > > > > > > > > > > Professor, Warwick University, Warwick, England.
> > Interests:
> > > > all
> > > > > > > > > aspects of
> > > > > > > > > > > the "data lifecycle", from data collection and
> cleaning,
> > > > > through
> > > > > > > > > mining and
> > > > > > > > > > > analytics. (Professor Cormode is one of the world’s
> > leading
> > > > > > > > scientists
> > > > > > > > > in
> > > > > > > > > > > sketching algorithms)
> > > > > > > > > > >
> > > > > > > > > > > === Alignment ===
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library already provides integrations
> > and
> > > > > > example
> > > > > > > > > code for
> > > > > > > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply
> > > > integrated
> > > > > > into
> > > > > > > > > Apache
> > > > > > > > > > > Druid.
> > > > > > > > > > >
> > > > > > > > > > > == Known Risks ==
> > > > > > > > > > >
> > > > > > > > > > > The following subsections are specific risks that have
> > been
> > > > > > > > identified
> > > > > > > > > by
> > > > > > > > > > > the ASF that need to be addressed.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Orphaned Products ===
> > > > > > > > > > >
> > > > > > > > > > > The DataSketches library is presently used by a number
> of
> > > > > > > > > organizations,
> > > > > > > > > > > from small startups to Fortune 100 companies, to
> > construct
> > > > > > production
> > > > > > > > > > > pipelines that must process and analyze massive data.
> > Yahoo
> > > > > has a
> > > > > > > > > long-term
> > > > > > > > > > > commitment to continue to advance the DataSketches
> > library;
> > > > > > moreover,
> > > > > > > > > > > DataSketches is seeing increasing interest,
> development,
> > > and
> > > > > > adoption
> > > > > > > > > from
> > > > > > > > > > > many diverse organizations from around the world. Due
> to
> > > its
> > > > > > growing
> > > > > > > > > > > adoption, we feel it is quite unlikely that this
> project
> > > > would
> > > > > > become
> > > > > > > > > > > orphaned.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > > > > >
> > > > > > > > > > > Yahoo believes strongly in open source and the exchange
> > of
> > > > > > > > information
> > > > > > > > > to
> > > > > > > > > > > advance new ideas and work. Examples of this commitment
> > are
> > > > > > active
> > > > > > > > open
> > > > > > > > > > > source projects such as those mentioned above. With
> > > > > > DataSketches, we
> > > > > > > > > have
> > > > > > > > > > > been increasingly open and forward-looking; we have
> > > > published a
> > > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > papers about breakthrough developments in the science
> of
> > > > > > streaming
> > > > > > > > > > > algorithms (mentioned above) that also reference the
> > > > > DataSketches
> > > > > > > > > library.
> > > > > > > > > > > Our submission to the Apache Software Foundation is a
> > > logical
> > > > > > > > > extension of
> > > > > > > > > > > our commitment to open source software.
> > > > > > > > > > >
> > > > > > > > > > > Key committers at Yahoo with strong open source
> > backgrounds
> > > > > > include
> > > > > > > > > Aaron
> > > > > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> > > > Braginsky,
> > > > > > > > Andrews
> > > > > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen,
> > Bryan
> > > > > Call,
> > > > > > > > Daryn
> > > > > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric
> Payne,
> > > > Eshcar
> > > > > > > > Hillel,
> > > > > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > > > > > Perez-Sorrosal,
> > > > > > > > > Gil
> > > > > > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai
> > Asher,
> > > > > James
> > > > > > > > > Penick,
> > > > > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis,
> Jon
> > > > > Eagles,
> > > > > > > > > Kihwal
> > > > > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla,
> > Michael
> > > > > > Trelinski,
> > > > > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham,
> Olga
> > L.
> > > > > > > > Natkovich,
> > > > > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini
> Palaniswamy,
> > > > Ruby
> > > > > > Loo,
> > > > > > > > > Ryan
> > > > > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley,
> Shu
> > > Kit
> > > > > > Chan,
> > > > > > > > Sri
> > > > > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and
> > many
> > > > > more.
> > > > > > > > > > >
> > > > > > > > > > > > All of our core developers are committed to learning about
> > the
> > > > > > Apache
> > > > > > > > > process
> > > > > > > > > > > and to give back to the community.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > > > > >
> > > > > > > > > > > The majority of committers in this proposal belong to
> > Yahoo
> > > > due
> > > > > > to
> > > > > > > > the
> > > > > > > > > fact
> > > > > > > > > > > that DataSketches has emerged from an internal Yahoo
> > > project.
> > > > > > This
> > > > > > > > > proposal
> > > > > > > > > > > also includes developers and contributors from other
> > > > companies,
> > > > > > and
> > > > > > > > > who are
> > > > > > > > > > > actively involved with other Apache projects, such as
> > > Druid.
> > > > > We
> > > > > > > > > expect our
> > > > > > > > > > > entry into incubation will allow us to expand the
> number
> > of
> > > > > > > > > individuals and
> > > > > > > > > > > organizations participating in DataSketches
> development.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > > > > >
> > > > > > > > > > > Because the DataSketches library originated within
> Yahoo,
> > > it
> > > > > has
> > > > > > been
> > > > > > > > > > > developed primarily by salaried Yahoo developers and we
> > > > expect
> > > > > > that
> > > > > > > > to
> > > > > > > > > > > continue to be the case near term. However, since we
> > placed
> > > > > this
> > > > > > > > > library
> > > > > > > > > > > into open-source we have had a number of significant
> > > > > > contributions
> > > > > > > > from
> > > > > > > > > > > engineers and scientists from outside of Yahoo. We
> expect
> > > our
> > > > > > > > reliance
> > > > > > > > > on
> > > > > > > > > > > Yahoo salaried developers will decrease over time.
> > > > Nonetheless,
> > > > > > Yahoo
> > > > > > > > > is
> > > > > > > > > > > committed to continue its strong support of this
> > important
> > > > > > project.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Lack of Relationship to other Apache Products
> > ===
> > > > > > > > > > >
> > > > > > > > > > > DataSketches already directly interoperates with or
> > > utilizes
> > > > > > several
> > > > > > > > > > > existing Apache projects.
> > > > > > > > > > >
> > > > > > > > > > > * Build
> > > > > > > > > > >    * Apache Maven
> > > > > > > > > > >
> > > > > > > > > > > * Integrations and adaptors for the following projects
> > > > > naturally
> > > > > > have
> > > > > > > > > them
> > > > > > > > > > > as dependencies
> > > > > > > > > > >    * Apache Hive
> > > > > > > > > > >    * Apache Pig
> > > > > > > > > > >    * Apache Druid
> > > > > > > > > > >    * Apache Spark
> > > > > > > > > > >
> > > > > > > > > > > * Additional dependencies for the above integrations
> and
> > > > > adaptors
> > > > > > > > > include
> > > > > > > > > > >    * Apache Hadoop
> > > > > > > > > > >    * Apache Commons (Math)
> > > > > > > > > > >
> > > > > > > > > > > There is no other Apache project that we are aware of
> > that
> > > > > > duplicates
> > > > > > > > > the
> > > > > > > > > > > functionality of the DataSketches library.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: An Excessive Fascination with the Apache
> Brand
> > > ===
> > > > > > > > > > >
> > > > > > > > > > > With this proposal we are not seeking attention or
> > > publicity.
> > > > > > Rather,
> > > > > > > > > we
> > > > > > > > > > > firmly believe in the DataSketches library and concept
> > and
> > > > the
> > > > > > > > ability
> > > > > > > > > to
> > > > > > > > > > > make the DataSketches library a powerful, yet
> > simple-to-use
> > > > > > toolkit
> > > > > > > > for
> > > > > > > > > > > data processing. While the DataSketches library has
> been
> > > open
> > > > > > source,
> > > > > > > > > we
> > > > > > > > > > > believe putting code on GitHub can only go so far. We
> see
> > > the
> > > > > > Apache
> > > > > > > > > > > community, processes, and mission as critical for
> > ensuring
> > > > the
> > > > > > > > > DataSketches
> > > > > > > > > > > library is truly community-driven, positively
> impactful,
> > > and
> > > > > > > > innovative
> > > > > > > > > > > open source software. While Yahoo has taken a number of
> > > steps
> > > > > to
> > > > > > > > > advance
> > > > > > > > > > > its various open source projects, we believe the
> > > DataSketches
> > > > > > library
> > > > > > > > > > > project is a great fit for the Apache Software
> Foundation
> > > due
> > > > > to
> > > > > > its
> > > > > > > > > focus
> > > > > > > > > > > on data processing and its relationships to existing
> ASF
> > > > > > projects.
> > > > > > > > > > >
> > > > > > > > > > > === Risk: Cryptography ===
> > > > > > > > > > >
> > > > > > > > > > > DataSketches does not contain any cryptographic code
> and
> > is
> > > > > not a
> > > > > > > > > > > cryptographic product.
> > > > > > > > > > >
> > > > > > > > > > > == Documentation ==
> > > > > > > > > > >
> > > > > > > > > > > The following documentation is relevant to this
> proposal.
> > > > > > Relevant
> > > > > > > > > portions
> > > > > > > > > > > of the documentation will be contributed to the Apache
> > > > > > DataSketches
> > > > > > > > > > > project.
> > > > > > > > > > >
> > > > > > > > > > > * DataSketches website: https://datasketches.github.io
> .
> > > > > > > > > > >
> > > > > > > > > > > * DataSketches website repository:
> > > > > > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > > > > > >
> > > > > > > > > > > > We will need an Apache website for this documentation
> > > similar
> > > > > to
> > > > > > > > > > >
> > > > > > > > > > > * https://datasketches.apache.org
> > > > > > > > > > >
> > > > > > > > > > > == Initial Source ==
> > > > > > > > > > >
> > > > > > > > > > > The initial source for DataSketches which we will
> submit
> > to
> > > > the
> > > > > > > > Apache
> > > > > > > > > > > Foundation will include a number of repositories which
> > are
> > > > > > currently
> > > > > > > > > hosted
> > > > > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > > > > >
> > > > > > > > > > > All github.com/datasketches repositories including:
> > > > > > > > > > >
> > > > > > > > > > > * Java
> > > > > > > > > > >    * sketches-core: This repository has the core
> > sketching
> > > > > > classes,
> > > > > > > > > which
> > > > > > > > > > > are leveraged by some of the other repositories. This
> > > > > repository
> > > > > > has
> > > > > > > > no
> > > > > > > > > > > external dependencies outside of the
> DataSketches/memory
> > > > > > repository,
> > > > > > > > > Java
> > > > > > > > > > > and TestNG for unit tests. This code is versioned and
> the
> > > > > latest
> > > > > > > > > release
> > > > > > > > > > > can be obtained from Maven Central.
> > > > > > > > > > >    * memory: Low level, high-performance memory
> > > > data-structure
> > > > > > > > > management
> > > > > > > > > > > primarily for off-heap.
> > > > > > > > > > >    * sketches-android: This is a new repository
> dedicated
> > > to
> > > > > > sketches
> > > > > > > > > > > designed to be run in a mobile client, such as a cell
> > > phone.
> > > > It
> > > > > > is
> > > > > > > > > still in
> > > > > > > > > > > development and should be considered experimental.
> > > > > > > > > > >    * sketches-hive: This repository contains Hive UDFs
> > and
> > > > > UDAFs
> > > > > > for
> > > > > > > > > use
> > > > > > > > > > > within Hadoop grid environments. This code has
> > dependencies
> > > > on
> > > > > > > > > > > sketches-core as well as Hadoop and Hive. Users of this
> > > code
> > > > > are
> > > > > > > > > advised to
> > > > > > > > > > > use Maven to bring in all the required dependencies.
> This
> > > > code
> > > > > is
> > > > > > > > > versioned
> > > > > > > > > > > and the latest release can be obtained from Maven
> > Central.
> > > > > > > > > > >    * sketches-pig: This repository contains Pig User
> > > Defined
> > > > > > > > Functions
> > > > > > > > > > > (UDF) for use within Hadoop grid environments. This
> code
> > > has
> > > > > > > > > dependencies
> > > > > > > > > > > on sketches-core as well as Hadoop and Pig. Users of
> this
> > > > code
> > > > > > are
> > > > > > > > > advised
> > > > > > > > > > > to use Maven to bring in all the required dependencies.
> > > This
> > > > > > code is
> > > > > > > > > > > versioned and the latest release can be obtained from
> > Maven
> > > > > > Central.
> > > > > > > > > > >    * sketches-vector: This is a new repository
> dedicated
> > to
> > > > > > sketches
> > > > > > > > > for
> > > > > > > > > > > vector and matrix operations. It is still somewhat
> > > > > experimental.
> > > > > > > > > > >    * characterization: This relatively new repository
> is
> > > for
> > > > > code
> > > > > > > > that
> > > > > > > > > we
> > > > > > > > > > > use to characterize the accuracy and speed performance
> of
> > > the
> > > > > > > > sketches
> > > > > > > > > in
> > > > > > > > > > > the library and is constantly being updated. Examples
> of
> > > the
> > > > > job
> > > > > > > > > command
> > > > > > > > > > > files used for various tests can be found in the
> > > > > > src/main/resources
> > > > > > > > > > > directory. Some of these tests can run for hours
> > depending
> > > on
> > > > > > their
> > > > > > > > > > > configuration.
> > > > > > > > > > >    * experimental: This repository is an experimental
> > > staging
> > > > > > area
> > > > > > > > for
> > > > > > > > > code
> > > > > > > > > > > that will eventually end up in another repository. This
> > > code
> > > > is
> > > > > > not
> > > > > > > > > > > versioned and not registered with Maven Central.
> > > > > > > > > > >    * sketches-misc: Demos and other code not related to
> > > > > > production
> > > > > > > > > > > deployment
> > > > > > > > > > >
> > > > > > > > > > > * C++ and Python
> > > > > > > > > > >    * sketches-core-cpp: This is the C++/Python
> companion
> > to
> > > > the
> > > > > > Java
> > > > > > > > > > > sketches-core. These implementations are binary
> > compatible
> > > > with
> > > > > > their
> > > > > > > > > > > counterparts in Java. In other words, a sketch created
> > and
> > > > > > stored in
> > > > > > > > > C++
> > > > > > > > > > > > can be opened and read in Java and vice versa. This
> site
> > > also
> > > > > > has our
> > > > > > > > > > > Python adaptors that basically wrap the C++
> > > implementations,
> > > > > > making
> > > > > > > > the
> > > > > > > > > > > high performance C++ implementations available from
> > Python.
> > > > > > > > > > >    * sketches-postgres: This site provides the
> > > > > postgres-specific
> > > > > > > > > adaptors
> > > > > > > > > > > that wrap the C++ implementations making them available
> > to
> > > > the
> > > > > > > > Postgres
> > > > > > > > > > > database users.
> > > > > > > > > > >    * characterization-cpp: This is the C++/Python
> > companion
> > > > to
> > > > > > the
> > > > > > > > Java
> > > > > > > > > > > characterization repository.
> > > > > > > > > > >    * experimental-cpp: This repository is an
> experimental
> > > > > staging
> > > > > > > > area
> > > > > > > > > for
> > > > > > > > > > > C++ code that will eventually end up in another
> > repository.
> > > > > > > > > > >
> > > > > > > > > > > * Command-Line Tools
> > > > > > > > > > >    * sketches-cmd
> > > > > > > > > > >    * homebrew-sketches
> > > > > > > > > > >    * homebrew-sketches-cmd
> > > > > > > > > > >
> > > > > > > > > > > These projects have always been Apache 2.0 licensed. We
> > > > intend
> > > > > to
> > > > > > > > > bundle
> > > > > > > > > > > all of these repositories since they are all
> > complementary
> > > > and
> > > > > > should
> > > > > > > > > be
> > > > > > > > > > > maintaine

-- 
From my cell phone.
