Re: DataSketches Proposal - Google Docs Link

Liang Chen Tue, 26 Feb 2019 16:02:17 -0800

Hi Kenneth

Please try this link :
https://docs.google.com/document/d/1_cnesVLtKqPeUYxJvsd_2MTFwgeC1wUqI6cDPCbBRSM/edit#heading=h.97rxea60t2yw


Regards
Liang


Kenneth Knowles wrote
> I could not access that document. I suggest you need to turn on link
> sharing.
> 
> Kenn
> 
> On Mon, Feb 25, 2019 at 12:00 PM leerho@ <leerho@> wrote:
> 
>> Try this link:
>> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
>>
>>
>> On 2019/02/25 05:55:50, leerho <leerho@> wrote:
>> > Yes I will try that tomorrow.
>> >
>> > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <kenn@> wrote:
>> >
>> > > Can you share the Google doc with the proposal? Per Ted's advice, we
>> > > can iterate quickly there and move it to the wiki when it becomes a bit
>> > > more stable.
>> > >
>> > > Kenn
>> > >
>> > > On Fri, Feb 22, 2019 at 10:21 PM leerho@ <leerho@> wrote:
>> > >
>> > > > Thanks for the offer.  I am a neophyte at this process and email app!
>> > > > I could use a lot of help getting this off the ground!  Also, I'm not
>> > > > sure that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
>> > > >
>> > > > Lee.
>> > > >
>> > > > On 2019/02/23 06:03:58, Kenneth Knowles <kenn@> wrote:
>> > > > > Nice.
>> > > > >
>> > > > > I would very much like to help mentor this project, though you
>> > > > > already have a couple good ones.
>> > > > >
>> > > > > I concur with incubator as sponsoring entity.
>> > > > >
>> > > > > Kenn (VP Apache Beam)
>> > > > >
>> > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <leerho@> wrote:
>> > > > >
>> > > > > > I didn't realize that this mail list does not accept PDF files,
>> > > > > > apparently only text.  So let me try one more time ... :)  Please
>> > > > > > let me know if this works!
>> > > > > >
>> > > > > >
>> > > > > > = Apache DataSketches Proposal[1] =
>> > > > > >
>> > > > > > == Abstract ==
>> > > > > >
>> > > > > > DataSketches.GitHub.io is an open source, high-performance
>> library
>> > > of
>> > > > > > stochastic streaming algorithms commonly called "sketches" in
>> the
>> > > data
>> > > > > > sciences. Sketches are small, stateful programs that process
>> massive
>> > > > data
>> > > > > > as a stream and can provide approximate answers, with
>> mathematical
>> > > > > > guarantees, to computationally difficult queries
>> orders-of-magnitude
>> > > > faster
>> > > > > > than traditional, exact methods.
>> > > > > >
>> > > > > > This proposal is to move DataSketches to the Apache Software
>> > > > > > Foundation (ASF), transferring ownership of its copyright
>> > > > > > intellectual property to the ASF.  Thereafter, DataSketches would
>> > > > > > be officially known as Apache DataSketches and its evolution and
>> > > > > > governance would come under the rules and guidance of the ASF.
>> > > > > >
>> > > > > > == Introduction ==
>> > > > > >
>> > > > > > The DataSketches library contains carefully crafted
>> implementations
>> > > of
>> > > > > > sketch algorithms that meet rigorous standards of quality and
>> > > > performance
>> > > > > > and provide capabilities required for large-scale production
>> systems
>> > > > that
>> > > > > > must process and analyze massive data. The DataSketches core
>> > > > repository is
>> > > > > > written in Java with a parallel core repository written in C++
>> that
>> > > > > > includes Python wrappers. The DataSketches library also
>> includes
>> > > > special
>> > > > > > repositories for extending the core library for Apache Hive and
>> > > Apache
>> > > > Pig.
>> > > > > > The sketches developed in the different languages share a common
>> > > > > > binary storage format so that sketches created and stored in Java,
>> > > > > > for example, can be fully used in C++, and vice versa.  Because the
>> > > > > > stored sketch "images" are just a "blob" of bytes (similar to
>> > > > > > picture images), they can be shared across many different systems,
>> > > > > > languages and platforms.
>> > > > > >
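To make the "blob of bytes" idea concrete, here is a minimal sketch of how a language-neutral binary image could be written and read back. The layout (an 8-byte little-endian header followed by 64-bit hash values), the field names, and the family id are all assumptions made up for illustration; this is not the actual DataSketches storage format.

```python
import struct

# Hypothetical header layout (NOT the real DataSketches format):
# format version (1 byte), family id (1 byte), k (2 bytes), retained count (4 bytes).
# "<" fixes little-endian byte order and disables padding, so any language can parse it.
HEADER = struct.Struct("<BBHI")
FORMAT_VERSION = 1
FAMILY_ID = 7  # arbitrary illustrative value


def to_bytes(k, hashes):
    """Serialize the sketch state (k plus retained 64-bit hashes) to a byte blob."""
    ordered = sorted(hashes)
    header = HEADER.pack(FORMAT_VERSION, FAMILY_ID, k, len(ordered))
    return header + struct.pack(f"<{len(ordered)}Q", *ordered)


def from_bytes(blob):
    """Rebuild (k, hashes) from a blob produced by to_bytes."""
    version, family, k, count = HEADER.unpack_from(blob, 0)
    if version != FORMAT_VERSION or family != FAMILY_ID:
        raise ValueError("unrecognized sketch image")
    hashes = struct.unpack_from(f"<{count}Q", blob, HEADER.size)
    return k, list(hashes)


if __name__ == "__main__":
    image = to_bytes(16, [5, 3, 2**63])
    print(len(image), from_bytes(image))
```

Because the byte order and field widths are fixed explicitly, a reader written in Java or C++ could parse the same bytes without knowing which language produced them.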
>> > > > > > The DataSketches documentation website,
>> > > https://datasketches.github.io
>> > > > ,
>> > > > > > includes general tutorials, a comprehensive research section
>> with
>> > > > > > references to relevant academic papers, extensive examples for
>> using
>> > > > the
>> > > > > > core library directly as well as examples for accessing the
>> library
>> > > in
>> > > > > > Hive, Pig, and Apache Spark.
>> > > > > >
>> > > > > > The DataSketches library also includes a characterization
>> repository
>> > > > for
>> > > > > > long running test programs that are used for studying accuracy
>> and
>> > > > > > performance of these sketches over wide ranges of input
>> variables.
>> > > The
>> > > > data
>> > > > > > produced by these programs is used for generating the many
>> > > performance
>> > > > > > plots contained in the documentation website and for academic
>> > > > > > publications.
>> > > > > >
>> > > > > > The code repositories used for production are versioned and
>> > > > > > published to Maven Central at periodic intervals as the library
>> > > > > > evolves.
>> > > > > >
>> > > > > > The DataSketches library also includes several experimental
>> > > > repositories
>> > > > > > for use-cases outside the large-scale systems environments,
>> such
>> as
>> > > > > > sketches for mobile, IoT devices (Android), command-line access
>> of
>> > > the
>> > > > > > sketch library, and an experimental repository for vector-based
>> > > > sketches
>> > > > > > that performs approximate Singular Value Decomposition (SVD)
>> analysis
>> > > > that
>> > > > > > could potentially be used in Machine Learning (ML)
>> applications.
>> > > > > >
>> > > > > > == Background ==
>> > > > > >
>> > > > > > The DataSketches library was started in 2012 as an internal Yahoo
>> > > > > > project to
>> > > > > > dramatically reduce time and resources required for distinct
>> (unique)
>> > > > > > counting.  An extensive search on the Internet at the time
>> yielded a
>> > > > number
>> > > > > > of theoretical papers on stochastic streaming algorithms with
>> > > > pseudocode
>> > > > > > examples, but we did not find any usable open-source code of
>> the
>> > > > quality we
>> > > > > > felt we needed for our internal production systems.  So we
>> started a
>> > > > small
>> > > > > > project (one person) to develop our own sketches working
>> directly
>> > > from
>> > > > > > published theoretical papers.
>> > > > > >
>> > > > > > The DataSketches library was designed from the start with the
>> > > > objective of
>> > > > > > making these algorithms, usually only described in theoretical
>> > > papers,
>> > > > > > easily accessible to systems developers for use in our internal
>> > > > production
>> > > > > > systems. By necessity, the code had to be of the highest
>> quality
>> and
>> > > > > > thoroughly tested. The wide variety of our internal production
>> > > systems
>> > > > > > drove the requirement that the sketch implementations had to
>> have an
>> > > > > > absolute minimum of external, run-time dependencies in order to
>> > > > simplify
>> > > > > > integration and troubleshooting.
>> > > > > >
>> > > > > > Our internal experiments demonstrated dramatic positive impact
>> on the
>> > > > > > performance of our systems.  As a result, the DataSketches
>> library
>> > > > quickly
>> > > > > > evolved to include different types of sketches for different
>> types of
>> > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
>> algorithms,
>> > > > > > quantile/histogram algorithms, and weighted and unweighted
>> sampling
>> > > > > > algorithms.
>> > > > > >
>> > > > > > We quickly discovered that developing these sketch algorithms
>> to
>> be
>> > > > truly
>> > > > > > robust in production environments is quite difficult and
>> requires
>> > > deep
>> > > > > > understanding of the underlying mathematics and statistics as
>> well as
>> > > > > > extensive experience in developing high quality code for 24/7
>> > > > production
>> > > > > > systems. This is a difficult combination of skills for any one
>> > > > organization
>> > > > > > to collect and maintain over time. It became clear that this
>> > > technology
>> > > > > > needed a community larger than Yahoo to evolve.  In November,
>> 2015,
>> > > > this
>> > > > > > factor, along with Yahoo’s strong experience and support of
>> open
>> > > > source,
>> > > > > > led to the decision to open source this technology under an
>> Apache
>> > > 2.0
>> > > > > > license on GitHub. Since that time our community has expanded
>> > > > > > considerably and the key contributors to this effort include
>> > > > > > leading research scientists from a number of universities as well
>> > > > > > as practitioners and researchers from a number of major
>> > > > > > corporations. The core of this group is very active as we meet
>> > > > > > weekly to discuss research directions and engineering priorities.
>> > > > > >
>> > > > > > It is important to note that our internal systems at Yahoo use
>> the
>> > > > current
>> > > > > > public GitHub open source DataSketches library and not an
>> internal
>> > > > version
>> > > > > > of the code.
>> > > > > >
>> > > > > > The close collaboration of scientific research and engineering
>> > > > development
>> > > > > > experience with actual massive-data processing systems has also
>> > > > produced
>> > > > > > new research publications in the field of stochastic streaming
>> > > > algorithms,
>> > > > > > for example:
>> > > > > >
>> > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
>> > > > Rhodes, and
>> > > > > > Justin Thaler. A high-performance algorithm for identifying
>> frequent
>> > > > items
>> > > > > > in data streams. In ACM IMC 2017.
>> > > > > >
>> > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
>> > > > > > framework for estimating stream expression cardinalities. In
>> > > > > > EDBT/ICDT Proceedings ‘16, pages 6:1–6:17, 2016.
>> > > > > >
>> > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient
>> Frequent
>> > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
>> Proceedings
>> > > > ‘16,
>> > > > > > pages 845-854, 2016.
>> > > > > >
>> > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal
>> quantile
>> > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages
>> 71–78,
>> > > > 2016.
>> > > > > >
>> > > > > > * Kevin J Lang. Back to the future: an even more nearly optimal
>> > > > cardinality
>> > > > > > estimation algorithm. arXiv preprint
>> > > https://arxiv.org/abs/1708.06839,
>> > > > > > 2017.
>> > > > > >
>> > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM
>> > > > > > KDD Proceedings ‘13, pages 581–588, 2013.
>> > > > > >
>> > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
>> Jonathan
>> > > > Ullman.
>> > > > > > Space lower bounds for itemset frequency sketches. In ACM PODS
>> > > > Proceedings
>> > > > > > ‘16, pages 441–454, 2016.
>> > > > > >
>> > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
>> > > Hierarchical
>> > > > > > heavy hitters with the space saving algorithm. In SIAM ALENEX
>> > > > Proceedings
>> > > > > > ‘12, pages 160–174, 2012.
>> > > > > >
>> > > > > > == The Rationale for Sketches ==
>> > > > > >
>> > > > > > In the analysis of big data there are often problem queries
>> that
>> > > don’t
>> > > > > > scale because they require huge compute resources and time to
>> > > generate
>> > > > > > exact results. Examples include count distinct, quantiles, most
>> > > > frequent
>> > > > > > items, joins, matrix computations, and graph analysis.
>> > > > > >
>> > > > > > If we can loosen the requirement of “exact” results from our
>> queries
>> > > > and be
>> > > > > > satisfied with approximate results, within some well understood
>> > > bounds
>> > > > of
>> > > > > > error, there is an entire branch of mathematics and data
>> science
>> that
>> > > > has
>> > > > > > evolved around developing algorithms that can produce
>> approximate
>> > > > results
>> > > > > > with mathematically well-defined error properties.
>> > > > > >
>> > > > > > Adding the requirements that these algorithms be small (compared to
>> > > > > > the size of the input data), sublinear (the size of the sketch must
>> > > > > > grow at a slower rate than the size of the input stream), streaming
>> > > > > > (they can only touch each data item once), and mergeable (suitable
>> > > > > > for distributed processing) defines a class of algorithms that can
>> > > > > > be described as small, stochastic, streaming, sublinear, mergeable
>> > > > > > algorithms, commonly called sketches (they also have other names,
>> > > > > > but we will use the term sketches from here on).
>> > > > > >
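The four properties above can be seen in a toy K-Minimum-Values (KMV) distinct-count sketch. This is only an illustrative sketch under stated assumptions: the class name `KmvSketch`, the SHA-256 based hash, and the `(k - 1) / theta` estimator are choices made for this example and are not the DataSketches library API or its implementation.

```python
import hashlib
import heapq


def hash_to_unit(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) via a 64-bit hash."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64


class KmvSketch:
    """Toy KMV sketch (hypothetical, for illustration): keeps the k smallest hashes."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []        # max-heap of retained hashes, stored negated
        self._members = set()  # same hashes, used to ignore duplicates

    def _offer(self, h: float) -> None:
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item: str) -> None:
        """Streaming: each item is touched once and then forgotten."""
        self._offer(hash_to_unit(item))

    def merge(self, other: "KmvSketch") -> "KmvSketch":
        """Mergeable: the k smallest hashes of the union come from the two inputs."""
        merged = KmvSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._offer(h)
        return merged

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        theta = -self._heap[0]             # k-th smallest hash value
        return (self.k - 1) / theta


if __name__ == "__main__":
    a, b = KmvSketch(), KmvSketch()
    for i in range(60_000):
        a.update(f"user-{i}")
    for i in range(40_000, 100_000):
        b.update(f"user-{i}")
    print(round(a.merge(b).estimate()))  # approximate count of 100,000 distinct users
```

The sketch never holds more than k hash values regardless of stream length (small and sublinear), and two sketches built on different machines can be combined without revisiting the raw data.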
>> > > > > > To be truly streaming and be able to process data in a single
>> pass,
>> > > > > > sketches must make absolute minimum assumptions about the input
>> > > stream.
>> > > > > > This is critically important, as there is no “second chance” to
>> > > > process the
>> > > > > > data.
>> > > > > >
>> > > > > > For example, sketches should not make assumptions about the
>> order of
>> > > > stream
>> > > > > > items, the stream length, the dynamic range of values, or the
>> > > > distribution
>> > > > > > of item occurrence frequencies. Sketches should be tolerant of
>> NaNs,
>> > > > Nulls
>> > > > > > and empty objects. About the only thing that the sketch needs
>> to
>> know
>> > > > about
>> > > > > > the stream is how to extract items from it and what type the
>> item is,
>> > > > e.g.,
>> > > > > > is it a numeric value or a string.
>> > > > > >
>> > > > > > As far as the sketch is concerned, the input stream is a
>> sequence of
>> > > > items
>> > > > > > in some unknown random order with unknown random values.
>> > > > > >
>> > > > > > The sketch is essentially a complex state machine that, combined
>> > > > > > with the random input stream, defines a stochastic process. We then
>> > > > > > apply probabilistic methods to interpret the states of the
>> > > > > > stochastic process in order to extract useful information about the
>> > > > > > input stream itself. The resulting information will be approximate,
>> > > > > > but we also use additional probabilistic methods to extract an
>> > > > > > estimate of the likely probability distribution of error.
>> > > > > >
>> > > > > > There is a significant scientific contribution here in defining the
>> > > > > > state machine, understanding the resulting stochastic process,
>> > > > > > developing the probabilistic methods, and proving mathematically
>> > > > > > that it all works! This is why the scientific contributors to this
>> > > > > > project are a critical and strategic component to our success.  The
>> > > > > > development engineers translate the concepts of the proposed state
>> > > > > > machine and probabilistic methods into production-quality code.
>> > > > > > Even more important, they work closely with the scientists, feeding
>> > > > > > back system and user requirements, which leads not only to superior
>> > > > > > product design, but to new science as well.  A number of the
>> > > > > > scientific papers our members have published (see above) are a
>> > > > > > direct result of this close collaboration.
>> > > > > >
>> > > > > > Because sketches are small they can be processed extremely
>> fast,
>> > > often
>> > > > many
>> > > > > > orders-of-magnitude faster than traditional exact computations.
>> For
>> > > > > > interactive queries there may not be other viable alternatives,
>> and
>> > > in
>> > > > the
>> > > > > > case of real-time analysis, sketches are the only known
>> solution.
>> > > > > >
>> > > > > > For any system that needs to extract useful information from
>> > > > > > massive data, sketches are essential tools that should be tightly
>> > > > > > integrated into the system’s analysis capabilities. This technology
>> > > > > > has helped Yahoo successfully reduce data processing times from
>> > > > > > days to hours or minutes on a number of its internal platforms and
>> > > > > > has enabled subsecond queries on real-time platforms that would
>> > > > > > have been infeasible without sketches.
>> > > > > >
>> > > > > > == The Rationale for Apache DataSketches ==
>> > > > > >
>> > > > > > Other open source implementations of sketch algorithms can be
>> found
>> > > on
>> > > > the
>> > > > > > Internet. However, we have not yet found any open source
>> > > > implementations
>> > > > > > that are as comprehensive, engineered with the quality required
>> for
>> > > > > > production systems, and with usable and guaranteed error
>> properties.
>> > > > Large
>> > > > > > Internet companies, such as Google and Facebook, have published
>> > > papers
>> > > > on
>> > > > > > sketching, however, their implementations of their published
>> > > > algorithms are
>> > > > > > proprietary and not available as open source.
>> > > > > >
>> > > > > > The DataSketches library already provides integrations with a
>> number
>> > > of
>> > > > > > major Apache data processing platforms such as Apache Hive,
>> Apache
>> > > Pig,
>> > > > > > Apache Spark and Apache Druid, and is also integrated with a
>> number
>> > > of
>> > > > > > other open source data processing platforms such as Splice
>> Machine,
>> > > > GCHQ
>> > > > > > Gaffer and PostgreSQL.
>> > > > > >
>> > > > > > We believe that having DataSketches as an Apache project will
>> provide
>> > > > an
>> > > > > > immediate, worthwhile, and substantial contribution to the open
>> > > source
>> > > > > > community, will have a better opportunity to provide a
>> meaningful
>> > > > > > contribution to both the science and engineering of sketching
>> > > > algorithms,
>> > > > > > and integrate with other Apache projects.  In addition, this is
>> a
>> > > > > > significant opportunity for Apache to be the "go-to"
>> destination
>> for
>> > > > users
>> > > > > > that want to leverage this exciting technology.
>> > > > > >
>> > > > > > == Initial Goals ==
>> > > > > >
>> > > > > > We are breaking our initial goals into short-term (2-6 months) and
>> > > > > > intermediate to long-term (6 months to 2 years):
>> > > > > >
>> > > > > > Our short-term goals include:
>> > > > > >
>> > > > > > * Understanding and adapting to the Apache development process
>> and
>> > > > > > structures.
>> > > > > >
>> > > > > > * Start refactoring the codebase and move the code of the various
>> > > > > > DataSketches repositories to an Apache Git repository.
>> > > > > >
>> > > > > > * Continue development of new features, functions, and fixes.
>> > > > > >
>> > > > > > * Specific sub-projects (e.g., C++ and Python) will continue to
>> be
>> > > > > > developed and expanded.
>> > > > > >
>> > > > > >
>> > > > > > The intermediate to long term goals include:
>> > > > > >
>> > > > > > * Completing the design and implementation of the C++ sketches
>> to
>> > > > > > complement what is already available in Java, and the Python
>> wrappers
>> > > > of
>> > > > > > those C++ sketches.
>> > > > > >
>> > > > > > * Expanding the C++ build framework to include Windows and the
>> > > popular
>> > > > > > Linux variants.
>> > > > > >
>> > > > > > * Continued engagement with the scientific research community
>> on
>> the
>> > > > > > development of new algorithms for computationally difficult
>> problems
>> > > > that
>> > > > > > heretofore have not had a sketching solution.
>> > > > > >
>> > > > > > == Current Status ==
>> > > > > >
>> > > > > > The DataSketches GitHub project has been quite successful.  As
>> of
>> > > this
>> > > > > > writing (Feb, 2019) the number of downloads measured by the
>> Nexus
>> > > > > > Repository Manager at https://oss.sonatype.org has grown by
>> nearly a
>> > > > > > factor
>> > > > > > of 10 over the past year to about 55 thousand per month. The
>> > > > > > DataSketches/sketches-core repository has about 560 stars and
>> 141
>> > > > forks,
>> > > > > > which is pretty good for a highly specialized library.
>> > > > > >
>> > > > > > === Development Practices ===
>> > > > > >
>> > > > > > ==== Source Control ====
>> > > > > >
>> > > > > > All of our developers have extensive experience with Git
>> version
>> > > > control
>> > > > > > and follow accepted practices for use of Pull Requests (PRs),
>> code
>> > > > reviews
>> > > > > > and commits to master, for example.
>> > > > > >
>> > > > > > ==== Testing ====
>> > > > > >
>> > > > > > Sketches, by their nature, are probabilistic programs and don’t
>> > > > > > necessarily behave deterministically.  For some of the sketches we
>> > > > > > intentionally insert random noise into the code as this gives us
>> > > > > > the mathematical properties that we need to guarantee accuracy.
>> > > > > > This can make the behavior of these algorithms quite unintuitive
>> > > > > > and provides significant challenges to the developer who wishes to
>> > > > > > test these algorithms for correctness. As a result, our testing
>> > > > > > strategy includes two major components: unit tests and
>> > > > > > characterization tests.
>> > > > > >
>> > > > > > ===== Unit Testing =====
>> > > > > >
>> > > > > > Our unit tests are primarily quick tests to make sure that we
>> > > exercise
>> > > > all
>> > > > > > critical paths in the code and that key branches are executed
>> > > > correctly. It
>> > > > > > is important that they execute relatively fast as they are
>> generally
>> > > > run on
>> > > > > > every code build. The sketches-core repository alone has about
>> 22
>> > > > thousand
>> > > > > > statements, over 1300 unit tests and code coverage of about
>> 98.2% as
>> > > > > > measured by Atlassian/Clover.  It is our goal for all of our
>> code
>> > > > > > repositories that are used in production that they have code
>> coverage
>> > > > > > greater than 90%.
>> > > > > >
>> > > > > > ===== Characterization Testing =====
>> > > > > >
>> > > > > > In order to test the probabilistic methods that are used to
>> interpret
>> > > > the
>> > > > > > stochastic behaviors of our sketches we have a separate
>> > > > characterization
>> > > > > > repository that is dedicated to this.  To measure accuracy, for
>> > > > example,
>> > > > > > requires running thousands of trials at each of many different
>> points
>> > > > along
>> > > > > > the domain axis. Each trial compares its estimated results
>> against a
>> > > > known
>> > > > > > exact result producing an error for that trial.  These error
>> > > > measurements
>> > > > > > are then fed into our Quantiles sketch to capture the actual
>> > > > distribution
>> > > > > > of error at that point along the axis. We then select quantile
>> > > contours
>> > > > > > across all the distributions at points along the axis.  These
>> > > contours
>> > > > can
>> > > > > > then be plotted to reveal the shape of the actual error
>> distribution.
>> > > > These
>> > > > > > distributions are not at all Gaussian, in fact they can be
>> quite
>> > > > complex.
>> > > > > > Nonetheless, these distributions are then checked against our
>> > > > statistical
>> > > > > > guarantees inherent to the specific sketch algorithm and its
>> > > > parameters.
>> > > > > > There are many examples of these characterization error
>> distributions
>> > > > on
>> > > > > > our website. The runtimes of these tests can be very long and
>> can
>> > > range
>> > > > > > from many minutes to hours, and some can run for days.
>> Currently, we
>> > > > have
>> > > > > > separate characterization repositories for Java and C++ /
>> Python.
>> > > > > >
>> > > > > > It is our goal that we perform this characterization analysis
>> for all
>> > > > of
>> > > > > > our sketches.  By definition, the code that runs these
>> > > characterization
>> > > > > > tests is open-source so others can run these tests as well.  We
>> do
>> > > not
>> > > > have
>> > > > > > formal releases of this code (because it is not production
>> code)
>> and
>> > > > it is
>> > > > > > not published to Maven Central.
>> > > > > >
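The characterization procedure described above (many trials at each point along the axis, an error per trial, then quantile contours of the error distribution) can be sketched roughly as follows. This is a simplified illustration, not the project's characterization code: `kmv_estimate` is a hypothetical stand-in for the sketch under test, and exact quantiles from Python's `statistics` module are used here in place of the Quantiles sketch mentioned in the text.

```python
import hashlib
import heapq
import statistics


def kmv_estimate(items, k):
    """Stand-in estimator under test: distinct count from the k smallest hashes."""
    hashes = {
        int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "big")
        for s in items
    }
    if len(hashes) < k:
        return float(len(hashes))  # fewer than k distinct values: exact count
    kth_smallest = heapq.nsmallest(k, hashes)[-1] / 2.0**64
    return (k - 1) / kth_smallest


def characterize(stream_lengths, trials, k):
    """Return {n: (q05, q50, q95)} of relative error across many trials."""
    contours = {}
    for n in stream_lengths:
        errors = []
        for t in range(trials):
            # A distinct prefix per trial gives every trial its own input stream.
            stream = (f"trial{t}-n{n}-item{i}" for i in range(n))
            # Each trial compares the estimate against the known exact count n.
            errors.append(kmv_estimate(stream, k) / n - 1.0)
        cuts = statistics.quantiles(errors, n=20)  # 19 cut points: 5%, 10%, ..., 95%
        contours[n] = (cuts[0], cuts[9], cuts[18])
    return contours


if __name__ == "__main__":
    for n, (q05, q50, q95) in characterize([1000, 4000, 16000], 50, 256).items():
        print(f"n={n:6d}  q05={q05:+.3f}  q50={q50:+.3f}  q95={q95:+.3f}")
```

Plotting the three values against n gives the kind of error contours the text refers to; a real run would use far more trials and many more points along the axis, which is why such tests can take hours.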
>> > > > > > === Meritocracy ===
>> > > > > >
>> > > > > > DataSketches was initially developed based on requirements
>> within
>> > > > Yahoo. As
>> > > > > > a project on GitHub, DataSketches has received contributions
>> from
>> > > > numerous
>> > > > > > individual developers from around the world, dedicated research
>> work
>> > > > from
>> > > > > > senior scientists at Amazon and Visa, and academic researchers
>> from
>> > > > > > Georgetown University, Princeton, and MIT.
>> > > > > >
>> > > > > > As a project under incubation, we are committed to expanding
>> our
>> > > > effort to
>> > > > > > build an environment which supports a meritocracy. We are
>> focused on
>> > > > > > engaging the community and other related projects for support
>> and
>> > > > > > contributions. Moreover, we are committed to ensuring that
>> > > > > > contributors and committers to DataSketches come from a broad mix
>> > > > > > of organizations through a merit-based decision process during
>> > > > > > incubation. We believe
>> strongly
>> > > in
>> > > > the
>> > > > > > DataSketches premise that fulfills the concept of a well
>> engineered
>> > > and
>> > > > > > scientifically rigorous library that implements these powerful
>> > > > algorithms
>> > > > > > and are committed to growing an inclusive community of
>> DataSketches
>> > > > > > contributors and users.
>> > > > > >
>> > > > > > === Community ===
>> > > > > >
>> > > > > > Yahoo has a long history and active engagement in the Open
>> Source
>> > > > > > community. Major projects include: Vespa.ai, Bullet, Moloch,
>> > > Panoptes,
>> > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
>> TensorFlowOnSpark,
>> > > > gifshot,
>> > > > > > fluxible, as well as the creation, contribution and incubation
>> of
>> > > many
>> > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
>> > > > Zookeeper,
>> > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
>> > > > > >
>> > > > > > Every day, DataSketches is actively used by organizations and
>> > > > > > institutions around the world for batch and stream processing
>> of
>> > > data.
>> > > > We
>> > > > > > believe acceptance will allow us to consolidate existing
>> > > > > > DataSketches-related work, grow the DataSketches community, and
>> > > deepen
>> > > > > > connections between DataSketches and other open source
>> projects.
>> > > > > >
>> > > > > > === Introduction to the Core Developers & Contributors ===
>> > > > > >
>> > > > > > The core developers and contributors for DataSketches are from
>> > > diverse
>> > > > > > backgrounds, but primarily are scientists that love engineering
>> and
>> > > > > > engineers that love science. A large part of the value we bring
>> comes
>> > > > from
>> > > > > > this synthesis.  These individuals have already contributed
>> > > > substantially
>> > > > > > to the code, algorithms, and/or mathematical proofs that form
>> the
>> > > > basis of
>> > > > > > the library.
>> > > > > >
>> > > > > > This core group also forms the Initial Committers with write
>> > > > > > permissions to the repository. Those marked with (*) meet weekly to
>> > > > > > plan the research and engineering direction of the project.
>> > > > > >
>> > > > > > ==== Scientists That Love Engineering ====
>> > > > > >
>> > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
>> > > > Interests:
>> > > > > > distributed systems, scalable systems and platforms for big
>> data
>> > > > > > processing, concurrent algorithms and data structures.
>> > > > > >
>> > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
>> > > > Sunnyvale,
>> > > > > > California. Interests: algorithms, theoretical and applied
>> > > mathematics,
>> > > > > > encoding and compression theory, theoretical and applied
>> performance
>> > > > > > optimization.
>> > > > > >
>> > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI
>> Labs,
>> Palo
>> > > > Alto,
>> > > > > > California. Manages the algorithms group at Amazon AI. We build
>> > > > scalable
>> > > > > > machine learning systems and algorithms which are used both
>> > > internally
>> > > > and
>> > > > > > externally by customers of SageMaker, AWS's flagship machine
>> learning
>> > > > > > platform.
>> > > > > >
>> > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
>> Interests:
>> > > > > > Computational advertising, machine learning, speech
>> recognition,
>> > > > > > data-driven analysis, large scale experimentation, big data,
>> > > > stream/complex
>> > > > > > event processing
>> > > > > >
>> > > > > > * Justin Thaler: (*) Assistant Professor, Department of
>> Computer
>> > > > Science,
>> > > > > > Georgetown University, Washington D.C. Interests: algorithms
>> and
>> > > > > > computational complexity, complexity theory, quantum
>> algorithms,
>> > > > private
>> > > > > > data analysis, and learning theory, developing efficient
>> streaming
>> > > and
>> > > > > > sketching algorithms
>> > > > > >
>> > > > > > ==== Engineers That Love Science ====
>> > > > > >
>> > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets /
>> Snap.
>> > > > Interests:
>> > > > > > design and implementation of data storing and data processing
>> > > > (distributed)
>> > > > > > systems, performance optimization, CPU performance, mechanical
>> > > > sympathy,
>> > > > > > JVM performance, API design, databases, (concurrent) data
>> structures,
>> > > > > > memory management, garbage collection algorithms, language
>> design and
>> > > > > > runtimes (their tradeoffs), distributed systems (cloud)
>> efficiency,
>> > > > Linux,
>> > > > > > code quality, code transformation, pure functional programming
>> > > models,
>> > > > > > Haskell.
>> > > > > >
>> > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and
>> founder
>> > > > of
>> > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
>> Interests:
>> > > > > > streaming algorithms, mathematics, computer science, high
>> quality and
>> > > > high
>> > > > > > performance code for the analysis of massive data, bridging the
>> > > divide
>> > > > > > between theory and practice.
>> > > > > >
>> > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
>> Sunnyvale,
>> > > > > > California. Interests: applied mathematics, computer science,
>> big
>> > > data,
>> > > > > > distributed systems.
>> > > > > >
>> > > > > > === Introduction to Additional Interested Contributors ===
>> > > > > >
>> > > > > > These folks have been intermittently involved and contributed,
>> but
>> > > are
>> > > > > > strong supporters of this project.
>> > > > > >
>> > > > > > * Frank Grimes: GitHub ID: frankgrimes97
>> > > > > >
>> > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
>> Computer
>> > > > Science,
>> > > > > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
>> > > > > > approximation, streaming algorithms, randomized linear algebra.
>> > > > > >
>> > > > > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
>> > > > Computer
>> > > > > > Science, Research Instructor, Princeton University. Interests:
>> > > > algorithmic
>> > > > > > foundations of data science and machine learning, efficient
>> methods
>> > > for
>> > > > > > processing and understanding large datasets, often working at
>> the
>> > > > > > intersection of theoretical computer science, numerical linear
>> > > > algebra, and
>> > > > > > optimization.
>> > > > > >
>> > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
>> Computer
>> > > > Science,
>> > > > > > Professor, Warwick University, Warwick, England. Interests: all
>> > > > aspects of
>> > > > > > the "data lifecycle", from data collection and cleaning,
>> through
>> > > > mining and
>> > > > > > analytics. (Professor Cormode is one of the world’s leading
>> > > scientists
>> > > > in
>> > > > > > sketching algorithms)
>> > > > > >
>> > > > > > === Alignment ===
>> > > > > >
>> > > > > > The DataSketches library already provides integrations and
>> example
>> > > > code for
>> > > > > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated
>> into
>> > > > Apache
>> > > > > > Druid.
>> > > > > >
>> > > > > > == Known Risks ==
>> > > > > >
>> > > > > > The following subsections are specific risks that have been
>> > > identified
>> > > > by
>> > > > > > the ASF that need to be addressed.
>> > > > > >
>> > > > > > === Risk: Orphaned Products ===
>> > > > > >
>> > > > > > The DataSketches library is presently used by a number of
>> > > > organizations,
>> > > > > > from small startups to Fortune 100 companies, to construct
>> production
>> > > > > > pipelines that must process and analyze massive data. Yahoo has
>> a
>> > > > long-term
>> > > > > > commitment to continue to advance the DataSketches library;
>> moreover,
>> > > > > > DataSketches is seeing increasing interest, development, and
>> adoption
>> > > > from
>> > > > > > many diverse organizations from around the world. Due to its
>> growing
>> > > > > > adoption, we feel it is quite unlikely that this project would
>> become
>> > > > > > orphaned.
>> > > > > >
>> > > > > > === Risk: Inexperience with Open Source ===
>> > > > > >
>> > > > > > Yahoo believes strongly in open source and the exchange of
>> > > information
>> > > > to
>> > > > > > advance new ideas and work. Examples of this commitment are
>> active
>> > > open
>> > > > > > source projects such as those mentioned above. With
>> DataSketches, we
>> > > > have
>> > > > > > been increasingly open and forward-looking; we have published a
>> > > number
>> > > > of
>> > > > > > papers about breakthrough developments in the science of
>> streaming
>> > > > > > algorithms (mentioned above) that also reference the
>> DataSketches
>> > > > library.
>> > > > > > Our submission to the Apache Software Foundation is a logical
>> > > > extension of
>> > > > > > our commitment to open source software.
>> > > > > >
>> > > > > > Key committers at Yahoo with strong open source backgrounds
>> include
>> > > > Aaron
>> > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky,
>> > > Andrews
>> > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan
>> Call,
>> > > Daryn
>> > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar
>> > > Hillel,
>> > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
>> > > Perez-Sorrosal,
>> > > > Gil
>> > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher,
>> James
>> > > > Penick,
>> > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
>> Eagles,
>> > > > Kihwal
>> > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
>> Trelinski,
>> > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
>> > > Natkovich,
>> > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby
>> Loo,
>> > > > Ryan
>> > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit
>> Chan,
>> > > Sri
>> > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many
>> more.
>> > > > > >
>> > > > > > All of our core developers are committed to learning about the
>> Apache
>> > > > process
>> > > > > > and to give back to the community.
>> > > > > >
>> > > > > > === Risk: Homogeneous Developers ===
>> > > > > >
>> > > > > > The majority of committers in this proposal belong to Yahoo due
>> to
>> > > the
>> > > > fact
>> > > > > > that DataSketches has emerged from an internal Yahoo project.
>> This
>> > > > proposal
>> > > > > > also includes developers and contributors from other companies,
>> and
>> > > > who are
>> > > > > > actively involved with other Apache projects, such as Druid. 
>> We
>> > > > expect our
>> > > > > > entry into incubation will allow us to expand the number of
>> > > > individuals and
>> > > > > > organizations participating in DataSketches development.
>> > > > > >
>> > > > > > === Risk: Reliance on Salaried Developers ===
>> > > > > >
>> > > > > > Because the DataSketches library originated within Yahoo, it
>> has
>> been
>> > > > > > developed primarily by salaried Yahoo developers and we expect
>> that
>> > > to
>> > > > > > continue to be the case near term. However, since we placed
>> this
>> > > > library
>> > > > > > into open-source we have had a number of significant
>> contributions
>> > > from
>> > > > > > engineers and scientists from outside of Yahoo. We expect our
>> > > reliance
>> > > > on
>> > > > > > Yahoo salaried developers will decrease over time. Nonetheless,
>> Yahoo
>> > > > is
>> > > > > > committed to continue its strong support of this important
>> project.
>> > > > > >
>> > > > > > === Risk: Lack of Relationship to other Apache Products ===
>> > > > > >
>> > > > > > DataSketches already directly interoperates with or utilizes
>> several
>> > > > > > existing Apache projects.
>> > > > > >
>> > > > > > * Build
>> > > > > >    * Apache Maven
>> > > > > >
>> > > > > > * Integrations and adaptors for the following projects
>> naturally
>> have
>> > > > them
>> > > > > > as dependencies
>> > > > > >    * Apache Hive
>> > > > > >    * Apache Pig
>> > > > > >    * Apache Druid
>> > > > > >    * Apache Spark
>> > > > > >
>> > > > > > * Additional dependencies for the above integrations and
>> adaptors
>> > > > include
>> > > > > >    * Apache Hadoop
>> > > > > >    * Apache Commons (Math)
>> > > > > >
>> > > > > > There is no other Apache project that we are aware of that
>> duplicates
>> > > > the
>> > > > > > functionality of the DataSketches library.
>> > > > > >
>> > > > > > === Risk: An Excessive Fascination with the Apache Brand ===
>> > > > > >
>> > > > > > With this proposal we are not seeking attention or publicity.
>> Rather,
>> > > > we
>> > > > > > firmly believe in the DataSketches library and concept and the
>> > > ability
>> > > > to
>> > > > > > make the DataSketches library a powerful, yet simple-to-use
>> toolkit
>> > > for
>> > > > > > data processing. While the DataSketches library has been open
>> source,
>> > > > we
>> > > > > > believe putting code on GitHub can only go so far. We see the
>> Apache
>> > > > > > community, processes, and mission as critical for ensuring the
>> > > > DataSketches
>> > > > > > library is truly community-driven, positively impactful, and
>> > > innovative
>> > > > > > open source software. While Yahoo has taken a number of steps
>> to
>> > > > advance
>> > > > > > its various open source projects, we believe the DataSketches
>> library
>> > > > > > project is a great fit for the Apache Software Foundation due
>> to
>> its
>> > > > focus
>> > > > > > on data processing and its relationships to existing ASF
>> projects.
>> > > > > >
>> > > > > > === Risk: Cryptography ===
>> > > > > >
>> > > > > > DataSketches does not contain any cryptographic code and is not
>> a
>> > > > > > cryptographic product.
>> > > > > >
>> > > > > > == Documentation ==
>> > > > > >
>> > > > > > The following documentation is relevant to this proposal.
>> Relevant
>> > > > portions
>> > > > > > of the documentation will be contributed to the Apache
>> DataSketches
>> > > > > > project.
>> > > > > >
>> > > > > > * DataSketches website: https://datasketches.github.io.
>> > > > > >
>> > > > > > * DataSketches website repository:
>> > > > > > https://github.com/DataSketches/DataSketches.github.io
>> > > > > >
>> > > > > > We will need an Apache website for this documentation similar
>> to
>> > > > > >
>> > > > > > * https://datasketches.apache.org
>> > > > > >
>> > > > > > == Initial Source ==
>> > > > > >
>> > > > > > The initial source for DataSketches which we will submit to the
>> > > Apache
>> > > > > > Foundation will include a number of repositories which are
>> currently
>> > > > hosted
>> > > > > > under the GitHub.com/datasketches organization:
>> > > > > >
>> > > > > > All github.com/datasketches repositories including:
>> > > > > >
>> > > > > > * Java
>> > > > > >    * sketches-core: This repository has the core sketching
>> classes,
>> > > > which
>> > > > > > are leveraged by some of the other repositories. This
>> repository
>> has
>> > > no
>> > > > > > external dependencies outside of the DataSketches/memory
>> repository,
>> > > > Java
>> > > > > > and TestNG for unit tests. This code is versioned and the
>> latest
>> > > > release
>> > > > > > can be obtained from Maven Central.
>> > > > > >    * memory: Low level, high-performance memory data-structure
>> > > > management
>> > > > > > primarily for off-heap.
>> > > > > >    * sketches-android: This is a new repository dedicated to
>> sketches
>> > > > > > designed to be run in a mobile client, such as a cell phone. It
>> is
>> > > > still in
>> > > > > > development and should be considered experimental.
>> > > > > >    * sketches-hive: This repository contains Hive UDFs and
>> UDAFs
>> for
>> > > > use
>> > > > > > within Hadoop grid environments. This code has dependencies on
>> > > > > > sketches-core as well as Hadoop and Hive. Users of this code
>> are
>> > > > advised to
>> > > > > > use Maven to bring in all the required dependencies. This code
>> is
>> > > > versioned
>> > > > > > and the latest release can be obtained from Maven Central.
>> > > > > >    * sketches-pig: This repository contains Pig User Defined
>> > > Functions
>> > > > > > (UDF) for use within Hadoop grid environments. This code has
>> > > > dependencies
>> > > > > > on sketches-core as well as Hadoop and Pig. Users of this code
>> are
>> > > > advised
>> > > > > > to use Maven to bring in all the required dependencies. This
>> code is
>> > > > > > versioned and the latest release can be obtained from Maven
>> Central.
>> > > > > >    * sketches-vector: This is a new repository dedicated to
>> sketches
>> > > > for
>> > > > > > vector and matrix operations. It is still somewhat
>> experimental.
>> > > > > >    * characterization: This relatively new repository is for
>> code
>> > > that
>> > > > we
>> > > > > > use to characterize the accuracy and speed performance of the
>> > > sketches
>> > > > in
>> > > > > > the library and is constantly being updated. Examples of the
>> job
>> > > > command
>> > > > > > files used for various tests can be found in the
>> src/main/resources
>> > > > > > directory. Some of these tests can run for hours depending on
>> their
>> > > > > > configuration.
>> > > > > >    * experimental: This repository is an experimental staging
>> area
>> > > for
>> > > > code
>> > > > > > that will eventually end up in another repository. This code is
>> not
>> > > > > > versioned and not registered with Maven Central.
>> > > > > >    * sketches-misc: Demos and other code not related to
>> production
>> > > > > > deployment
>> > > > > >
>> > > > > > * C++ and Python
>> > > > > >    * sketches-core-cpp: This is the C++/Python companion to the
>> Java
>> > > > > > sketches-core. These implementations are binary compatible with
>> their
>> > > > > > counterparts in Java. In other words, a sketch created and
>> stored in
>> > > > C++
>> > > > > > can be opened and read in Java and vice versa. This site also
>> has our
>> > > > > > Python adaptors that basically wrap the C++ implementations,
>> making
>> > > the
>> > > > > > high performance C++ implementations available from Python.
>> > > > > >    * sketches-postgres: This site provides the
>> postgres-specific
>> > > > adaptors
>> > > > > > that wrap the C++ implementations making them available to the
>> > > Postgres
>> > > > > > database users.
>> > > > > >    * characterization-cpp: This is the C++/Python companion to
>> the
>> > > Java
>> > > > > > characterization repository.
>> > > > > >    * experimental-cpp: This repository is an experimental
>> staging
>> > > area
>> > > > for
>> > > > > > C++ code that will eventually end up in another repository.
>> > > > > >
>> > > > > > * Command-Line Tools
>> > > > > >    * sketches-cmd
>> > > > > >    * homebrew-sketches
>> > > > > >    * homebrew-sketches-cmd
>> > > > > >
>> > > > > > These projects have always been Apache 2.0 licensed. We intend
>> to
>> > > > bundle
>> > > > > > all of these repositories since they are all complementary and
>> should
>> > > > be
>> > > > > > maintained in one project. Prior to our submission, we will
>> combine
>> > > > all of
>> > > > > > these projects into a new git repository.
>> > > > > >
>> > > > > > == Source and Intellectual Property Submission Plan ==
>> > > > > >
>> > > > > > Contributors to the DataSketches project have also signed the
>> Yahoo
>> > > > > > Individual Contributor License Agreement (
>> > > > https://yahoocla.herokuapp.com/ )
>> > > > > > in order to contribute to the project.
>> > > > > >
>> > > > > > With respect to trademark rights, Yahoo does not hold a
>> trademark on
>> > > > the
>> > > > > > phrase “DataSketches.” Based on feedback and guidance we
>> receive
>> > > > during the
>> > > > > > incubation process, we are open to renaming the project if
>> necessary
>> > > > for
>> > > > > > trademark or other concerns, but we would prefer not to have to
>> do
>> > > > that.
>> > > > > >
>> > > > > > == External Dependencies ==
>> > > > > >
>> > > > > > All external dependencies are licensed under an Apache 2.0 or
>> > > > > > Apache-compatible license. As we grow the DataSketches
>> community
>> we
>> > > > will
>> > > > > > configure our build process to require and validate that all
>> contributions
>> > > > and
>> > > > > > dependencies are licensed under the Apache 2.0 license or are
>> under
>> > > an
>> > > > > > Apache-compatible license.
>> > > > > >
>> > > > > > == Required Resources ==
>> > > > > >
>> > > > > > === Mailing Lists ===
>> > > > > >
>> > > > > > We currently use a mix of mailing lists. We will migrate our
>> existing
>> > > > > > mailing lists to the following:
>> > > > > >
>> > > > > > * dev@.apache
>> > > > > >
>> > > > > > * user@.apache
>> > > > > >
>> > > > > > * private@.apache
>> > > > > >
>> > > > > > * commits@.apache
>> > > > > >
>> > > > > > === Source Control ===
>> > > > > >
>> > > > > > The DataSketches team currently uses Git and would like to
>> continue
>> > > to
>> > > > do
>> > > > > > so. We request a Git repository for DataSketches with mirroring
>> to
>> > > > GitHub
>> > > > > > enabled, similar to the following:
>> > > > > >
>> > > > > > * https://github.com/apache/incubator-datasketches.git
>> > > > > >
>> > > > > > === Issue Tracking ===
>> > > > > >
>> > > > > > We request the creation of an Apache-hosted JIRA. The
>> DataSketches
>> > > > project
>> > > > > > is currently using the public GitHub issue tracker and the
>> public
>> > > > Google
>> > > > > > Groups forum/sketches-user for issue tracking and discussions.
>> We
>> > > will
>> > > > > > migrate and combine from these two sources to the Apache JIRA.
>> > > > > >
>> > > > > > Proposed Jira ID: DATASKETCHES
>> > > > > >
>> > > > > > == Initial Committers ==
>> > > > > >
>> > > > > > The following individuals have been extremely active in
>> our
>> > > > > > community and should have write (commit) permissions to the
>> > > repository.
>> > > > > >
>> > > > > > * Eshcar Hillel                      [eshcar at verizonmedia
>> dot
>> com]
>> > > > > >
>> > > > > > * Kevin Lang                    [langk at verizonmedia dot com]
>> > > > > >
>> > > > > > * Roman Leventov              [roman.leventov at c.metamarkets
>> dot
>> > > com]
>> > > > > >
>> > > > > > * Edo Liberty                   [libertye at amazon dot com]
>> > > > > >
>> > > > > > * Jon Malkin                    [jmalkin at verizonmedia dot
>> com]
>> > > > > >
>> > > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot com]
>> &
>> > > > [leerho
>> > > > > > at gmail dot com]
>> > > > > >
>> > > > > > * Alexander Saydakov         [saydakov at verizonmedia dot com]
>> > > > > >
>> > > > > > * Justin Thaler                 [justin.thaler at georgetown
>> dot
>> edu]
>> > > > > >
>> > > > > > == Affiliations ==
>> > > > > >
>> > > > > > The initial committers are from four organizations: Yahoo,
>> Amazon,
>> > > > > > Georgetown University, and Metamarkets/Snap.
>> > > > > >
>> > > > > > === Champion ===
>> > > > > > (Recommended to me: )
>> > > > > >
>> > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613
>> at
>> > > > apache
>> > > > > > dot org]
>> > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
>> > > > > >
>> > > > > > === Nominated Mentors ===
>> > > > > > (Recommended to me: )
>> > > > > >
>> > > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613
>> at
>> > > > apache
>> > > > > > dot org]
>> > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
>> > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
>> > > > > >
>> > > > > > === Sponsoring Entity ===
>> > > > > >
>> > > > > > * The Apache Incubator    **** This is our 1st choice ****
>> > > > > >
>> > > > > > * Apache Druid. The incubating Apache Druid project might also
>> be a
>> > > > logical
>> > > > > > sponsor. However, DataSketches has applications in many areas
>> of
>> > > > computing
>> > > > > > outside of Druid so our preference and recommendation is that
>> > > > DataSketches
>> > > > > > would ultimately be a top-level Apache project.
>> > > > > >
>> > > > > > ________________
>> > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with
>> previously
>> > > > acquired
>> > > > > > AOL. The merged entity was originally called Oath, Inc., but
>> has
>> > > > recently
>> > > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary of
>> > > Verizon,
>> > > > > > Inc.  Since Yahoo is the more recognized name, references in
>> this
>> > > > document
>> > > > > > to Yahoo, are also a reference to Verizon Media, Inc.
>> > > > > >
>> > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <kenn@> wrote:
>> > > > > >
>> > > > > > > The subject line has me interested already. Follow examples
>> like
>> > > this
>> > > > > > > maybe?
>> > > > > > >
>> > > > > > > 1.
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
>> > > > > > > 2.
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
>> > > > > > >
>> > > > > > > Kenn
>> > > > > > >
>> > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <leerho@> wrote:
>> > > > > > >
>> > > > > > > > I'll try again ... :)
>> > > > > > > >
>> > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunning@> wrote:
>> > > > > > > >
>> > > > > > > >> It didn't make it again
>> > > > > > > >>
>> > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <leerho@> wrote:
>> > > > > > > >>
>> > > > > > > >> > I'm not sure the attached document made it through.
>> > > > > > > >> >
>> > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <leerho@> wrote:
>> > > > > > > >> >
>> > > > > > > >> > >
>> > > > > > > >> > >
>> > > > > > > >> >
>> > > > > > > >>
>> > > > > > > >
>> > > > > > > >
>> > > >
>> ---------------------------------------------------------------------
>> > > > > > > > To unsubscribe, e-mail: general-unsubscribe@.apache
>> > > > > > > > For additional commands, e-mail: general-help@.apache
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> > --
>> > From my cell phone.
>> >
>>
>>
>>





--
Sent from: http://apache-incubator-general.996316.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
