It worked. I've updated the shortlink to point to your doc.

Kenn
On Tue, Feb 26, 2019 at 4:02 PM Liang Chen <chenliang6...@gmail.com> wrote:

> Hi Kenneth
>
> Please try this link:
>
> https://docs.google.com/document/d/1_cnesVLtKqPeUYxJvsd_2MTFwgeC1wUqI6cDPCbBRSM/edit#heading=h.97rxea60t2yw
>
> Regards
> Liang
>
> Kenneth Knowles wrote
> > I could not access that document. I suggest you need to turn on link sharing.
> >
> > Kenn
> >
> > On Mon, Feb 25, 2019 at 12:00 PM leerho@ <leerho@> wrote:
> >
> >> Try this link:
> >> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> >>
> >> On 2019/02/25 05:55:50, leerho <leerho@> wrote:
> >> > Yes I will try that tomorrow.
> >> >
> >> > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <kenn@> wrote:
> >> >
> >> > > Can you share the Google doc with the proposal? Per Ted's advice, we can iterate quickly there and move it to the wiki when it becomes a bit more stable.
> >> > >
> >> > > Kenn
> >> > >
> >> > > On Fri, Feb 22, 2019 at 10:21 PM leerho@ <leerho@> wrote:
> >> > >
> >> > > > Thanks for the offer. I am a neophyte at this process and email app! I could use a lot of help getting this off the ground! Also, I'm not sure that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
> >> > > >
> >> > > > Lee.
> >> > > >
> >> > > > On 2019/02/23 06:03:58, Kenneth Knowles <kenn@> wrote:
> >> > > > > Nice.
> >> > > > >
> >> > > > > I would very much like to help mentor this project, though you already have a couple good ones.
> >> > > > >
> >> > > > > I concur with incubator as sponsoring entity.
> >> > > > >
> >> > > > > Kenn (VP Apache Beam)
> >> > > > >
> >> > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <leerho@> wrote:
> >> > > > >
> >> > > > > > I didn't realize that this mail list does not accept PDF files, apparently only text. So let me try one more time ... :) Please let me know if this works!
> >> > > > > >
> >> > > > > > = Apache DataSketches Proposal[1] =
> >> > > > > >
> >> > > > > > == Abstract ==
> >> > > > > >
> >> > > > > > DataSketches.GitHub.io is an open source, high-performance library of stochastic streaming algorithms commonly called "sketches" in the data sciences. Sketches are small, stateful programs that process massive data as a stream and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders-of-magnitude faster than traditional, exact methods.
> >> > > > > >
> >> > > > > > This proposal is to move DataSketches to the Apache Software Foundation (ASF), transferring ownership of its copyright intellectual property to the ASF. Thereafter, DataSketches would be officially known as Apache DataSketches and its evolution and governance would come under the rules and guidance of the ASF.
> >> > > > > >
> >> > > > > > == Introduction ==
> >> > > > > >
> >> > > > > > The DataSketches library contains carefully crafted implementations of sketch algorithms that meet rigorous standards of quality and performance and provide capabilities required for large-scale production systems that must process and analyze massive data.
> >> > > > > > The DataSketches core repository is written in Java with a parallel core repository written in C++ that includes Python wrappers. The DataSketches library also includes special repositories for extending the core library for Apache Hive and Apache Pig. The sketches developed in the different languages share a common binary storage format so that sketches created and stored in Java, for example, can be fully used in C++, and vice versa. Because the stored sketch "images" are just a "blob" of bytes (similar to picture images), they can be shared across many different systems, languages and platforms.
> >> > > > > >
> >> > > > > > The DataSketches documentation website, https://datasketches.github.io, includes general tutorials, a comprehensive research section with references to relevant academic papers, extensive examples for using the core library directly as well as examples for accessing the library in Hive, Pig, and Apache Spark.
> >> > > > > >
> >> > > > > > The DataSketches library also includes a characterization repository for long-running test programs that are used for studying accuracy and performance of these sketches over wide ranges of input variables. The data produced by these programs is used for generating the many performance plots contained in the documentation website and for academic publications.
> >> > > > > >
> >> > > > > > The code repositories used for production are versioned and published to Maven Central at periodic intervals as the library evolves.
> >> > > > > >
> >> > > > > > The DataSketches library also includes several experimental repositories for use-cases outside the large-scale systems environments, such as sketches for mobile, IoT devices (Android), command-line access of the sketch library, and an experimental repository for vector-based sketches that performs approximate Singular Value Decomposition (SVD) analysis that could potentially be used in Machine Learning (ML) applications.
> >> > > > > >
> >> > > > > > == Background ==
> >> > > > > >
> >> > > > > > The DataSketches library was started in 2012 as an internal Yahoo project to dramatically reduce time and resources required for distinct (unique) counting. An extensive search on the Internet at the time yielded a number of theoretical papers on stochastic streaming algorithms with pseudocode examples, but we did not find any usable open-source code of the quality we felt we needed for our internal production systems. So we started a small project (one person) to develop our own sketches working directly from published theoretical papers.
> >> > > > > >
> >> > > > > > The DataSketches library was designed from the start with the objective of making these algorithms, usually only described in theoretical papers, easily accessible to systems developers for use in our internal production systems.
> >> > > > > > By necessity, the code had to be of the highest quality and thoroughly tested. The wide variety of our internal production systems drove the requirement that the sketch implementations had to have an absolute minimum of external, run-time dependencies in order to simplify integration and troubleshooting.
> >> > > > > >
> >> > > > > > Our internal experiments demonstrated dramatic positive impact on the performance of our systems. As a result, the DataSketches library quickly evolved to include different types of sketches for different types of queries, such as frequent-items (a.k.a. heavy-hitters) algorithms, quantile/histogram algorithms, and weighted and unweighted sampling algorithms.
> >> > > > > >
> >> > > > > > We quickly discovered that developing these sketch algorithms to be truly robust in production environments is quite difficult and requires deep understanding of the underlying mathematics and statistics as well as extensive experience in developing high quality code for 24/7 production systems. This is a difficult combination of skills for any one organization to collect and maintain over time. It became clear that this technology needed a community larger than Yahoo to evolve. In November 2015, this factor, along with Yahoo’s strong experience and support of open source, led to the decision to open source this technology under an Apache 2.0 license on GitHub.
> >> > > > > > Since that time our community has expanded considerably and the key contributors to this effort include leading research scientists from a number of universities as well as practitioners and researchers from a number of major corporations. The core of this group is very active as we meet weekly to discuss research directions and engineering priorities.
> >> > > > > >
> >> > > > > > It is important to note that our internal systems at Yahoo use the current public GitHub open source DataSketches library and not an internal version of the code.
> >> > > > > >
> >> > > > > > The close collaboration of scientific research and engineering development experience with actual massive-data processing systems has also produced new research publications in the field of stochastic streaming algorithms, for example:
> >> > > > > >
> >> > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes, and Justin Thaler. A high-performance algorithm for identifying frequent items in data streams. In ACM IMC 2017.
> >> > > > > >
> >> > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A framework for estimating stream expression cardinalities. In EDBT/ICDT Proceedings ‘16, pages 6:1–6:17, 2016.
> >> > > > > >
> >> > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings ‘16, pages 845–854, 2016.
> >> > > > > >
> >> > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile approximation in streams.
> >> > > > > > In IEEE FOCS Proceedings ‘16, pages 71–78, 2016.
> >> > > > > >
> >> > > > > > * Kevin J. Lang. Back to the future: an even more nearly optimal cardinality estimation algorithm. arXiv preprint https://arxiv.org/abs/1708.06839, 2017.
> >> > > > > >
> >> > > > > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD Proceedings ‘13, pages 581–588, 2013.
> >> > > > > >
> >> > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman. Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings ‘16, pages 441–454, 2016.
> >> > > > > >
> >> > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical heavy hitters with the space saving algorithm. In SIAM ALENEX Proceedings ‘12, pages 160–174, 2012.
> >> > > > > >
> >> > > > > > == The Rationale for Sketches ==
> >> > > > > >
> >> > > > > > In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources and time to generate exact results. Examples include count distinct, quantiles, most frequent items, joins, matrix computations, and graph analysis.
> >> > > > > >
> >> > > > > > If we can loosen the requirement of “exact” results from our queries and be satisfied with approximate results, within some well understood bounds of error, there is an entire branch of mathematics and data science that has evolved around developing algorithms that can produce approximate results with mathematically well-defined error properties.
> >> > > > > >
> >> > > > > > Adding the requirements that these algorithms must be small (compared to the size of the input data), sublinear (the size of the sketch must grow at a slower rate than the size of the input stream), streaming (they can only touch each data item once), and mergeable (suitable for distributed processing) defines a class of algorithms that can be described as small, stochastic, streaming, sublinear, mergeable algorithms, commonly called sketches (they also have other names, but we will use the term sketches from here on).
> >> > > > > >
> >> > > > > > To be truly streaming and be able to process data in a single pass, sketches must make absolute minimum assumptions about the input stream. This is critically important, as there is no “second chance” to process the data.
> >> > > > > >
> >> > > > > > For example, sketches should not make assumptions about the order of stream items, the stream length, the dynamic range of values, or the distribution of item occurrence frequencies. Sketches should be tolerant of NaNs, Nulls and empty objects. About the only thing that the sketch needs to know about the stream is how to extract items from it and what type the item is, e.g., is it a numeric value or a string.
> >> > > > > >
> >> > > > > > As far as the sketch is concerned, the input stream is a sequence of items in some unknown random order with unknown random values.
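
To make the properties described above concrete, here is a toy K-Minimum-Values (KMV) distinct-count sketch in Python. It is an illustration only and is not taken from the DataSketches code base: the class name `KMVSketch`, its method names, and the default `k` are assumptions made for this example. The sketch retains at most `k` hash values no matter how long the stream is (small, bounded size), looks at each item exactly once (streaming), and can absorb another sketch built over a different part of the data (mergeable).

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustrative, hypothetical API)."""

    def __init__(self, k=1024):
        if k < 2:
            raise ValueError("k must be at least 2")
        self.k = k
        self._heap = []        # max-heap (negated values) of the k smallest hashes
        self._members = set()  # the same hash values, for duplicate detection

    @staticmethod
    def _hash(item):
        # Map any item to a pseudo-uniform float in [0, 1) via its repr().
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def _offer(self, h):
        if h in self._members:
            return  # duplicate item (or hash): state is unchanged
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        """Process one stream item; each item is touched exactly once."""
        self._offer(self._hash(item))

    def merge(self, other):
        """Fold in another sketch built with the same k and hash function."""
        for h in other._members:
            self._offer(h)

    def estimate(self):
        """Approximate number of distinct items seen so far."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest
```

Two sketches fed from different partitions of a stream can be combined with `a.merge(b)` and then queried with `a.estimate()`. For this estimator the relative standard error is roughly 1/sqrt(k - 2), so a larger `k` trades memory for accuracy.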
> >> > > > > >
> >> > > > > > The sketch is essentially a complex state machine and, combined with the random input stream, defines a stochastic process. We then apply probabilistic methods to interpret the states of the stochastic process in order to extract useful information about the input stream itself. The resulting information will be approximate, but we also use additional probabilistic methods to extract an estimate of the likely probability distribution of error.
> >> > > > > >
> >> > > > > > There is a significant scientific contribution here: defining the state machine, understanding the resulting stochastic process, developing the probabilistic methods, and proving mathematically that it all works! This is why the scientific contributors to this project are a critical and strategic component to our success. The development engineers translate the concepts of the proposed state machine and probabilistic methods into production-quality code. Even more important, they work closely with the scientists, feeding back system and user requirements, which leads not only to superior product design, but to new science as well. A number of scientific papers our members have published (see above) are a direct result of this close collaboration.
> >> > > > > >
> >> > > > > > Because sketches are small they can be processed extremely fast, often many orders-of-magnitude faster than traditional exact computations.
> >> > > > > > For interactive queries there may not be other viable alternatives, and in the case of real-time analysis, sketches are the only known solution.
> >> > > > > >
> >> > > > > > For any system that needs to extract useful information from massive data, sketches are essential tools that should be tightly integrated into the system’s analysis capabilities. This technology has helped Yahoo successfully reduce data processing times from days to hours or minutes on a number of its internal platforms and has enabled subsecond queries on real-time platforms that would have been infeasible without sketches.
> >> > > > > >
> >> > > > > > == The Rationale for Apache DataSketches ==
> >> > > > > >
> >> > > > > > Other open source implementations of sketch algorithms can be found on the Internet. However, we have not yet found any open source implementations that are as comprehensive, engineered with the quality required for production systems, and with usable and guaranteed error properties. Large Internet companies, such as Google and Facebook, have published papers on sketching; however, their implementations of their published algorithms are proprietary and not available as open source.
> >> > > > > >
> >> > > > > > The DataSketches library already provides integrations with a number of major Apache data processing platforms such as Apache Hive, Apache Pig, Apache Spark and Apache Druid, and is also integrated with a number of other open source data processing platforms such as Splice Machine, GCHQ Gaffer and PostgreSQL.
> >> > > > > >
> >> > > > > > We believe that DataSketches as an Apache project will provide an immediate, worthwhile, and substantial contribution to the open source community, will have a better opportunity to provide a meaningful contribution to both the science and engineering of sketching algorithms, and will integrate with other Apache projects. In addition, this is a significant opportunity for Apache to be the "go-to" destination for users that want to leverage this exciting technology.
> >> > > > > >
> >> > > > > > == Initial Goals ==
> >> > > > > >
> >> > > > > > We are breaking our initial goals into short-term (2-6 months) and intermediate to long-term (6 months to 2 years):
> >> > > > > >
> >> > > > > > Our short-term goals include:
> >> > > > > >
> >> > > > > > * Understanding and adapting to the Apache development process and structures.
> >> > > > > >
> >> > > > > > * Starting to refactor the codebase and moving the code of the various DataSketches repositories to an Apache Git repository.
> >> > > > > >
> >> > > > > > * Continuing development of new features, functions, and fixes.
> >> > > > > >
> >> > > > > > * Continuing to develop and expand specific sub-projects (e.g., C++ and Python).
> >> > > > > >
> >> > > > > > The intermediate to long-term goals include:
> >> > > > > >
> >> > > > > > * Completing the design and implementation of the C++ sketches to complement what is already available in Java, and the Python wrappers of those C++ sketches.
> >> > > > > >
> >> > > > > > * Expanding the C++ build framework to include Windows and the popular Linux variants.
> >> > > > > >
> >> > > > > > * Continued engagement with the scientific research community on the development of new algorithms for computationally difficult problems that heretofore have not had a sketching solution.
> >> > > > > >
> >> > > > > > == Current Status ==
> >> > > > > >
> >> > > > > > The DataSketches GitHub project has been quite successful. As of this writing (February 2019) the number of downloads measured by the Nexus Repository Manager at https://oss.sonatype.org has grown by nearly a factor of 10 over the past year to about 55 thousand per month. The DataSketches/sketches-core repository has about 560 stars and 141 forks, which is pretty good for a highly specialized library.
> >> > > > > >
> >> > > > > > === Development Practices ===
> >> > > > > >
> >> > > > > > ==== Source Control ====
> >> > > > > >
> >> > > > > > All of our developers have extensive experience with Git version control and follow accepted practices for use of Pull Requests (PRs), code reviews and commits to master, for example.
> >> > > > > >
> >> > > > > > ==== Testing ====
> >> > > > > >
> >> > > > > > Sketches, by their nature, are probabilistic programs and don’t necessarily behave deterministically.
> >> > > > > > For some of the sketches we intentionally insert random noise into the code as this gives us the mathematical properties that we need to guarantee accuracy. This can make the behavior of these algorithms quite unintuitive and presents significant challenges to the developer who wishes to test these algorithms for correctness. As a result, our testing strategy includes two major components: unit tests and characterization tests.
> >> > > > > >
> >> > > > > > ===== Unit Testing =====
> >> > > > > >
> >> > > > > > Our unit tests are primarily quick tests to make sure that we exercise all critical paths in the code and that key branches are executed correctly. It is important that they execute relatively fast as they are generally run on every code build. The sketches-core repository alone has about 22 thousand statements, over 1300 unit tests and code coverage of about 98.2% as measured by Atlassian/Clover. It is our goal for all of our code repositories that are used in production that they have code coverage greater than 90%.
> >> > > > > >
> >> > > > > > ===== Characterization Testing =====
> >> > > > > >
> >> > > > > > In order to test the probabilistic methods that are used to interpret the stochastic behaviors of our sketches we have a separate characterization repository that is dedicated to this. Measuring accuracy, for example, requires running thousands of trials at each of many different points along the domain axis.
> >> > > > > > Each trial compares its estimated results against a known exact result, producing an error for that trial. These error measurements are then fed into our Quantiles sketch to capture the actual distribution of error at that point along the axis. We then select quantile contours across all the distributions at points along the axis. These contours can then be plotted to reveal the shape of the actual error distribution. These distributions are not at all Gaussian; in fact they can be quite complex. Nonetheless, these distributions are then checked against the statistical guarantees inherent to the specific sketch algorithm and its parameters. There are many examples of these characterization error distributions on our website. The runtimes of these tests can be very long and can range from many minutes to hours, and some can run for days. Currently, we have separate characterization repositories for Java and C++ / Python.
> >> > > > > >
> >> > > > > > It is our goal that we perform this characterization analysis for all of our sketches. By definition, the code that runs these characterization tests is open-source so others can run these tests as well. We do not have formal releases of this code (because it is not production code) and it is not published to Maven Central.
> >> > > > > >
> >> > > > > > === Meritocracy ===
> >> > > > > >
> >> > > > > > DataSketches was initially developed based on requirements within Yahoo.
> >> > > > > > As a project on GitHub, DataSketches has received contributions from numerous individual developers from around the world, dedicated research work from senior scientists at Amazon and Visa, and academic researchers from Georgetown University, Princeton, and MIT.
> >> > > > > >
> >> > > > > > As a project under incubation, we are committed to expanding our effort to build an environment which supports a meritocracy. We are focused on engaging the community and other related projects for support and contributions. Moreover, we are committed to ensuring that contributors and committers to DataSketches come from a broad mix of organizations through a merit-based decision process during incubation. We believe strongly in the DataSketches premise that fulfills the concept of a well engineered and scientifically rigorous library that implements these powerful algorithms and are committed to growing an inclusive community of DataSketches contributors and users.
> >> > > > > >
> >> > > > > > === Community ===
> >> > > > > >
> >> > > > > > Yahoo has a long history and active engagement in the Open Source community. Major projects include: Vespa.ai, Bullet, Moloch, Panoptes, Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark, gifshot, fluxible, as well as the creation, contribution and incubation of many Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie, Zookeeper, Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> >> > > > > >
> >> > > > > > Every day, DataSketches is actively used by organizations and institutions around the world for batch and stream processing of data. We believe acceptance will allow us to consolidate existing DataSketches-related work, grow the DataSketches community, and deepen connections between DataSketches and other open source projects.
> >> > > > > >
> >> > > > > > === Introduction to the Core Developers & Contributors ===
> >> > > > > >
> >> > > > > > The core developers and contributors for DataSketches are from diverse backgrounds, but primarily are scientists that love engineering and engineers that love science. A large part of the value we bring comes from this synthesis. These individuals have already contributed substantially to the code, algorithms, and/or mathematical proofs that form the basis of the library.
> >> > > > > >
> >> > > > > > This core group also forms the Initial Committers with write permissions to the repository. Those marked with (*) meet weekly to plan the research and engineering direction of the project.
> >> > > > > >
> >> > > > > > ==== Scientists That Love Engineering ====
> >> > > > > >
> >> > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests: distributed systems, scalable systems and platforms for big data processing, concurrent algorithms and data structures.
> >> > > > > >
> >> > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs, Sunnyvale, California.
> >> > > > > > Interests: algorithms, theoretical and applied mathematics, encoding and compression theory, theoretical and applied performance optimization.
> >> > > > > >
> >> > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo Alto, California. Manages the algorithms group at Amazon AI. The group builds scalable machine learning systems and algorithms which are used both internally and externally by customers of SageMaker, AWS's flagship machine learning platform.
> >> > > > > >
> >> > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests: computational advertising, machine learning, speech recognition, data-driven analysis, large scale experimentation, big data, stream/complex event processing.
> >> > > > > >
> >> > > > > > * Justin Thaler: (*) Assistant Professor, Department of Computer Science, Georgetown University, Washington D.C. Interests: algorithms and computational complexity, complexity theory, quantum algorithms, private data analysis, learning theory, and developing efficient streaming and sketching algorithms.
> >> > > > > >
> >> > > > > > ==== Engineers That Love Science ====
> >> > > > > >
> >> > > > > > * Roman Leventov: Senior Software Engineer, Metamarkets / Snap.
> >> > > > > > Interests: design and implementation of data storing and data processing (distributed) systems, performance optimization, CPU performance, mechanical sympathy, JVM performance, API design, databases, (concurrent) data structures, memory management, garbage collection algorithms, language design and runtimes (their tradeoffs), distributed systems (cloud) efficiency, Linux, code quality, code transformation, pure functional programming models, Haskell.
> >> > > > > >
> >> > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer and founder of the DataSketches project, Yahoo, Sunnyvale, California. Interests: streaming algorithms, mathematics, computer science, high quality and high performance code for the analysis of massive data, bridging the divide between theory and practice.
> >> > > > > >
> >> > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale, California. Interests: applied mathematics, computer science, big data, distributed systems.
> >> > > > > >
> >> > > > > > === Introduction to Additional Interested Contributors ===
> >> > > > > >
> >> > > > > > These folks have been intermittently involved and contributed, but are strong supporters of this project.
> >> > > > > >
> >> > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> >> > > > > >
> >> > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer Science, Univ of Utah. Interests: Machine Learning, Data Mining, matrix approximation, streaming algorithms, randomized linear algebra.

* Christopher Musco: [christopher.musco at gmail dot com] Ph.D. Computer
Science, Research Instructor, Princeton University. Interests:
algorithmic foundations of data science and machine learning, efficient
methods for processing and understanding large datasets, often working at
the intersection of theoretical computer science, numerical linear
algebra, and optimization.

* Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer
Science, Professor, Warwick University, Warwick, England. Interests: all
aspects of the "data lifecycle", from data collection and cleaning,
through mining and analytics. (Professor Cormode is one of the world’s
leading scientists in sketching algorithms.)

=== Alignment ===

The DataSketches library already provides integrations and example code
for Apache Hive, Apache Pig, and Apache Spark, and is deeply integrated
into Apache Druid.

== Known Risks ==

The following subsections cover specific risks, identified by the ASF,
that need to be addressed.

=== Risk: Orphaned Products ===

The DataSketches library is presently used by a number of organizations,
from small startups to Fortune 100 companies, to construct production
pipelines that must process and analyze massive data.
Yahoo has a long-term commitment to continue to advance the DataSketches
library; moreover, DataSketches is seeing increasing interest,
development, and adoption from many diverse organizations around the
world. Due to its growing adoption, we feel it is quite unlikely that
this project would become orphaned.

=== Risk: Inexperience with Open Source ===

Yahoo believes strongly in open source and the exchange of information to
advance new ideas and work. Examples of this commitment are active open
source projects such as those mentioned above. With DataSketches, we have
been increasingly open and forward-looking; we have published a number of
papers about breakthrough developments in the science of streaming
algorithms (mentioned above) that also reference the DataSketches
library. Our submission to the Apache Software Foundation is a logical
extension of our commitment to open source software.

Key committers at Yahoo with strong open source backgrounds include Aaron
Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky, Andrews
Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call, Daryn
Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar Hillel,
Ethan Li, Fei Deng, Francis Christopher Liu, Francisco Perez-Sorrosal,
Gil Yehuda,
Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James Penick, Jason
Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles, Kihwal Lee,
Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski, Mithun
Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L. Natkovich, Parth
Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo, Ryan
Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan, Sri
Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.

All of our core developers are committed to learning about the Apache
process and to giving back to the community.

=== Risk: Homogeneous Developers ===

The majority of committers in this proposal belong to Yahoo because
DataSketches emerged from an internal Yahoo project. This proposal also
includes developers and contributors from other companies who are
actively involved with other Apache projects, such as Druid. We expect
our entry into incubation will allow us to expand the number of
individuals and organizations participating in DataSketches development.

=== Risk: Reliance on Salaried Developers ===

Because the DataSketches library originated within Yahoo, it has been
developed primarily by salaried Yahoo developers and we expect that to
continue to be the case in the near term.
However, since we released this library as open source, we have had a
number of significant contributions from engineers and scientists outside
of Yahoo. We expect our reliance on Yahoo salaried developers will
decrease over time. Nonetheless, Yahoo is committed to continuing its
strong support of this important project.

=== Risk: Lack of Relationship to other Apache Products ===

DataSketches already directly interoperates with or utilizes several
existing Apache projects.

* Build
  * Apache Maven

* Integrations and adaptors for the following projects naturally have
them as dependencies
  * Apache Hive
  * Apache Pig
  * Apache Druid
  * Apache Spark

* Additional dependencies for the above integrations and adaptors include
  * Apache Hadoop
  * Apache Commons (Math)

There is no other Apache project that we are aware of that duplicates the
functionality of the DataSketches library.

=== Risk: An Excessive Fascination with the Apache Brand ===

With this proposal we are not seeking attention or publicity. Rather, we
firmly believe in the DataSketches library and concept and in the ability
to make the DataSketches library a powerful, yet simple-to-use toolkit
for data processing. While the DataSketches library has been open source,
we believe putting code on GitHub can only go so far.
We see the Apache community, processes, and mission as critical for
ensuring the DataSketches library is truly community-driven, positively
impactful, and innovative open source software. While Yahoo has taken a
number of steps to advance its various open source projects, we believe
the DataSketches library project is a great fit for the Apache Software
Foundation due to its focus on data processing and its relationships to
existing ASF projects.

=== Risk: Cryptography ===

DataSketches does not contain any cryptographic code and is not a
cryptographic product.

== Documentation ==

The following documentation is relevant to this proposal. Relevant
portions of the documentation will be contributed to the Apache
DataSketches project.

* DataSketches website: https://datasketches.github.io

* DataSketches website repository:
https://github.com/DataSketches/DataSketches.github.io

We will need an Apache website for this documentation, similar to:

* https://datasketches.apache.org

== Initial Source ==

The initial source for DataSketches that we will submit to the Apache
Foundation will include a number of repositories currently hosted under
the GitHub.com/datasketches organization:

All github.com/datasketches repositories including:

* Java
  * sketches-core: This repository has the core sketching classes, which
    are leveraged by some of the other repositories. This repository has
    no external dependencies outside of the DataSketches/memory
    repository, Java and TestNG for unit tests. This code is versioned
    and the latest release can be obtained from Maven Central.
  * memory: Low level, high-performance memory data-structure management
    primarily for off-heap.
  * sketches-android: This is a new repository dedicated to sketches
    designed to be run in a mobile client, such as a cell phone. It is
    still in development and should be considered experimental.
  * sketches-hive: This repository contains Hive UDFs and UDAFs for use
    within Hadoop grid environments. This code has dependencies on
    sketches-core as well as Hadoop and Hive. Users of this code are
    advised to use Maven to bring in all the required dependencies.
    This code is versioned and the latest release can be obtained from
    Maven Central.
  * sketches-pig: This repository contains Pig User Defined Functions
    (UDF) for use within Hadoop grid environments. This code has
    dependencies on sketches-core as well as Hadoop and Pig. Users of
    this code are advised to use Maven to bring in all the required
    dependencies. This code is versioned and the latest release can be
    obtained from Maven Central.
  * sketches-vector: This is a new repository dedicated to sketches for
    vector and matrix operations. It is still somewhat experimental.
  * characterization: This relatively new repository is for code that we
    use to characterize the accuracy and speed performance of the
    sketches in the library and is constantly being updated. Examples of
    the job command files used for various tests can be found in the
    src/main/resources directory. Some of these tests can run for hours
    depending on their configuration.
  * experimental: This repository is an experimental staging area for
    code that will eventually end up in another repository. This code is
    not versioned and not registered with Maven Central.
  * sketches-misc: Demos and other code not related to production
    deployment.

* C++ and Python
  * sketches-core-cpp: This is the C++/Python companion to the Java
    sketches-core. These implementations are binary compatible with their
    counterparts in Java.
    In other words, a sketch created and stored in C++ can be opened and
    read in Java and vice versa. This repository also has our Python
    adaptors that basically wrap the C++ implementations, making the
    high-performance C++ implementations available from Python.
  * sketches-postgres: This repository provides the Postgres-specific
    adaptors that wrap the C++ implementations, making them available to
    Postgres database users.
  * characterization-cpp: This is the C++/Python companion to the Java
    characterization repository.
  * experimental-cpp: This repository is an experimental staging area for
    C++ code that will eventually end up in another repository.

* Command-Line Tools
  * sketches-cmd
  * homebrew-sketches
  * homebrew-sketches-cmd

These projects have always been Apache 2.0 licensed. We intend to bundle
all of these repositories since they are all complementary and should be
maintained in one project. Prior to our submission, we will combine all
of these projects into a new git repository.

== Source and Intellectual Property Submission Plan ==

Contributors to the DataSketches project have also signed the Yahoo
Individual Contributor License Agreement
(https://yahoocla.herokuapp.com/) in order to contribute to the project.

With respect to trademark rights, Yahoo does not hold a trademark on the
phrase “DataSketches.” Based on feedback and guidance we receive during
the incubation process, we are open to renaming the project if necessary
for trademark or other concerns, but we would prefer not to have to do
that.

== External Dependencies ==

All external dependencies are licensed under an Apache 2.0 or
Apache-compatible license. As we grow the DataSketches community we will
configure our build process to require and validate that all
contributions and dependencies are licensed under the Apache 2.0 license
or under an Apache-compatible license.

== Required Resources ==

=== Mailing Lists ===

We currently use a mix of mailing lists. We will migrate our existing
mailing lists to the following:

* dev@.apache

* user@.apache

* private@.apache

* commits@.apache

=== Source Control ===

The DataSketches team currently uses Git and would like to continue to do
so. We request a Git repository for DataSketches with mirroring to GitHub
enabled, similar to the following:

* https://github.com/apache/incubator-datasketches.git

=== Issue Tracking ===

We request the creation of an Apache-hosted JIRA.
The DataSketches project is currently using the public GitHub issue
tracker and the public Google Groups forum/sketches-user for issue
tracking and discussions. We will migrate and combine the issues from
these two sources into the Apache JIRA.

Proposed Jira ID: DATASKETCHES

== Initial Committers ==

The following individuals have been extremely active in our community and
should have write (commit) permissions to the repository.

* Eshcar Hillel [eshcar at verizonmedia dot com]

* Kevin Lang [langk at verizonmedia dot com]

* Roman Leventov [roman.leventov at c.metamarkets dot com]

* Edo Liberty [libertye at amazon dot com]

* Jon Malkin [jmalkin at verizonmedia dot com]

* Lee Rhodes [lrhodes at verizonmedia dot com] & [leerho at gmail dot com]

* Alexander Saydakov [saydakov at verizonmedia dot com]

* Justin Thaler [justin.thaler at georgetown dot edu]

== Affiliations ==

The initial committers are from four organizations: Yahoo, Amazon,
Georgetown University, and Metamarkets/Snap.

=== Champion ===
(Recommended to me: )

Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
dot org]
Jean-Baptiste Onofré, [jb at nanthrax dot net]

=== Nominated Mentors ===
(Recommended to me: )

Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache
dot org]
Jean-Baptiste Onofré, [jb at nanthrax dot net]
Gil Yehuda, [gyehuda at verizonmedia dot com]

=== Sponsoring Entity ===

* The Apache Incubator **** This is our 1st choice ****

* Apache Druid. The incubating Apache Druid project might also be a
logical sponsor. However, DataSketches has applications in many areas of
computing outside of Druid, so our preference and recommendation is that
DataSketches would ultimately be a top-level Apache project.

________________
[1] In 2017 Verizon acquired Yahoo and merged it with previously acquired
AOL. The merged entity was originally called Oath, Inc., but has recently
been renamed Verizon Media, Inc., a wholly-owned subsidiary of Verizon,
Inc. Since Yahoo is the more recognized name, references in this document
to Yahoo are also references to Verizon Media, Inc.

On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <kenn@> wrote:

> The subject line has me interested already. Follow examples like this
> maybe?
>
> 1.
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> 2.
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
>
> Kenn
>
> On Fri, Feb 22, 2019 at 8:05 PM leerho <leerho@> wrote:
>
> > I'll try again ... :)
> >
> > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunning@> wrote:
> >
> > > It didn't make it again
> > >
> > > On Fri, Feb 22, 2019, 8:35 PM leerho <leerho@> wrote:
> > >
> > > > I'm not sure the attached document made it through.
> > > >
> > > > On Fri, Feb 22, 2019 at 7:28 PM leerho <leerho@> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@.apache
For additional commands, e-mail: general-help@.apache
> --
> Sent from: http://apache-incubator-general.996316.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org