Thanks for the offer. I am a neophyte at this process and email app! I could use a lot of help getting this off the ground! Also, I'm not sure that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
Lee.

On 2019/02/23 06:03:58, Kenneth Knowles <k...@apache.org> wrote:
> Nice.
>
> I would very much like to help mentor this project, though you already have
> a couple good ones.
>
> I concur with incubator as sponsoring entity.
>
> Kenn (VP Apache Beam)
>
> On Fri, Feb 22, 2019 at 9:45 PM leerho <lee...@gmail.com> wrote:
>
> > I didn't realize that this mail list does not accept PDF files, apparently
> > only text. So let me try one more time ... :) Please let me know if
> > this works!
> >
> >
> > = Apache DataSketches Proposal[1] =
> >
> > == Abstract ==
> >
> > DataSketches.GitHub.io is an open source, high-performance library of
> > stochastic streaming algorithms commonly called "sketches" in the data
> > sciences. Sketches are small, stateful programs that process massive data
> > as a stream and can provide approximate answers, with mathematical
> > guarantees, to computationally difficult queries orders-of-magnitude faster
> > than traditional, exact methods.
> >
> > This proposal is to move DataSketches to the Apache Software
> > Foundation (ASF), transferring ownership of its copyright intellectual
> > property to the ASF. Thereafter, DataSketches would be officially known as
> > Apache DataSketches and its evolution and governance would come under the
> > rules and guidance of the ASF.
> >
> > == Introduction ==
> >
> > The DataSketches library contains carefully crafted implementations of
> > sketch algorithms that meet rigorous standards of quality and performance
> > and provide capabilities required for large-scale production systems that
> > must process and analyze massive data. The DataSketches core repository is
> > written in Java with a parallel core repository written in C++ that
> > includes Python wrappers. The DataSketches library also includes special
> > repositories for extending the core library for Apache Hive and Apache Pig.
> > The sketches developed in the different languages share a common binary
> > storage format so that sketches created and stored in Java, for example,
> > can be fully used in C++, and vice versa. Because the stored sketch
> > "images" are just a "blob" of bytes (similar to picture images), they can
> > be shared across many different systems, languages and platforms.
> >
> > The DataSketches documentation website, https://datasketches.github.io ,
> > includes general tutorials, a comprehensive research section with
> > references to relevant academic papers, extensive examples for using the
> > core library directly as well as examples for accessing the library in
> > Hive, Pig, and Apache Spark.
> >
> > The DataSketches library also includes a characterization repository for
> > long running test programs that are used for studying accuracy and
> > performance of these sketches over wide ranges of input variables. The data
> > produced by these programs is used for generating the many performance
> > plots contained in the documentation website and for academic
> > publications.
> >
> > The code repositories used for production are versioned and published to
> > Maven Central at periodic intervals as the library evolves.
> >
> > The DataSketches library also includes several experimental repositories
> > for use-cases outside the large-scale systems environments, such as
> > sketches for mobile, IoT devices (Android), command-line access of the
> > sketch library, and an experimental repository for vector-based sketches
> > that performs approximate Singular Value Decomposition (SVD) analysis that
> > could potentially be used in Machine Learning (ML) applications.
> >
> > == Background ==
> >
> > The DataSketches library was started in 2012 as an internal Yahoo project to
> > dramatically reduce time and resources required for distinct (unique)
> > counting.
> > An extensive search on the Internet at the time yielded a number
> > of theoretical papers on stochastic streaming algorithms with pseudocode
> > examples, but we did not find any usable open-source code of the quality we
> > felt we needed for our internal production systems. So we started a small
> > project (one person) to develop our own sketches working directly from
> > published theoretical papers.
> >
> > The DataSketches library was designed from the start with the objective of
> > making these algorithms, usually only described in theoretical papers,
> > easily accessible to systems developers for use in our internal production
> > systems. By necessity, the code had to be of the highest quality and
> > thoroughly tested. The wide variety of our internal production systems
> > drove the requirement that the sketch implementations had to have an
> > absolute minimum of external, run-time dependencies in order to simplify
> > integration and troubleshooting.
> >
> > Our internal experiments demonstrated dramatic positive impact on the
> > performance of our systems. As a result, the DataSketches library quickly
> > evolved to include different types of sketches for different types of
> > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > quantile/histogram algorithms, and weighted and unweighted sampling
> > algorithms.
> >
> > We quickly discovered that developing these sketch algorithms to be truly
> > robust in production environments is quite difficult and requires deep
> > understanding of the underlying mathematics and statistics as well as
> > extensive experience in developing high quality code for 24/7 production
> > systems. This is a difficult combination of skills for any one organization
> > to collect and maintain over time. It became clear that this technology
> > needed a community larger than Yahoo to evolve.
> > In November 2015, this
> > factor, along with Yahoo’s strong experience with and support of open source,
> > led to the decision to open source this technology under an Apache 2.0
> > license on GitHub. Since that time our community has expanded considerably
> > and the key contributors to this effort include leading research
> > scientists from a number of universities as well as practitioners and
> > researchers from a number of major corporations. The core of this group is
> > very active as we meet weekly to discuss research directions and
> > engineering priorities.
> >
> > It is important to note that our internal systems at Yahoo use the current
> > public GitHub open source DataSketches library and not an internal version
> > of the code.
> >
> > The close collaboration of scientific research and engineering development
> > experience with actual massive-data processing systems has also produced
> > new research publications in the field of stochastic streaming algorithms,
> > for example:
> >
> > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes, and
> > Justin Thaler. A high-performance algorithm for identifying frequent items
> > in data streams. In ACM IMC 2017.
> >
> > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > framework for estimating stream expression cardinalities. In EDBT/ICDT
> > Proceedings ’16, pages 6:1–6:17, 2016.
> >
> > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings ’16,
> > pages 845–854, 2016.
> >
> > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > approximation in streams. In IEEE FOCS Proceedings ’16, pages 71–78, 2016.
> >
> > * Kevin J. Lang. Back to the future: an even more nearly optimal cardinality
> > estimation algorithm. arXiv preprint https://arxiv.org/abs/1708.06839,
> > 2017.
> >
> > * Edo Liberty. Simple and deterministic matrix sketching.
> > In ACM KDD
> > Proceedings ’13, pages 581–588, 2013.
> >
> > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman.
> > Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings
> > ’16, pages 441–454, 2016.
> >
> > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical
> > heavy hitters with the space saving algorithm. In SIAM ALENEX Proceedings
> > ’12, pages 160–174, 2012.
> >
> > == The Rationale for Sketches ==
> >
> > In the analysis of big data there are often problem queries that don’t
> > scale because they require huge compute resources and time to generate
> > exact results. Examples include count distinct, quantiles, most frequent
> > items, joins, matrix computations, and graph analysis.
> >
> > If we can loosen the requirement of “exact” results from our queries and be
> > satisfied with approximate results, within some well understood bounds of
> > error, there is an entire branch of mathematics and data science that has
> > evolved around developing algorithms that can produce approximate results
> > with mathematically well-defined error properties.
> >
> > Adding the requirements that these algorithms must be small
> > (compared to the size of the input data), sublinear (the size of the sketch
> > must grow at a slower rate than the size of the input stream), streaming
> > (they can only touch each data item once), and mergeable (suitable for
> > distributed processing) defines a class of algorithms that can be
> > described as small, stochastic, streaming, sublinear mergeable algorithms,
> > commonly called sketches (they also have other names, but we will use the
> > term sketches from here on).
> >
> > To be truly streaming and be able to process data in a single pass,
> > sketches must make absolute minimum assumptions about the input stream.
> > This is critically important, as there is no “second chance” to process the
> > data.
> >
> > For example, sketches should not make assumptions about the order of stream
> > items, the stream length, the dynamic range of values, or the distribution
> > of item occurrence frequencies. Sketches should be tolerant of NaNs, Nulls
> > and empty objects. About the only thing that the sketch needs to know about
> > the stream is how to extract items from it and what type the item is, e.g.,
> > is it a numeric value or a string.
> >
> > As far as the sketch is concerned, the input stream is a sequence of items
> > in some unknown random order with unknown random values.
> >
> > The sketch is essentially a complex state machine and, combined with the
> > random input stream, defines a stochastic process. We then apply
> > probabilistic methods to interpret the states of the stochastic process in
> > order to extract useful information about the input stream itself. The
> > resulting information will be approximate, but we also use additional
> > probabilistic methods to extract an estimate of the likely probability
> > distribution of error.
> >
> > There is a significant scientific contribution here in defining the
> > state machine, understanding the resulting stochastic process, developing
> > the probabilistic methods, and proving mathematically that it all works!
> > This is why the scientific contributors to this project are a critical and
> > strategic component of our success. The development engineers translate
> > the concepts of the proposed state machine and probabilistic methods into
> > production-quality code. Even more important, they work closely with the
> > scientists, feeding back system and user requirements, which leads not only
> > to superior product design, but to new science as well. A number of
> > scientific papers our members have published (see above) are a direct result
> > of this close collaboration.
> >
> > Because sketches are small they can be processed extremely fast, often many
> > orders-of-magnitude faster than traditional exact computations. For
> > interactive queries there may not be other viable alternatives, and in the
> > case of real-time analysis, sketches are the only known solution.
> >
> > For any system that needs to extract useful information from massive data,
> > sketches are essential tools that should be tightly integrated into the
> > system’s analysis capabilities. This technology has helped Yahoo
> > successfully reduce data processing times from days to hours or minutes on
> > a number of its internal platforms and has enabled subsecond queries on
> > real-time platforms that would have been infeasible without sketches.
> >
> > == The Rationale for Apache DataSketches ==
> >
> > Other open source implementations of sketch algorithms can be found on the
> > Internet. However, we have not yet found any open source implementations
> > that are as comprehensive, engineered with the quality required for
> > production systems, and with usable and guaranteed error properties. Large
> > Internet companies, such as Google and Facebook, have published papers on
> > sketching; however, their implementations of their published algorithms are
> > proprietary and not available as open source.
> >
> > The DataSketches library already provides integrations with a number of
> > major Apache data processing platforms such as Apache Hive, Apache Pig,
> > Apache Spark and Apache Druid, and is also integrated with a number of
> > other open source data processing platforms such as Splice Machine, GCHQ
> > Gaffer and PostgreSQL.
> >
> > We believe that having DataSketches as an Apache project will provide an
> > immediate, worthwhile, and substantial contribution to the open source
> > community, will give the project a better opportunity to contribute
> > meaningfully to both the science and engineering of sketching algorithms,
> > and will help it integrate with other Apache projects. In addition, this is a
> > significant opportunity for Apache to be the "go-to" destination for users
> > that want to leverage this exciting technology.
> >
> > == Initial Goals ==
> >
> > We are breaking our initial goals into short-term (2-6 months) and
> > intermediate to long-term (6 months to 2 years):
> >
> > Our short-term goals include:
> >
> > * Understanding and adapting to the Apache development process and
> > structures.
> >
> > * Starting to refactor the codebase and moving the code of the various
> > DataSketches repositories to an Apache Git repository.
> >
> > * Continuing development of new features, functions, and fixes.
> >
> > * Continuing to develop and expand specific sub-projects (e.g., C++ and
> > Python).
> >
> >
> > The intermediate to long-term goals include:
> >
> > * Completing the design and implementation of the C++ sketches to
> > complement what is already available in Java, and the Python wrappers of
> > those C++ sketches.
> >
> > * Expanding the C++ build framework to include Windows and the popular
> > Linux variants.
> >
> > * Continuing to engage with the scientific research community on the
> > development of new algorithms for computationally difficult problems that
> > heretofore have not had a sketching solution.
> >
> > == Current Status ==
> >
> > The DataSketches GitHub project has been quite successful. As of this
> > writing (February 2019) the number of downloads measured by the Nexus
> > Repository Manager at https://oss.sonatype.org has grown by nearly a
> > factor of 10 over the past year to about 55 thousand per month.
> > The
> > DataSketches/sketches-core repository has about 560 stars and 141 forks,
> > which is pretty good for a highly specialized library.
> >
> > === Development Practices ===
> >
> > ==== Source Control ====
> >
> > All of our developers have extensive experience with Git version control
> > and follow accepted practices for use of Pull Requests (PRs), code reviews
> > and commits to master, for example.
> >
> > ==== Testing ====
> >
> > Sketches, by their nature, are probabilistic programs and don’t necessarily
> > behave deterministically. For some of the sketches we intentionally insert
> > random noise into the code as this gives us the mathematical properties
> > that we need to guarantee accuracy. This can make the behavior of these
> > algorithms quite unintuitive and provides significant challenges to the
> > developer who wishes to test these algorithms for correctness. As a result,
> > our testing strategy includes two major components: unit tests and
> > characterization tests.
> >
> > ===== Unit Testing =====
> >
> > Our unit tests are primarily quick tests to make sure that we exercise all
> > critical paths in the code and that key branches are executed correctly. It
> > is important that they execute relatively fast as they are generally run on
> > every code build. The sketches-core repository alone has about 22 thousand
> > statements, over 1300 unit tests and code coverage of about 98.2% as
> > measured by Atlassian/Clover. Our goal is for all of our code
> > repositories that are used in production to have code coverage
> > greater than 90%.
> >
> > ===== Characterization Testing =====
> >
> > In order to test the probabilistic methods that are used to interpret the
> > stochastic behaviors of our sketches we have a separate characterization
> > repository that is dedicated to this. Measuring accuracy, for example,
> > requires running thousands of trials at each of many different points along
> > the domain axis.
> > Each trial compares its estimated results against a known
> > exact result, producing an error for that trial. These error measurements
> > are then fed into our Quantiles sketch to capture the actual distribution
> > of error at that point along the axis. We then select quantile contours
> > across all the distributions at points along the axis. These contours can
> > then be plotted to reveal the shape of the actual error distribution. These
> > distributions are not at all Gaussian; in fact, they can be quite complex.
> > Nonetheless, these distributions are then checked against the statistical
> > guarantees inherent to the specific sketch algorithm and its parameters.
> > There are many examples of these characterization error distributions on
> > our website. The runtimes of these tests can be very long and can range
> > from many minutes to hours, and some can run for days. Currently, we have
> > separate characterization repositories for Java and C++ / Python.
> >
> > It is our goal that we perform this characterization analysis for all of
> > our sketches. By definition, the code that runs these characterization
> > tests is open-source so others can run these tests as well. We do not have
> > formal releases of this code (because it is not production code) and it is
> > not published to Maven Central.
> >
> > === Meritocracy ===
> >
> > DataSketches was initially developed based on requirements within Yahoo. As
> > a project on GitHub, DataSketches has received contributions from numerous
> > individual developers from around the world, dedicated research work from
> > senior scientists at Amazon and Visa, and academic researchers from
> > Georgetown University, Princeton, and MIT.
> >
> > As a project under incubation, we are committed to expanding our effort to
> > build an environment which supports a meritocracy. We are focused on
> > engaging the community and other related projects for support and
> > contributions.
> > Moreover, we are committed to ensuring that contributors and
> > committers to DataSketches come from a broad mix of organizations through a
> > merit-based decision process during incubation. We believe strongly in the
> > DataSketches premise that fulfills the concept of a well engineered and
> > scientifically rigorous library that implements these powerful algorithms
> > and are committed to growing an inclusive community of DataSketches
> > contributors and users.
> >
> > === Community ===
> >
> > Yahoo has a long history and active engagement in the Open Source
> > community. Major projects include: Vespa.ai, Bullet, Moloch, Panoptes,
> > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark, gifshot,
> > fluxible, as well as the creation, contribution and incubation of many
> > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie, Zookeeper,
> > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> >
> > Every day, DataSketches is actively used by organizations and
> > institutions around the world for batch and stream processing of data. We
> > believe acceptance will allow us to consolidate existing
> > DataSketches-related work, grow the DataSketches community, and deepen
> > connections between DataSketches and other open source projects.
> >
> > === Introduction to the Core Developers & Contributors ===
> >
> > The core developers and contributors for DataSketches are from diverse
> > backgrounds, but primarily are scientists that love engineering and
> > engineers that love science. A large part of the value we bring comes from
> > this synthesis. These individuals have already contributed substantially
> > to the code, algorithms, and/or mathematical proofs that form the basis of
> > the library.
> >
> > This core group also forms the Initial Committers with write permissions to
> > the repository. Those marked with (*) meet weekly to plan the research and
> > engineering direction of the project.
> >
> > ==== Scientists That Love Engineering ====
> >
> > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests:
> > distributed systems, scalable systems and platforms for big data
> > processing, concurrent algorithms and data structures.
> >
> > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs, Sunnyvale,
> > California. Interests: algorithms, theoretical and applied mathematics,
> > encoding and compression theory, theoretical and applied performance
> > optimization.
> >
> > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo Alto,
> > California. Manages the algorithms group at Amazon AI. We build scalable
> > machine learning systems and algorithms which are used both internally and
> > externally by customers of SageMaker, AWS's flagship machine learning
> > platform.
> >
> > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests:
> > Computational advertising, machine learning, speech recognition,
> > data-driven analysis, large scale experimentation, big data, stream/complex
> > event processing.
> >
> > * Justin Thaler: (*) Assistant Professor, Department of Computer Science,
> > Georgetown University, Washington D.C. Interests: algorithms and
> > computational complexity, complexity theory, quantum algorithms, private
> > data analysis, and learning theory, developing efficient streaming and
> > sketching algorithms.
> >
> > ==== Engineers That Love Science ====
> >
> > * Roman Leventov: Senior Software Engineer, Metamarkets / Snap.
> > Interests:
> > design and implementation of data storing and data processing (distributed)
> > systems, performance optimization, CPU performance, mechanical sympathy,
> > JVM performance, API design, databases, (concurrent) data structures,
> > memory management, garbage collection algorithms, language design and
> > runtimes (their tradeoffs), distributed systems (cloud) efficiency, Linux,
> > code quality, code transformation, pure functional programming models,
> > Haskell.
> >
> > * Lee Rhodes: (*) Distinguished Architect, lead developer and founder of
> > the DataSketches project, Yahoo, Sunnyvale, California. Interests:
> > streaming algorithms, mathematics, computer science, high quality and high
> > performance code for the analysis of massive data, bridging the divide
> > between theory and practice.
> >
> > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale,
> > California. Interests: applied mathematics, computer science, big data,
> > distributed systems.
> >
> > === Introduction to Additional Interested Contributors ===
> >
> > These folks have been intermittently involved and contributed, but are
> > strong supporters of this project.
> >
> > * Frank Grimes: GitHub ID: frankgrimes97
> >
> > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer Science,
> > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > approximation, streaming algorithms, randomized linear algebra.
> >
> > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D. Computer
> > Science, Research Instructor, Princeton University. Interests: algorithmic
> > foundations of data science and machine learning, efficient methods for
> > processing and understanding large datasets, often working at the
> > intersection of theoretical computer science, numerical linear algebra, and
> > optimization.
> >
> > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer Science,
> > Professor, Warwick University, Warwick, England.
> > Interests: all aspects of
> > the "data lifecycle", from data collection and cleaning, through mining and
> > analytics. (Professor Cormode is one of the world’s leading scientists in
> > sketching algorithms.)
> >
> > === Alignment ===
> >
> > The DataSketches library already provides integrations and example code for
> > Apache Hive, Apache Pig, and Apache Spark, and is deeply integrated into
> > Apache Druid.
> >
> > == Known Risks ==
> >
> > The following subsections address specific risks that have been identified
> > by the ASF.
> >
> > === Risk: Orphaned Products ===
> >
> > The DataSketches library is presently used by a number of organizations,
> > from small startups to Fortune 100 companies, to construct production
> > pipelines that must process and analyze massive data. Yahoo has a long-term
> > commitment to continue to advance the DataSketches library; moreover,
> > DataSketches is seeing increasing interest, development, and adoption from
> > many diverse organizations from around the world. Due to its growing
> > adoption, we feel it is quite unlikely that this project would become
> > orphaned.
> >
> > === Risk: Inexperience with Open Source ===
> >
> > Yahoo believes strongly in open source and the exchange of information to
> > advance new ideas and work. Examples of this commitment are active open
> > source projects such as those mentioned above. With DataSketches, we have
> > been increasingly open and forward-looking; we have published a number of
> > papers about breakthrough developments in the science of streaming
> > algorithms (mentioned above) that also reference the DataSketches library.
> > Our submission to the Apache Software Foundation is a logical extension of
> > our commitment to open source software.
> >
> > Key committers at Yahoo with strong open source backgrounds include Aaron
> > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky, Andrews
> > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call, Daryn
> > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar Hillel,
> > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco Perez-Sorrosal, Gil
> > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James Penick,
> > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles, Kihwal
> > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski,
> > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L. Natkovich,
> > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo, Ryan
> > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan, Sri
> > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> >
> > All of our core developers are committed to learning about the Apache process
> > and to giving back to the community.
> >
> > === Risk: Homogeneous Developers ===
> >
> > The majority of committers in this proposal belong to Yahoo due to the fact
> > that DataSketches has emerged from an internal Yahoo project. This proposal
> > also includes developers and contributors from other companies who are
> > actively involved with other Apache projects, such as Druid. We expect our
> > entry into incubation will allow us to expand the number of individuals and
> > organizations participating in DataSketches development.
> >
> > === Risk: Reliance on Salaried Developers ===
> >
> > Because the DataSketches library originated within Yahoo, it has been
> > developed primarily by salaried Yahoo developers and we expect that to
> > continue to be the case near term. However, since we placed this library
> > into open-source we have had a number of significant contributions from
> > engineers and scientists from outside of Yahoo.
> > We expect our reliance on
> > Yahoo salaried developers will decrease over time. Nonetheless, Yahoo is
> > committed to continue its strong support of this important project.
> >
> > === Risk: Lack of Relationship to other Apache Products ===
> >
> > DataSketches already directly interoperates with or utilizes several
> > existing Apache projects.
> >
> > * Build
> >   * Apache Maven
> >
> > * Integrations and adaptors for the following projects naturally have them
> > as dependencies
> >   * Apache Hive
> >   * Apache Pig
> >   * Apache Druid
> >   * Apache Spark
> >
> > * Additional dependencies for the above integrations and adaptors include
> >   * Apache Hadoop
> >   * Apache Commons (Math)
> >
> > There is no other Apache project that we are aware of that duplicates the
> > functionality of the DataSketches library.
> >
> > === Risk: An Excessive Fascination with the Apache Brand ===
> >
> > With this proposal we are not seeking attention or publicity. Rather, we
> > firmly believe in the DataSketches library and concept and the ability to
> > make the DataSketches library a powerful, yet simple-to-use toolkit for
> > data processing. While the DataSketches library has been open source, we
> > believe putting code on GitHub can only go so far. We see the Apache
> > community, processes, and mission as critical for ensuring the DataSketches
> > library is truly community-driven, positively impactful, and innovative
> > open source software. While Yahoo has taken a number of steps to advance
> > its various open source projects, we believe the DataSketches library
> > project is a great fit for the Apache Software Foundation due to its focus
> > on data processing and its relationships to existing ASF projects.
> >
> > === Risk: Cryptography ===
> >
> > DataSketches does not contain any cryptographic code and is not a
> > cryptographic product.
> >
> > == Documentation ==
> >
> > The following documentation is relevant to this proposal.
> > Relevant portions
> > of the documentation will be contributed to the Apache DataSketches
> > project.
> >
> > * DataSketches website: https://datasketches.github.io.
> >
> > * DataSketches website repository:
> > https://github.com/DataSketches/DataSketches.github.io
> >
> > We will need an Apache website for this documentation similar to
> >
> > * https://datasketches.apache.org
> >
> > == Initial Source ==
> >
> > The initial source for DataSketches which we will submit to the Apache
> > Foundation will include a number of repositories which are currently hosted
> > under the GitHub.com/datasketches organization:
> >
> > All github.com/datasketches repositories including:
> >
> > * Java
> >   * sketches-core: This repository has the core sketching classes, which
> >     are leveraged by some of the other repositories. This repository has no
> >     external dependencies outside of the DataSketches/memory repository, Java
> >     and TestNG for unit tests. This code is versioned and the latest release
> >     can be obtained from Maven Central.
> >   * memory: Low level, high-performance memory data-structure management
> >     primarily for off-heap.
> >   * sketches-android: This is a new repository dedicated to sketches
> >     designed to be run in a mobile client, such as a cell phone. It is still in
> >     development and should be considered experimental.
> >   * sketches-hive: This repository contains Hive UDFs and UDAFs for use
> >     within Hadoop grid environments. This code has dependencies on
> >     sketches-core as well as Hadoop and Hive. Users of this code are advised to
> >     use Maven to bring in all the required dependencies. This code is versioned
> >     and the latest release can be obtained from Maven Central.
> >   * sketches-pig: This repository contains Pig User Defined Functions
> >     (UDF) for use within Hadoop grid environments. This code has dependencies
> >     on sketches-core as well as Hadoop and Pig.
Users of this code are advised > > to use Maven to bring in all the required dependencies. This code is > > versioned and the latest release can be obtained from Maven Central. > > * sketches-vector: This is a new repository dedicated to sketches for > > vector and matrix operations. It is still somewhat experimental. > > * characterization: This relatively new repository is for code that we > > use to characterize the accuracy and speed performance of the sketches in > > the library and is constantly being updated. Examples of the job command > > files used for various tests can be found in the src/main/resources > > directory. Some of these tests can run for hours depending on their > > configuration. > > * experimental: This repository is an experimental staging area for code > > that will eventually end up in another repository. This code is not > > versioned and not registered with Maven Central. > > * sketches-misc: Demos and other code not related to production > > deployment. > > > > * C++ and Python > > * sketches-core-cpp: This is the C++/Python companion to the Java > > sketches-core. These implementations are binary compatible with their > > counterparts in Java. In other words, a sketch created and stored in C++ > > can be opened and read in Java and vice versa. This site also has our > > Python adaptors that basically wrap the C++ implementations, making the > > high performance C++ implementations available from Python. > > * sketches-postgres: This site provides the postgres-specific adaptors > > that wrap the C++ implementations making them available to the Postgres > > database users. > > * characterization-cpp: This is the C++/Python companion to the Java > > characterization repository. > > * experimental-cpp: This repository is an experimental staging area for > > C++ code that will eventually end up in another repository. 
> > > > * Command-Line Tools > > * sketches-cmd > > * homebrew-sketches > > * homebrew-sketches-cmd > > > > These projects have always been Apache 2.0 licensed. We intend to bundle > > all of these repositories since they are all complementary and should be > > maintained in one project. Prior to our submission, we will combine all of > > these projects into a new git repository. > > > > == Source and Intellectual Property Submission Plan == > > > > Contributors to the DataSketches project have also signed the Yahoo > > Individual Contributor License Agreement (https://yahoocla.herokuapp.com/) > > in order to contribute to the project. > > > > With respect to trademark rights, Yahoo does not hold a trademark on the > > phrase “DataSketches.” Based on feedback and guidance we receive during the > > incubation process, we are open to renaming the project if necessary for > > trademark or other concerns, but we would prefer not to have to do that. > > > > == External Dependencies == > > > > All external dependencies are licensed under an Apache 2.0 or > > Apache-compatible license. As we grow the DataSketches community we will > > configure our build process to require and validate that all contributions and > > dependencies are licensed under the Apache 2.0 license or are under an > > Apache-compatible license. > > > > == Required Resources == > > > > === Mailing Lists === > > > > We currently use a mix of mailing lists. We will migrate our existing > > mailing lists to the following: > > > > * d...@datasketches.incubator.apache.org > > > > * u...@datasketches.incubator.apache.org > > > > * priv...@datasketches.incubator.apache.org > > > > * comm...@datasketches.incubator.apache.org > > > > === Source Control === > > > > The DataSketches team currently uses Git and would like to continue to do > > so. 
We request a Git repository for DataSketches with mirroring to GitHub > > enabled similar to the following: > > > > * https://github.com/apache/incubator-datasketches.git > > > > === Issue Tracking === > > > > We request the creation of an Apache-hosted JIRA. The DataSketches project > > is currently using the public GitHub issue tracker and the public Google > > Groups forum/sketches-user for issue tracking and discussions. We will > > migrate and combine these two sources into the Apache JIRA. > > > > Proposed Jira ID: DATASKETCHES > > > > == Initial Committers == > > > > The following individuals have been extremely active in our > > community and should have write (commit) permissions to the repository. > > > > * Eshcar Hillel [eshcar at verizonmedia dot com] > > > > * Kevin Lang [langk at verizonmedia dot com] > > > > * Roman Leventov [roman.leventov at c.metamarkets dot com] > > > > * Edo Liberty [libertye at amazon dot com] > > > > * Jon Malkin [jmalkin at verizonmedia dot com] > > > > * Lee Rhodes [lrhodes at verizonmedia dot com] & [leerho > > at gmail dot com] > > > > * Alexander Saydakov [saydakov at verizonmedia dot com] > > > > * Justin Thaler [justin.thaler at georgetown dot edu] > > > > == Affiliations == > > > > The initial committers are from four organizations: Yahoo, Amazon, > > Georgetown University, and Metamarkets/Snap. > > > > === Champion === > > (Recommended to me: ) > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache > > dot org] > > Jean-Baptiste Onofré, [jb at nanthrax dot net] > > > > === Nominated Mentors === > > (Recommended to me: ) > > > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at apache > > dot org] > > Jean-Baptiste Onofré, jb at nanthrax dot net > > Gil Yehuda, gyehuda at verizonmedia dot com > > > > === Sponsoring Entity === > > > > * The Apache Incubator **** This is our 1st choice **** > > > > * Apache Druid. 
The incubating Apache Druid project might also be a logical > > sponsor. However, DataSketches has applications in many areas of computing > > outside of Druid so our preference and recommendation is that DataSketches > > would ultimately be a top-level Apache project. > > > > ________________ > > [1] In 2017 Verizon acquired Yahoo and merged it with previously acquired > > AOL. The merged entity was originally called Oath, Inc., but has recently > > been renamed Verizon Media, Inc., a wholly-owned subsidiary of Verizon, > > Inc. Since Yahoo is the more recognized name, references in this document > > to Yahoo are also references to Verizon Media, Inc. > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <k...@apache.org> wrote: > > > > > The subject line has me interested already. Follow examples like this > > > maybe? > > > > > > 1. > > > > > > > > https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E > > > 2. > > > > > > > > https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E > > > > > > Kenn > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <lee...@gmail.com> wrote: > > > > > > > I'll try again ... :) > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunn...@gmail.com> > > > wrote: > > > > > > > >> It didn't make it again > > > >> > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <lee...@gmail.com> wrote: > > > >> > > > >> > I'm not sure the attached document made it through. 
> > > >> > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <lee...@gmail.com> wrote: > > > >> > > > > >> > > > > > >> > > > > > >> > > > > >> > > > > > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > > > > For additional commands, e-mail: general-h...@incubator.apache.org