Woot! +1 for druid incubation.
-h

On Fri, Feb 16, 2018 at 12:15 PM, Gian Merlino <g...@apache.org> wrote:
> Hi all,
>
> I would like to open up a discussion about incubating Druid at Apache. I've included a proposal in this mail and have also posted a draft at https://wiki.apache.org/incubator/DruidProposal. More information about Druid is also available on our project web site at: http://druid.io/
>
> Thanks for your consideration!
>
> Gian
>
> = Druid Proposal =
>
> == Abstract ==
>
> Druid is a high-performance, column-oriented, distributed data store.
>
> == Proposal ==
>
> Druid is an open source data store designed for real-time exploratory analytics on large data sets. Druid's key features are a column-oriented storage layout, a distributed shared-nothing architecture, and the ability to generate and leverage indexing and caching structures. Druid is typically deployed in clusters of tens to hundreds of nodes, and has the ability to load data from Apache Kafka and Apache Hadoop, among other data sources. Druid offers two query languages: a SQL dialect (powered by Apache Calcite) and a JSON-over-HTTP API.
>
> Druid was originally developed to power a slice-and-dice analytical UI built on top of large event streams. The original use case for Druid targeted ingest rates of millions of records/sec, retention of over a year of data, and query latencies of sub-second to a few seconds. Many people can benefit from such capability, and many already have (see http://druid.io/druid-powered.html). In addition, new use cases have emerged since Druid's original development, such as OLAP acceleration of data warehouse tables and more highly concurrent applications operating with relatively narrower queries.
>
> == Background ==
>
> Druid is a data store designed for fast analytics. It would typically be used in lieu of more general purpose query systems like Hadoop !MapReduce or Spark when query latency is of the utmost importance.
> Druid is often used as a data store for powering GUI analytical applications.
>
> The buzzwordy description of Druid is a high-performance, column-oriented, distributed data store. What we mean by this is:
>
> * "high performance": Druid aims to provide the lowest query latency and highest ingest rates possible.
> * "column-oriented": Druid stores data in a column-oriented format, like most other systems designed for analytics. It can also store indexes along with the columns.
> * "distributed": Druid is deployed in clusters, typically of tens to hundreds of nodes.
> * "data store": Druid loads your data and stores a copy of it on the cluster's local disks (and may cache it in memory). It doesn't query your data from some other storage system.
>
> == Rationale ==
>
> Druid is a mature, active project with a large number of production installations, dozens of contributors to each release, and multiple vendors offering professional support. Given Druid's strong community, its close integration with many other Apache projects (such as Kafka, Hadoop, and Calcite), and its pre-existing Apache-inspired governance structure, we feel that Apache is the best home for the project on a long-term basis.
>
> == Current Status ==
>
> === Meritocracy ===
> Since Druid was first open sourced, the original developers have solicited contributions from others, including through our blog, the project mailing lists, and through accepting !GitHub pull requests. We have an Apache-inspired governance structure with a PMC and committers, and our committer ranks include a good number of people from outside the original development team.
>
> === Community ===
>
> The Druid core developers have sought to nurture a community throughout the life of the project. We use !GitHub as the focal point for bug reports and code contributions, and the mailing lists for most other discussion.
> To try to make people feel welcome, we've also spelled this out on a "CONTRIBUTE" link from the project page: http://druid.io/community/. Today we have an active contributor base (a typical release has ~40 contributors) and mailing list.
>
> === Core Developers ===
>
> Druid enjoys good diversity of committer affiliation. The most active developers over the past year are affiliated with four different companies: Imply, Metamarkets, Yahoo, and Hortonworks. Many Druid committers are also committers on other ASF projects, including Apache Airflow, Apache Curator, and Apache Calcite. The original developers of Druid remain involved in the project.
>
> === Alignment ===
>
> Druid's current governance structure is Apache-inspired with a PMC and committers chosen by a meritocratic process. Additionally, Druid integrates with a number of other Apache projects, including Kafka, Hadoop, Hive, Calcite, Superset (incubating), Spark, Curator, and !ZooKeeper.
>
> == Known Risks ==
>
> === Orphaned products ===
>
> The risk of Druid becoming orphaned is low, due to a diverse committer base that is invested in the future of the project.
>
> === Inexperience with Open Source ===
>
> Druid's core developers have been running it as a community-oriented open source project for some time now, and many of them are committers on other open source projects as well, including Apache Airflow, Apache Curator, and Apache Calcite.
>
> === Homogenous Developers ===
>
> Druid's current diversity of committer affiliation means that we have become accustomed to working collaboratively and in the open. We hope that a transition to the ASF helps Druid's contributor base become even more diverse.
>
> === Reliance on Salaried Developers ===
>
> Druid's user base and contributor base skew heavily towards salaried developers.
> We believe this is natural since Druid is a technology designed to be deployed on large clusters and therefore tends to be deployed by organizations rather than by individuals. Nevertheless, many current Druid developers have continued working on the project even through job changes, which we take to be a good sign of developer commitment and personal interest.
>
> === Relationships with Other Apache Products ===
>
> Druid integrates with a number of other Apache projects. Druid internally uses Calcite for SQL planning, and Curator and !ZooKeeper for coordination. Druid can read data in Avro or Parquet format. Druid can load data from streams in Kafka or from files in Hadoop. Druid integrates with Hive as an option for SQL query acceleration. Druid data can be visualized by Superset (incubating).
>
> === An Excessive Fascination with the Apache Brand ===
>
> Druid is a successful project with a diverse community. The main reason for pursuing incubation is to find a stable, long-term home for the project with a well-known governance philosophy.
>
> == Required Resources ==
>
> === Mailing lists ===
>
> We would like to migrate the existing Druid mailing lists from Google Groups to Apache.
>
> * druid-user@googlegroups -> us...@druid.incubator.apache.org
> * druid-development@googlegroups -> d...@druid.incubator.apache.org
>
> === Source control ===
>
> Druid development currently takes place on !GitHub. We would like to continue using !GitHub, if possible, in order to preserve the workflows the community has developed around !GitHub pull requests.
>
> === Issue tracking ===
> Druid currently uses !GitHub issues for issue tracking. We would like to migrate to Apache JIRA at http://issues.apache.org/jira/browse/DRUID.
>
> == Documentation ==
>
> Druid's documentation can be found at http://druid.io/docs/latest/.
>
> == Initial Source ==
>
> Druid was initially open-sourced by Metamarkets in 2012 and has been run in a community-governed fashion since then. The code is currently hosted at https://github.com/druid-io/ and includes the following repositories:
>
> * druid (primary repository)
> * druid-console (web console for Druid)
> * druid-io.github.io (source for Druid's website at http://druid.io/)
> * tranquility (realtime stream push client for Druid)
> * docker-druid (Docker image for Druid)
> * pydruid (Python library)
> * RDruid (R library)
> * oss-parent (Maven POM files)
>
> == Source and Intellectual Property Submission Plan ==
>
> A complete set of the open source code needs to be licensed from the owning organization to the Foundation. Commercial legal counsel for the owning organization will review the standard Foundation licensing paperwork and propose any updates as needed. This license will enable Apache to incubate and manage the Druid project moving forward.
>
> Other Druid paraphernalia to be transferred to Apache consists of:
>
> * !GitHub organization at https://github.com/druid-io/
> * Twitter account at https://twitter.com/druidio
> * "druid.io" domain name
> * "Druid" trademark assignment per the Foundation's standard paperwork. The trademark assignment paperwork shall be reviewed by the owning organization's commercial and IP counsel
> * CLAs - all rights in the code licensed above should encompass the CLAs that existed between developers and owning organization
>
> A copyright license to the code, trademark assignment of Druid, and transfer of other paraphernalia to Apache should be sufficient to cover all rights required by Apache to operate the project.
>
> == External Dependencies ==
> External dependencies distributed with Druid currently all have one of the following Category A or B licenses: ASL, BSD, CDDL, EPL, MIT, MPL; with one exception: the optional Druid MySQL metadata store extension depends on MySQL Connector/J, which is GPL licensed. Druid currently packages this as a separate download; see our current presentation on: http://druid.io/downloads.html. As part of incubation we intend to determine the best strategy for handling the MySQL extension.
>
> == Cryptography ==
> Not applicable.
>
> == Initial Committers ==
>
> The initial committers for incubation are the current set of committers on Druid who have expressed interest in being involved in Apache incubation. Affiliations are listed where relevant. We may seek to add other committers during incubation; for example, we would want to add any current Druid committers who express an interest after incubation begins.
>
> * Charles Allen (char...@allen-net.com) (Snap)
> * David Lim (david.clarence....@gmail.com) (Imply)
> * Eric Tschetter (ched...@apache.org) (Splunk)
> * Fangjin Yang (f...@imply.io) (Imply)
> * Gian Merlino (g...@apache.org) (Imply)
> * Himanshu Gupta (g.himan...@gmail.com) (Oath)
> * Jihoon Son (jihoon...@apache.org) (Imply)
> * Jonathan Wei (jon....@imply.io) (Imply)
> * Maxime Beauchemin (maximebeauche...@gmail.com) (Lyft)
> * Mohamed Slim Bouguerra (slim.bougue...@gmail.com) (Hortonworks)
> * Nishant Bangarwa (nish...@apache.org) (Hortonworks)
> * Parag Jain (paragjai...@gmail.com) (Oath)
> * Roman Leventov (leventov...@gmail.com) (Metamarkets)
> * Xavier Léauté (xav...@leaute.com) (Confluent)
>
> == Sponsors ==
>
> * Champion: Julian Hyde
> * Nominated mentors: Julian Hyde, P. Taylor Goetz, Jun Rao
> * Sponsoring entity: Apache Incubator