Discussions have died down, and I think the consensus from the responses is that the issues are 1) the lack of committers and 2) the lack of a champion and mentors. We hope to address #1 and grow the community as part of incubation. Is anyone interested in being a champion or mentor to help us with #2?
Thanks, - Steve On 07/26/2017 04:06 PM, Chris Mattmann wrote: > This sounds like a very interesting project. > > I don’t have the time to mentor at the moment but I will keep a close eye on > it. > > Cheers, > Chris Mattmann > > > > > On 7/25/17, 11:53 AM, "McHenry, Kenton Guadron" <mche...@illinois.edu> wrote: > > Hi Dave, > > The developers that were at NCSA have moved on to other organizations. > While we still leverage Daffodil and are very much interested in seeing it > move forward, development is currently done by the Tresys team. Agreed on > the synergy with Tika. > > Kenton McHenry, Ph.D. > Principal Research Scientist, Adjunct Assistant Professor of Computer > Science > Deputy Director of the Scientific Software & Applications Division > National Center for Supercomputing Applications, University of Illinois > at Urbana-Champaign > > On Jul 24, 2017, at 1:55 PM, Dave Fisher > <dave2w...@comcast.net> wrote: > > Hi Kenton, > > Is there any reason that you and others from the NCSA are not Initial > Committers? That would make this proposal stronger. > > Regarding Apache Tika - it relies on other projects including Apache POI > and Apache PDFBox. They are pragmatic about what is used. If Daffodil works > to expand then I think that there would be good synergy between the projects. > I know as a POI PMC member that the POI community has significantly benefited > from the Tika community, some of whom are from MITRE. > > To date Tika has not emphasized structured data, although they do extract > content from Excel and OpenOffice. > > I am intrigued. > > Regards, > Dave > > On Jul 24, 2017, at 10:55 AM, McHenry, Kenton Guadron > <mche...@illinois.edu> wrote: > > Yes, DFDL and its open source implementation Daffodil are more about file > formats and getting access to the entirety of a file's contents in a > consistent way through machine readable specifications. 
The work has > implications in the area of digital preservation allowing one to preserve > these machine readable specifications rather than all the tools needed to > open/save a file in order to work with it. Imagine someone developing > graphics software to work with 3D models and not having to worry about the > hundreds of formats out there for 3D meshes (whether there are tools for > opening the files and whether they can get access to those tools, whether the > spec is available and worrying about how complex that spec is to implement, > etc.), and simply building their code around the contents (e.g. vertices, > faces, etc.). One could come up with similar scenarios for other data types > (documents, images, videos, audio, depth data, numeric data). Ideally, tools > built supporting DFDL could someday support any format for that type > without the developer having to worry about the details of how that data is > represented within a file. > > Kenton McHenry, Ph.D. > Principal Research Scientist, Adjunct Assistant Professor of Computer > Science > Deputy Director of the Scientific Software & Applications Division > National Center for Supercomputing Applications, University of Illinois > at Urbana-Champaign > > On Jul 24, 2017, at 10:30 AM, Steve Lawrence > <stephen.d.lawre...@gmail.com> > wrote: > > I'll preface this saying that I don't have a ton of experience with > Apache Tika. But based on my understanding, Tika and Daffodil do have > somewhat similar goals, but reach them in different ways. For example, > Tika requires that one writes /code/ to perform data extraction, usually > relying on existing Java libraries to extract the desired metadata. The > downside to this is that code can be buggy, and libraries might not even > exist for formats of interest (especially common with legacy and > military data). > > Daffodil, on the other hand, does not require one to write any code. 
> Instead, one writes a DFDL Schema (similar to XML Schema, with DFDL > annotations) that fully describes the data, which Daffodil then uses to > convert the data to XML/JSON for extraction. So adding support for a new > format means writing a new schema rather than new code. And less code > generally means fewer bugs. Also, for secure systems that require > certification, generally speaking, it is easier to certify a schema as > compared to code. > > We certainly don't believe that Daffodil could replace Tika, but it does > have the potential to add new functionality to Tika for formats that do > not have existing libraries. One of our goals is to look into > integrating Daffodil support into tools like Tika. We'd love to hear > from Tika devs if this is something they'd be interested in. > > I'll also add that whereas Tika tends to focus primarily on metadata, > DFDL schemas usually describe an entire file format down to the byte, so > one can extract more than just metadata, including text and binary > data. Further differentiating, Daffodil has support for serializing data > (called unparse) from the XML/JSON representation, allowing one to > transform or filter data as well. We don't believe this feature is all > that applicable to Tika, but may be useful to other technologies such as > filtering or data fuzzing technologies. > > - Steve > > > On 07/24/2017 10:59 AM, Mike Drob wrote: > What is the relationship between Daffodil and something like Apache Tika's > extraction engine? > > On Mon, Jul 24, 2017 at 9:53 AM, Steve Lawrence < > > stephen.d.lawre...@gmail.com> > wrote: > > Dear Apache Incubator Community, > > We would like to start a discussion around a proposal to bring Daffodil > into the Apache Incubator. Daffodil is an implementation of the DFDL > specification used to convert between fixed format data and XML/JSON. 
> > The draft proposal can be found in the wiki at the following URL: > > https://wiki.apache.org/incubator/DaffodilProposal > > We do not yet have a champion or mentors, but it was recommended that we > create a proposal and send it to this list to potentially find those > that might be interested. The text for the draft proposal is found > below. We look forward to your input. > > Thanks, > -Steve > > > = Daffodil Proposal = > > == Abstract == > > Daffodil is an implementation of the Data Format Description Language > (DFDL) used to convert between fixed format data and XML/JSON. > > == Proposal == > > The Data Format Description Language (DFDL) is a specification, > developed by the Open Grid Forum, capable of describing many data > formats, including both textual and binary, scientific and numeric, > legacy and modern, commercial record-oriented, and many industry and > military standards. It defines a language that is a subset of W3C XML > schema to describe the logical format of the data, and annotations > within the schema to describe the physical representation. > > Daffodil is an open source implementation of the DFDL specification that > uses these DFDL schemas to parse fixed format data into an infoset, > which is most commonly represented as either XML or JSON. This allows > the use of well-established XML or JSON technologies and libraries to > consume, inspect, and manipulate fixed format data in existing > solutions. Daffodil is also capable of the reverse by serializing or > "unparsing" an XML or JSON infoset back to the original data format. > > == Background == > > Many different software solutions need to consume and manage data, > including data directed routing, databases, data analysis, data > cleansing, data visualizing, and more. A key aspect of such solutions is > the need to transform the data into an easily consumable format. 
> Usually, this means that for each unique data format, one develops a > tool that can read and extract the necessary information, often leading > to ad-hoc and data-format-specific description systems. Such systems are > often proprietary, not well tested, and incompatible, leading to vendor > lock-in, flawed software, and increased training costs. DFDL is a new > standard, with version 1.0 completed in October of 2016, that solves > these problems by defining an open standard to describe many different > data formats and how to parse and unparse between the data and XML/JSON. > > Two closed source implementations of DFDL currently exist. The first was > created by IBM and is now part of their IBM® Integration Bus product. > The second was created by the European Space Agency, called DFDL4S or > "DFDL for Space" targeted at the challenges of their satellite data > processing. > > Around 2005, Pacific Northwest National Lab created Defuddle, built as > an open source implementation and proof of concept of the draft DFDL > specification and a test bed to feed new concepts into specification > development. Primary development of Defuddle was eventually taken over > by the National Center for Supercomputing Applications (NCSA). However, > due to evolution of the DFDL specification and architectural and > performance issues with Defuddle, around 2009, NCSA restarted the > project with the new name of Daffodil, with a goal of implementing the > complete DFDL specification. Daffodil development continued at NCSA > until around 2012, at which point development slowed due to budget > limitations. Shortly thereafter, primary development was picked up by > Tresys Technology where it continues today, with contributions from > other entities such as the Navy Research Lab, the Air Force Research > Lab, MITRE, and Booz Allen Hamilton. In February of 2015, Daffodil > version 1.0.0 was released, including support for the DFDL features > needed to parse many common file formats. 
Daffodil version 2.0.0, > expected to be released in August of 2017, will include unparse > support with one-to-one feature parity with parsing. > > Entities including IBM, MITRE, NATO NCI Agency, Northrop-Grumman, Quark > Security, Raytheon, and Tresys Technology have developed DFDL schemas > for many data formats from varying technology domains, including PNG, > GIF, BMP, PCAP, HL7, EDIFACT, NACHA, vCard, iCalendar, and MIL-STD-2045, > many of which are publicly available on the DFDL Schemas GitHub. There > are also a number of military-application data formats, the > specifications of which are not public, which have historically been > very difficult and expensive to process, and for which DFDL schemas have > been created or are actively in development; these include > MIL-STD-6040/USMTF ATO, MIL-STD-6017/VMF, and MIL-STD-6016/NATO STANAG 5516 > (aka "Link16"). > > == Rationale == > > Numerous software solutions exist that consume, inspect, analyze, and > transform data, many of which can be found in the Apache Software > Foundation (ASF). In order for tools like these to consume new types of > data, custom extensions are usually required, often with high > development and testing costs. Daffodil fills a clear gap in many of > these solutions, providing a simple and low-cost way to transform data > to XML or JSON, which many of these tools natively support already. With > the upcoming 2.0.0 release, the Daffodil project will have achieved a > level of functionality in both parse and unparse that, when integrated > into existing solutions, could provide a new method to quickly > enable support for new data formats. > > == Initial Goals == > > * Relicense the existing code from the University of Illinois/NCSA Open > Source License to the Apache License version 2.0, working with Apache > Legal to ensure correctness, and with Daffodil contributors to get > their permission. 
> * Move the existing codebase, documentation, bugs, and mailing lists to > the Apache-hosted infrastructure > * Establish a formal release process and schedule, allowing for > dependable release cycles in a manner consistent with the Apache > development process. > * Build relationships with ASF projects to add Daffodil support where > appropriate > * Grow the community to establish a diversity of background and expertise. > > == Current Status == > > === Meritocracy === > > All initial committers are familiar with the principles of meritocracy. > The Daffodil project has followed the model of meritocracy in the past, > granting multiple outside entities commit access based on the quality > of their contributions. In order to grow the Daffodil user base and > development community, we are dedicated to continuing to operate > Daffodil as a meritocracy. > > A key ingredient in a meritocracy of developers is open group code > review. The Daffodil project has operated in this mode throughout its > existence, and this provides a forum to improve the code, verify code > quality, and educate new developers on the code base. > > === Community === > > Daffodil has a small community of users and developers. Although primary > Daffodil development is done by Tresys Technology, a handful of > contributions have come from other entities including the Navy Research > Lab, the Air Force Research Lab, MITRE, and Booz Allen Hamilton. In > addition to developers, multiple users of Daffodil have created DFDL > schemas, including entities such as MITRE, IBM, Raytheon, Quark > Security, and Tresys Technology. The DFDL Schemas GitHub community has > been created as a place for DFDL schemas to be published. The Daffodil > project also makes use of mailing lists, !HipChat, and Confluence > Questions to build a community of users and a system for support. > > === Core Developers === > > The core developers of Daffodil are employed by Tresys Technology. 
We > will work to grow the community among a more diverse set of developers > and industries. > > === Alignment === > > Daffodil was created as an open source project with a philosophy > consistent with The Apache Way. A strong belief in meritocracy, > community involvement in decisions, openness, and ensuring a high level > of quality in code, documentation, and testing are some of our shared > core beliefs. > > Further, as mentioned in the Rationale section, Daffodil fills a gap > that exists in many ASF projects, including !NiFi, Spark, Storm, Hadoop, > Tika, and others. In order for tools like these to consume new types of > data, custom extensions are usually required. Rather than create such > extensions, Daffodil provides an easy and standards-compliant way to > transform data to XML or JSON, which many of these tools already > natively support. > > == Known Risks == > > === Orphaned Products === > > The current core developers are the leading contributors in the space of > DFDL and wish to see it flourish. Though there is some risk that the > initial committers all come from the same company, a goal of entering > into incubation is to grow the development community to minimize the > risk of reliance on a single company. > > === Inexperience with Open Source === > > The Daffodil project began as an open source project and has continued > that model throughout development. This includes public bug tracking, > git revision control, automated builds and tests, and a public wiki for > documentation. > > Additionally, the current core developers and initial committers all > work for a company that relies on, believes in, promotes, and has led or > contributed to many open source software projects, including SELinux > Userspace, OpenSCAP, CLIP, refpolicy, setools, RPM, and others. As such, > there is low risk related to inexperience with open source software and > processes. 
> > === Homogeneous Developers === > > The proposed initial committers come from a single entity, though we are > committed to growing the Daffodil development community to include a > broad group of additional committers from a wide array of industries. > > === Reliance on Salaried Developers === > > The proposed initial committers are paid by their employer to contribute > to the Daffodil project. We expect that Daffodil development will > continue with salaried developers, and are committed to growing the > community to include non-salaried developers as well. > > === Relationship with other Apache Projects === > > As mentioned in the Alignment section, Daffodil fills a clear gap in > numerous other ASF projects that consume and manage large amounts of data. > > As a specific example, Daffodil developers have created a Daffodil > Apache !NiFi Processor, currently in use in data transfer solutions, > which allows one to ingest non-native data into an Apache !NiFi pipeline > as XML or JSON. This processor was well received by the Apache !NiFi > developers, with positive comments about the concise API and how it > could handle non-native data. Daffodil developers have also successfully > prototyped integration with Apache Spark. We believe Daffodil could > provide a strong benefit to many other ASF projects that handle fixed > format data. We anticipate working closely with such ASF projects to > include Daffodil where applicable to increase their ability to support > new data formats with minimal effort. > > Daffodil also depends on existing ASF projects, including Apache Commons > and Apache Xerces. > > === An Excessive Fascination with the Apache Brand === > > Although the Apache brand may certainly help to attract more > contributors, publicity is not the reason for this proposal. We believe > Daffodil could provide a great benefit to the ASF and the numerous data > focused projects that comprise it, as described in the Rationale and > Alignment sections. 
We hope to build a strong and vibrant community > built around The Apache Way, and not dependent on a single company. > > === Documentation === > > Daffodil documentation can be found at: > > * https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Daffodil%3A+Open+Source+DFDL > > Information about DFDL can be found at: > > * https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl > * https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0.0/com.ibm.etools.mft.doc/df20060_.htm > > Public examples of DFDL Schemas can be found at: > > * https://github.com/DFDLSchemas > > == Initial Source == > > The Daffodil git repo goes back to mid-2011 with approximately 20 > different contributors and feedback from many users and developers. The > core codebase is written in Scala and includes both a Scala and Java > API, along with Javadocs and Scaladocs for API usage. The initial code > will come from the git repository currently hosted by NCSA at the > University of Illinois: > > https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/daffodil/ > > == Source and Intellectual Property Submission == > > The complete Daffodil code is licensed under the University of > Illinois/NCSA Open Source License. Much of the current codebase has been > developed by Tresys Technology, which is open to relicensing the code to > the Apache License version 2.0 and donating the source to the ASF. > Contacts at NCSA are also open to relicensing their contributions to > Apache v2. We plan to contact the other contributors and ask for > permission to relicense and donate their contributed code. For contributors > who decline or whom we cannot contact, their code will be removed or > replaced. We will work closely with Apache Legal to ensure all issues > related to relicensing are resolved. > > == External Dependencies == > > We believe all current dependencies are compatible with the ASF > guidelines. 
Our dependencies use the following license > styles: Apache v2, BSD, MIT, and ICU. The list of current Daffodil > dependencies and their licenses is documented here: > > https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Dependencies+and+Licenses > > == Cryptography == > > None > > == Required Resources == > > === Mailing Lists === > > * comm...@daffodil.incubator.apache.org > * d...@daffodil.incubator.apache.org > * priv...@daffodil.incubator.apache.org > * u...@daffodil.incubator.apache.org > > === Source Control === > > git://git.apache.org/incubator-daffodil.git > > === Issue Tracking === > > JIRA Daffodil (DFDL) > > === Initial Committers === > > * Beth Finnegan <efinnegan at tresys dot com> > * Dave Thompson <dthompson at tresys dot com> > * Josh Adams <jadams at tresys dot com> > * Mike Beckerle <mbeckerle at tresys dot com> > * Steve Lawrence <slawrence at tresys dot com> > * Taylor Wise <twise at tresys dot com> > > === Affiliations === > > * Beth Finnegan (Tresys Technology) > * Dave Thompson (Tresys Technology) > * Josh Adams (Tresys Technology) > * Mike Beckerle (Tresys Technology) > * Steve Lawrence (Tresys Technology) > * Taylor Wise (Tresys Technology) > > == Sponsors == > > === Champion === > > * TBD > > === Nominated Mentors === > > * TBD > > === Sponsoring Entity === > > We request the Apache Incubator to sponsor this project. 
> > --------------------------------------------------------------------- > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > For additional commands, e-mail: general-h...@incubator.apache.org