Hi Steve,

It was not so much the lack of committers as it was the current lack of diversity among them. That is not a blocker for entry to Incubation.
I am willing to be one of the Mentors. Once there are at least two more, we can push forward.

Regards,
Dave

> On Aug 1, 2017, at 5:09 AM, Steve Lawrence <stephen.d.lawre...@gmail.com>
> wrote:
>
> Discussions have died down, and I think the consensus from the responses
> is that the issues are 1) the lack of committers and 2) the lack of a
> champion and mentors. We hope to address #1 and grow the community as
> part of incubation. Is anyone interested in being a champion or mentor
> and helping us with #2?
>
> Thanks,
> - Steve
>
> On 07/26/2017 04:06 PM, Chris Mattmann wrote:
>> This sounds like a very interesting project.
>>
>> I don’t have the time to mentor at the moment but I will keep a close eye on
>> it.
>>
>> Cheers,
>> Chris Mattmann
>>
>> On 7/25/17, 11:53 AM, "McHenry, Kenton Guadron" <mche...@illinois.edu> wrote:
>>
>> Hi Dave,
>>
>> The developers that were at NCSA have moved on to other organizations.
>> While we still leverage Daffodil and are very much interested in seeing it
>> move forward, development is currently done by the Tresys team. Agreed on
>> the synergy with Tika.
>>
>> Kenton McHenry, Ph.D.
>> Principal Research Scientist, Adjunct Assistant Professor of Computer
>> Science
>> Deputy Director of the Scientific Software & Applications Division
>> National Center for Supercomputing Applications, University of Illinois
>> at Urbana-Champaign
>>
>> On Jul 24, 2017, at 1:55 PM, Dave Fisher <dave2w...@comcast.net> wrote:
>>
>> Hi Kenton,
>>
>> Is there any reason that you and others from the NCSA are not Initial
>> Committers? That would make this proposal stronger.
>>
>> Regarding Apache Tika - it relies on other projects including Apache POI
>> and Apache PDFBox. They are pragmatic about what is used. If Daffodil works
>> to expand then I think that there would be good synergy between the
>> projects.
>> I know as a POI PMC member that the POI community has
>> significantly benefited from the Tika community, some of whom are from MITRE.
>>
>> To date Tika has not emphasized structured data, although they do extract
>> content from Excel and OpenOffice.
>>
>> I am intrigued.
>>
>> Regards,
>> Dave
>>
>> On Jul 24, 2017, at 10:55 AM, McHenry, Kenton Guadron
>> <mche...@illinois.edu> wrote:
>>
>> Yes, DFDL and its open source implementation Daffodil are more about file
>> formats and getting access to the entirety of a file's contents in a
>> consistent way through machine readable specifications. The work has
>> implications in the area of digital preservation, allowing one to preserve
>> these machine readable specifications rather than all the tools needed to
>> open/save a file in order to work with it. Imagine someone developing
>> graphics software to work with 3D models and not having to worry about the
>> hundreds of formats out there for 3D meshes (whether there are tools for
>> opening the files and whether they can get access to those tools, whether
>> the spec is available and worrying about how complex that spec is to
>> implement, etc.), and simply building their code around the contents (e.g.
>> vertices, faces, etc.). One could come up with similar scenarios for other
>> data types (documents, images, videos, audio, depth data, numeric data).
>> Ideally, tools built to support DFDL could someday support any format for
>> that type without the developer having to worry about the details of how
>> that data is represented within a file.
>>
>> Kenton McHenry, Ph.D.
>> Principal Research Scientist, Adjunct Assistant Professor of Computer
>> Science
>> Deputy Director of the Scientific Software & Applications Division
>> National Center for Supercomputing Applications, University of Illinois
>> at Urbana-Champaign
>>
>> On Jul 24, 2017, at 10:30 AM, Steve Lawrence
>> <stephen.d.lawre...@gmail.com> wrote:
>>
>> I'll preface this saying that I don't have a ton of experience with
>> Apache Tika. But based on my understanding, Tika and Daffodil do have
>> somewhat similar goals, but reach them in different ways. For example,
>> Tika requires that one writes /code/ to perform data extraction, usually
>> relying on existing Java libraries to extract the desired metadata. The
>> downside to this is that code can be buggy, and libraries might not even
>> exist for formats of interest (especially common with legacy and
>> military data).
>>
>> Daffodil, on the other hand, does not require one to write any code.
>> Instead, one writes a DFDL Schema (similar to XML Schema, with DFDL
>> annotations) that fully describes the data, which Daffodil then uses to
>> convert the data to XML/JSON for extraction. So adding support for a new
>> format means writing a new schema rather than new code. And less code
>> generally means fewer bugs. Also, for secure systems that require
>> certification, generally speaking, it is easier to certify a schema as
>> compared to code.
>>
>> We certainly don't believe that Daffodil could replace Tika, but it does
>> have the potential to add new functionality to Tika for formats that do
>> not have existing libraries. One of our goals is to look into
>> integrating Daffodil support into tools like Tika. We'd love to hear
>> from Tika devs if this is something they'd be interested in.
>>
>> I'll also add that whereas Tika tends to focus primarily on metadata,
>> DFDL schemas usually describe an entire file format down to the byte, so
>> one can extract more than just metadata, including text and binary
>> data. Further differentiating, Daffodil has support for serializing data
>> (called unparse) from the XML/JSON representation, allowing one to
>> transform or filter data as well. We don't believe this feature is all
>> that applicable to Tika, but it may be useful to other technologies such as
>> filtering or data fuzzing technologies.
>>
>> - Steve
>>
>> On 07/24/2017 10:59 AM, Mike Drob wrote:
>> What is the relationship between Daffodil and something like Apache Tika's
>> extraction engine?
>>
>> On Mon, Jul 24, 2017 at 9:53 AM, Steve Lawrence
>> <stephen.d.lawre...@gmail.com> wrote:
>>
>> Dear Apache Incubator Community,
>>
>> We would like to start a discussion around a proposal to bring Daffodil
>> into the Apache Incubator. Daffodil is an implementation of the DFDL
>> specification used to convert between fixed format data and XML/JSON.
>>
>> The draft proposal can be found in the wiki at the following URL:
>>
>> https://wiki.apache.org/incubator/DaffodilProposal
>>
>> We do not yet have a champion or mentors, but it was recommended that we
>> create a proposal and send it to this list to potentially find those
>> that might be interested. The text for the draft proposal is found
>> below. We look forward to your input.
>>
>> Thanks,
>> -Steve
>>
>> = Daffodil Proposal =
>>
>> == Abstract ==
>>
>> Daffodil is an implementation of the Data Format Description Language
>> (DFDL) used to convert between fixed format data and XML/JSON.
>>
>> == Proposal ==
>>
>> The Data Format Description Language (DFDL) is a specification,
>> developed by the Open Grid Forum, capable of describing many data
>> formats, including both textual and binary, scientific and numeric,
>> legacy and modern, commercial record-oriented, and many industry and
>> military standards. It defines a language that is a subset of W3C XML
>> Schema to describe the logical format of the data, and annotations
>> within the schema to describe the physical representation.
>>
>> Daffodil is an open source implementation of the DFDL specification that
>> uses these DFDL schemas to parse fixed format data into an infoset,
>> which is most commonly represented as either XML or JSON. This allows
>> the use of well-established XML or JSON technologies and libraries to
>> consume, inspect, and manipulate fixed format data in existing
>> solutions. Daffodil is also capable of the reverse by serializing or
>> "unparsing" an XML or JSON infoset back to the original data format.
>>
>> == Background ==
>>
>> Many different software solutions need to consume and manage data,
>> including data directed routing, databases, data analysis, data
>> cleansing, data visualizing, and more. A key aspect of such solutions is
>> the need to transform the data into an easily consumable format.
>> Usually, this means that for each unique data format, one develops a
>> tool that can read and extract the necessary information, often leading
>> to ad-hoc and data-format-specific description systems. Such systems are
>> often proprietary, not well tested, and incompatible, leading to vendor
>> lock-in, flawed software, and increased training costs. DFDL is a new
>> standard, with version 1.0 completed in October of 2016, that solves
>> these problems by defining an open standard to describe many different
>> data formats and how to parse and unparse between the data and XML/JSON.
>>
>> Two closed source implementations of DFDL currently exist.
>> The first was created by IBM and is now part of their IBM® Integration
>> Bus product. The second, created by the European Space Agency, is called
>> DFDL4S or "DFDL for Space" and is targeted at the challenges of their
>> satellite data processing.
>>
>> Around 2005, Pacific Northwest National Lab created Defuddle, built as
>> an open source implementation and proof of concept of the draft DFDL
>> specification and a test bed to feed new concepts into specification
>> development. Primary development of Defuddle was eventually taken over
>> by the National Center for Supercomputing Applications (NCSA). However,
>> due to evolution of the DFDL specification and architectural and
>> performance issues with Defuddle, around 2009, NCSA restarted the
>> project with the new name of Daffodil, with a goal of implementing the
>> complete DFDL specification. Daffodil development continued at NCSA
>> until around 2012, at which point development slowed due to budget
>> limitations. Shortly thereafter, primary development was picked up by
>> Tresys Technology where it continues today, with contributions from
>> other entities such as the Navy Research Lab, the Air Force Research
>> Lab, MITRE, and Booz Allen Hamilton. In February of 2015, Daffodil
>> version 1.0.0 was released, including support for the DFDL features
>> needed to parse many common file formats. Daffodil version 2.0.0 is
>> expected to be released in August of 2017, which will include unparse
>> support with one-to-one parsing feature parity.
>>
>> Entities including IBM, MITRE, NATO NCI Agency, Northrop-Grumman, Quark
>> Security, Raytheon, and Tresys Technology have developed DFDL schemas
>> for many data formats from varying technology domains, including PNG,
>> GIF, BMP, PCAP, HL7, EDIFACT, NACHA, vCard, iCalendar, and MIL-STD-2045,
>> many of which are publicly available on the DFDL Schemas GitHub.
>> There are also a number of military-application data formats, the
>> specifications of which are not public, which have historically been
>> very difficult and expensive to process, and for which DFDL schemas have
>> been created or are actively in development; these include
>> MIL-STD-6040/USMTF ATO, MIL-STD-6017/VMF, and MIL-STD-6016/NATO STANAG 5516
>> (aka "Link16").
>>
>> == Rationale ==
>>
>> Numerous software solutions exist that consume, inspect, analyze, and
>> transform data, many of which can be found in the Apache Software
>> Foundation (ASF). In order for tools like these to consume new types of
>> data, custom extensions are usually required, often with high
>> development and testing costs. Daffodil fills a clear gap in many of
>> these solutions, providing a simple and low cost way to transform data
>> to XML or JSON, which many of these tools natively support already. With
>> the upcoming 2.0.0 release, the Daffodil project will have achieved a
>> level of functionality in both parse and unparse that, when integrated
>> into existing solutions, could provide for a new method to quickly
>> enable support for new data formats.
>>
>> == Initial Goals ==
>>
>> * Relicense the existing code from the University of Illinois/NCSA Open
>>   Source License to the Apache License version 2.0, working with Apache
>>   Legal to ensure correctness, and with Daffodil contributors to get
>>   their permission.
>> * Move the existing codebase, documentation, bugs, and mailing lists to
>>   the Apache hosted infrastructure.
>> * Establish a formal release process and schedule, allowing for
>>   dependable release cycles in a manner consistent with the Apache
>>   development process.
>> * Build relationships with ASF projects to add Daffodil support where
>>   appropriate.
>> * Grow the community to establish a diversity of background and expertise.
>>
>> == Current Status ==
>>
>> === Meritocracy ===
>>
>> All initial committers are familiar with the principles of meritocracy.
>> The Daffodil project has followed the model of meritocracy in the past,
>> providing multiple outside entities commit access based on the quality
>> of their contributions. In order to grow the Daffodil user base and
>> development community, we are dedicated to continuing to operate
>> Daffodil as a meritocracy.
>>
>> A key ingredient in a meritocracy of developers is open group code
>> review. The Daffodil project has operated in this mode throughout its
>> existence and this provides a forum to improve the code, verify code
>> quality, and educate new developers on the code base.
>>
>> === Community ===
>>
>> Daffodil has a small community of users and developers. Although primary
>> Daffodil development is done by Tresys Technology, a handful of other
>> contributions have come from other entities including the Navy Research
>> Lab, the Air Force Research Lab, MITRE, and Booz Allen Hamilton. In
>> addition to developers, multiple users of Daffodil have created DFDL
>> schemas, including entities such as MITRE, IBM, Raytheon, Quark
>> Security, and Tresys Technology. The DFDL Schemas GitHub community has
>> been created as a place for DFDL schemas to be published. The Daffodil
>> project also makes use of mailing lists, !HipChat, and Confluence
>> Questions to build a community of users and a system for support.
>>
>> === Core Developers ===
>>
>> The core developers of Daffodil are employed by Tresys Technology. We
>> will work to grow the community among a more diverse set of developers
>> and industries.
>>
>> === Alignment ===
>>
>> Daffodil was created as an open source project with a philosophy
>> consistent with The Apache Way. A strong belief in meritocracy,
>> community involvement in decisions, openness, and ensuring a high level
>> of quality in code, documentation, and testing are some of our shared
>> core beliefs.
>>
>> Further, as mentioned in the Rationale section, Daffodil fills a gap
>> that exists in many ASF projects, including !NiFi, Spark, Storm, Hadoop,
>> Tika, and others. In order for tools like these to consume new types of
>> data, custom extensions are usually required. Rather than create such
>> extensions, Daffodil provides an easy and standards-compliant way to
>> transform data to XML or JSON, which many of these tools already
>> natively support.
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>>
>> The current core developers are the leading contributors in the space of
>> DFDL and wish to see it flourish. Though there is some risk that the
>> initial committers all come from the same company, a goal of entering
>> into incubation is to grow the development community to minimize the
>> risk of reliance on a single company.
>>
>> === Inexperience with Open Source ===
>>
>> The Daffodil project began as an open source project and has continued
>> that model throughout development. This includes public bug tracking,
>> git revision control, automated builds and tests, and a public wiki for
>> documentation.
>>
>> Additionally, the current core developers and initial committers all
>> work for a company that relies on, believes in, promotes, and has led or
>> contributed to many open source software projects, including SELinux
>> Userspace, OpenSCAP, CLIP, refpolicy, setools, RPM, and others. As such,
>> there is low risk related to inexperience with open source software and
>> processes.
>>
>> === Homogeneous Developers ===
>>
>> The proposed initial committers come from a single entity, though we are
>> committed to growing the Daffodil development community to include a
>> broad group of additional committers from a wide array of industries.
>>
>> === Reliance on Salaried Developers ===
>>
>> The proposed initial committers are paid by their employer to contribute
>> to the Daffodil project.
>> We expect that Daffodil development will
>> continue with salaried developers, and are committed to growing the
>> community to include non-salaried developers as well.
>>
>> === Relationship with other Apache Projects ===
>>
>> As mentioned in the Alignment section, Daffodil fills a clear gap in
>> numerous other ASF projects that consume and manage large amounts of data.
>>
>> As a specific example, Daffodil developers have created a Daffodil
>> Apache !NiFi Processor, currently in use in data transfer solutions,
>> which allows one to ingest non-native data into an Apache !NiFi pipeline
>> as XML or JSON. This processor was well received by the Apache !NiFi
>> developers, with positive comments about the concise API and how it
>> could handle non-native data. Daffodil developers have also successfully
>> prototyped integration with Apache Spark. We believe Daffodil could
>> provide a strong benefit to many other ASF projects that handle fixed
>> format data. We anticipate working closely with such ASF projects to
>> include Daffodil where applicable to increase their ability to support
>> new data formats with minimal effort.
>>
>> Daffodil also depends on existing ASF projects, including Apache Commons
>> and Apache Xerces.
>>
>> === An Excessive Fascination with the Apache Brand ===
>>
>> Although the Apache brand may certainly help to attract more
>> contributors, publicity is not the reason for this proposal. We believe
>> Daffodil could provide a great benefit to the ASF and the numerous data
>> focused projects that comprise it, as described in the Rationale and
>> Alignment sections. We hope to build a strong and vibrant community
>> built around The Apache Way, and not dependent on a single company.
>>
>> === Documentation ===
>>
>> Daffodil documentation can be found at:
>>
>> * https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Daffodil%3A+Open+Source+DFDL
>>
>> Information about DFDL can be found at:
>>
>> * https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
>> * https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0.0/com.ibm.etools.mft.doc/df20060_.htm
>>
>> Public examples of DFDL Schemas can be found at:
>>
>> * https://github.com/DFDLSchemas
>>
>> == Initial Source ==
>>
>> The Daffodil git repo goes back to mid-2011 with approximately 20
>> different contributors and feedback from many users and developers. The
>> core codebase is written in Scala and includes both a Scala and Java
>> API, along with Javadocs and Scaladocs for API usage. The initial code
>> will come from the git repository currently hosted by NCSA at the
>> University of Illinois:
>>
>> https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/daffodil/
>>
>> == Source and Intellectual Property Submission ==
>>
>> The complete Daffodil code is licensed under the University of
>> Illinois/NCSA Open Source License. Much of the current codebase has been
>> developed by Tresys Technology, who is open to relicensing the code to
>> the Apache License version 2.0 and donating the source to the ASF.
>> Contacts at NCSA are also open to relicensing their contributions to
>> Apache v2. We plan to contact the other contributors and ask for
>> permission to relicense and donate their contributed code. For those
>> who decline or whom we cannot contact, their code will be removed or
>> replaced. We will work closely with Apache Legal to ensure all issues
>> related to relicensing are acceptable.
>>
>> == External Dependencies ==
>>
>> We believe all current dependencies are compatible with the ASF
>> guidelines. Our dependency licenses come from the following license
>> styles: Apache v2, BSD, MIT, and ICU.
>> The list of current Daffodil
>> dependencies and their licenses is documented here:
>>
>> https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Dependencies+and+Licenses
>>
>> == Cryptography ==
>>
>> None
>>
>> == Required Resources ==
>>
>> === Mailing Lists ===
>>
>> * comm...@daffodil.incubator.apache.org
>> * d...@daffodil.incubator.apache.org
>> * priv...@daffodil.incubator.apache.org
>> * u...@daffodil.incubator.apache.org
>>
>> === Source Control ===
>>
>> git://git.apache.org/incubator-daffodil.git
>>
>> === Issue Tracking ===
>>
>> JIRA Daffodil (DFDL)
>>
>> === Initial Committers ===
>>
>> * Beth Finnegan <efinnegan at tresys dot com>
>> * Dave Thompson <dthompson at tresys dot com>
>> * Josh Adams <jadams at tresys dot com>
>> * Mike Beckerle <mbeckerle at tresys dot com>
>> * Steve Lawrence <slawrence at tresys dot com>
>> * Taylor Wise <twise at tresys dot com>
>>
>> === Affiliations ===
>>
>> * Beth Finnegan (Tresys Technology)
>> * Dave Thompson (Tresys Technology)
>> * Josh Adams (Tresys Technology)
>> * Mike Beckerle (Tresys Technology)
>> * Steve Lawrence (Tresys Technology)
>> * Taylor Wise (Tresys Technology)
>>
>> == Sponsors ==
>>
>> === Champion ===
>>
>> * TBD
>>
>> === Nominated Mentors ===
>>
>> * TBD
>>
>> === Sponsoring Entity ===
>>
>> We request the Apache Incubator to sponsor this project.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org