Discussions have died down, and I think the consensus from the responses
is that the issues are 1) the lack of committers and 2) the lack of a
champion and mentors. We hope to address #1 and grow the community as
part of incubation. Is anyone interested in being a champion or mentor
and helping us with #2?

Thanks,
- Steve

On 07/26/2017 04:06 PM, Chris Mattmann wrote:
> This sounds like a very interesting project. 
> 
> I don’t have the time to mentor at the moment but I will keep a close eye on 
> it.
> 
> Cheers,
> Chris Mattmann
> 
> 
> 
> 
> On 7/25/17, 11:53 AM, "McHenry, Kenton Guadron" <mche...@illinois.edu> wrote:
> 
>     Hi Dave,
>     
>     The developers that were at NCSA have moved on to other organizations.  
> While we still leverage Daffodil and are very much interested in seeing it 
> move forward, development is currently done by the Tresys team.  Agreed on 
> the synergy with Tika.
>     
>     Kenton McHenry, Ph.D.
>     Principal Research Scientist, Adjunct Assistant Professor of Computer 
> Science
>     Deputy Director of the Scientific Software & Applications Division
>     National Center for Supercomputing Applications, University of Illinois 
> at Urbana-Champaign
>     
>     On Jul 24, 2017, at 1:55 PM, Dave Fisher <dave2w...@comcast.net> wrote:
>     
>     Hi Kenton,
>     
>     Is there any reason that you and others from the NCSA are not Initial 
> Committers? That would make this proposal stronger.
>     
>     Regarding Apache Tika - it relies on other projects including Apache POI 
> and Apache PDFBox. They are pragmatic about what is used. If Daffodil works 
> to expand then I think that there would be good synergy between the projects. 
> I know as a POI PMC member that the POI community has significantly benefited 
> from the Tika community, some of whom are from MITRE.
>     
>     To date Tika has not emphasized structured data, although they do extract 
> content from Excel and OpenOffice.
>     
>     I am intrigued.
>     
>     Regards,
>     Dave
>     
>     On Jul 24, 2017, at 10:55 AM, McHenry, Kenton Guadron <mche...@illinois.edu> wrote:
>     
>     Yes, DFDL and its open source implementation Daffodil are more about file 
> formats and getting access to the entirety of a file's contents in a 
> consistent way through machine readable specifications.  The work has 
> implications in the area of digital preservation allowing one to preserve 
> these machine readable specifications rather than all the tools needed to 
> open/save a file in order to work with it.  Imagine someone developing 
> graphics software to work with 3D models and not having to worry about the 
> hundreds of formats out there for 3D meshes (whether there are tools for 
> opening the files and whether they can get access to those tools, whether the 
> spec is available and worrying about how complex that spec is to implement, 
> etc.), and simply building their code around the contents (e.g. vertices, 
> faces, etc.).  One could come up with similar scenarios for other data types 
> (documents, images, videos, audio, depth data, numeric data).  Ideally, tools 
> built to support DFDL could someday support any format for that type 
> without the developer having to worry about the details of how that data is 
> represented within a file.
>     
>     Kenton McHenry, Ph.D.
>     Principal Research Scientist, Adjunct Assistant Professor of Computer 
> Science
>     Deputy Director of the Scientific Software & Applications Division
>     National Center for Supercomputing Applications, University of Illinois 
> at Urbana-Champaign
>     
>     On Jul 24, 2017, at 10:30 AM, Steve Lawrence <stephen.d.lawre...@gmail.com> wrote:
>     
>     I'll preface this saying that I don't have a ton of experience with
>     Apache Tika. But based on my understanding, Tika and Daffodil do have
>     somewhat similar goals, but reach them in different ways. For example,
>     Tika requires that one writes /code/ to perform data extraction, usually
>     relying on existing Java libraries to extract the desired metadata. The
>     downside to this is that code can be buggy, and libraries might not even
>     exist for formats of interest (especially common with legacy and
>     military data).
>     
>     Daffodil, on the other hand, does not require one to write any code.
>     Instead, one writes a DFDL Schema (similar to XML Schema, with DFDL
>     annotations) that fully describes the data, which Daffodil then uses to
>     convert the data to XML/JSON for extraction. So adding support for a new
>     format means writing a new schema rather than new code. And less code
>     generally means fewer bugs. Also, for secure systems that require
>     certification, generally speaking, it is easier to certify a schema as
>     compared to code.
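
To make the schema-versus-code distinction concrete, here is a minimal Python sketch. It is not DFDL and does not use Daffodil's API; the field table layout, `parse_record`, and both sample formats are hypothetical stand-ins chosen only to show one generic engine driven by a declarative description.

```python
# Illustration only: a toy declarative description, not DFDL or Daffodil.
from typing import Callable

# Each field is (name, width in characters, converter). All names are made up.
Field = tuple[str, int, Callable[[str], object]]

PERSON_FORMAT: list[Field] = [
    ("name", 10, str.strip),
    ("age", 3, int),
]

POINT_FORMAT: list[Field] = [
    ("x", 5, float),
    ("y", 5, float),
]


def parse_record(description: list[Field], record: str) -> dict[str, object]:
    """Generic engine: walk the description and slice fixed-width fields."""
    result: dict[str, object] = {}
    offset = 0
    for name, width, convert in description:
        result[name] = convert(record[offset:offset + width])
        offset += width
    if offset != len(record):
        raise ValueError(f"expected {offset} characters, got {len(record)}")
    return result


# Supporting the second format required a new description, not new parsing code.
person = parse_record(PERSON_FORMAT, "Ada       036")
point = parse_record(POINT_FORMAT, " 1.50-2.25")
```

In this sketch only the two description tables differ between formats; `parse_record` is shared, which is the property the paragraph above attributes to schema-driven parsing.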
>     
>     We certainly don't believe that Daffodil could replace Tika, but it does
>     have the potential to add new functionality to Tika for formats that do
>     not have existing libraries. One of our goals is to look into
>     integrating Daffodil support into tools like Tika. We'd love to hear
>     from Tika devs if this is something they'd be interested in.
>     
>     I'll also add that whereas Tika tends to focus primarily on metadata,
>     DFDL schemas usually describe an entire file format down to the byte, so
>     one can extract more than just metadata, including text and binary
>     data. Further differentiating, Daffodil has support for serializing data
>     (called unparse) from the XML/JSON representation, allowing one to
>     transform or filter data as well. We don't believe this feature is all
>     that applicable to Tika, but it may be useful to other technologies,
>     such as filtering or data fuzzing.
>     
>     - Steve
>     
>     
>     On 07/24/2017 10:59 AM, Mike Drob wrote:
>     What is the relationship between Daffodil and something like Apache Tika's
>     extraction engine?
>     
>     On Mon, Jul 24, 2017 at 9:53 AM, Steve Lawrence <stephen.d.lawre...@gmail.com> wrote:
>     
>     Dear Apache Incubator Community,
>     
>     We would like to start a discussion around a proposal to bring Daffodil
>     into the Apache Incubator. Daffodil is an implementation of the DFDL
>     specification used to convert between fixed format data and XML/JSON.
>     
>     The draft proposal can be found in the wiki at the following URL:
>     
>     https://wiki.apache.org/incubator/DaffodilProposal
>     
>     We do not yet have a champion or mentors, but it was recommended that we
>     create a proposal and send it to this list to potentially find those
>     that might be interested. The text for the draft proposal is found
>     below. We look forward to your input.
>     
>     Thanks,
>     -Steve
>     
>     
>     = Daffodil Proposal =
>     
>     == Abstract ==
>     
>     Daffodil is an implementation of the Data Format Description Language
>     (DFDL) used to convert between fixed format data and XML/JSON.
>     
>     == Proposal ==
>     
>     The Data Format Description Language (DFDL) is a specification,
>     developed by the Open Grid Forum, capable of describing many data
>     formats, including both textual and binary, scientific and numeric,
>     legacy and modern, commercial record-oriented, and many industry and
>     military standards. It defines a language that is a subset of W3C XML
>     schema to describe the logical format of the data, and annotations
>     within the schema to describe the physical representation.
>     
>     Daffodil is an open source implementation of the DFDL specification that
>     uses these DFDL schemas to parse fixed format data into an infoset,
>     which is most commonly represented as either XML or JSON. This allows
>     the use of well-established XML or JSON technologies and libraries to
>     consume, inspect, and manipulate fixed format data in existing
>     solutions. Daffodil is also capable of the reverse by serializing or
>     "unparsing" an XML or JSON infoset back to the original data format.
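
The parse/unparse round trip described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than Daffodil itself: the 7-byte record layout, the field names, and the `parse`/`unparse` helpers are all hypothetical, and the standard `struct` module stands in for a real DFDL schema.

```python
# Illustration only: struct stands in for a DFDL schema; nothing here is Daffodil's API.
import json
import struct

# Hypothetical 7-byte big-endian record: 2-byte id, 1-byte flags, 4-byte count.
LAYOUT = struct.Struct(">HBI")
FIELD_NAMES = ("id", "flags", "count")


def parse(data: bytes) -> str:
    """Fixed-format bytes -> JSON infoset."""
    values = LAYOUT.unpack(data)
    return json.dumps(dict(zip(FIELD_NAMES, values)))


def unparse(infoset: str) -> bytes:
    """JSON infoset -> fixed-format bytes (the reverse direction)."""
    fields = json.loads(infoset)
    return LAYOUT.pack(*(fields[name] for name in FIELD_NAMES))


original = bytes.fromhex("002a01000003e8")
infoset = parse(original)

# Manipulate the data with ordinary JSON tooling, then serialize it back.
edited = json.loads(infoset)
edited["flags"] = 0
filtered = unparse(json.dumps(edited))
```

Here `unparse(infoset)` reproduces `original` byte for byte, and `filtered` differs only in the cleared flags byte, which is the kind of transform-then-serialize workflow that unparsing enables.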
>     
>     == Background ==
>     
>     Many different software solutions need to consume and manage data,
>     including data directed routing, databases, data analysis, data
>     cleansing, data visualizing, and more. A key aspect of such solutions is
>     the need to transform the data into an easily consumable format.
>     Usually, this means that for each unique data format, one develops a
>     tool that can read and extract the necessary information, often leading
>     to ad-hoc and data-format-specific description systems. Such systems are
>     often proprietary, not well tested, and incompatible, leading to vendor
>     lock-in, flawed software, and increased training costs. DFDL is a new
>     standard, with version 1.0 completed in October of 2016, that solves
>     these problems by defining an open standard to describe many different
>     data formats and how to parse and unparse between the data and XML/JSON.
>     
>     Two closed source implementations of DFDL currently exist. The first was
>     created by IBM and is now part of their IBM® Integration Bus product.
>     The second, called DFDL4S or "DFDL for Space", was created by the
>     European Space Agency and targets the challenges of their satellite
>     data processing.
>     
>     Around 2005, Pacific Northwest National Lab created Defuddle, built as
>     an open source implementation and proof of concept of the draft DFDL
>     specification and a test bed to feed new concepts into specification
>     development. Primary development of Defuddle was eventually taken over
>     by the National Center for Supercomputing Applications (NCSA). However,
>     due to evolution of the DFDL specification and architectural and
>     performance issues with Defuddle, around 2009, NCSA restarted the
>     project with the new name of Daffodil, with a goal of implementing the
>     complete DFDL specification. Daffodil development continued at NCSA
>     until around 2012, at which point development slowed due to budget
>     limitations. Shortly thereafter, primary development was picked up by
>     Tresys Technology where it continues today, with contributions from
>     other entities such as the Navy Research Lab, the Air Force Research
>     Lab, MITRE, and Booz Allen Hamilton. In February of 2015, Daffodil
>     version 1.0.0 was released, including support for the DFDL features
>     needed to parse many common file formats. Daffodil version 2.0.0 is
>     expected to be released in August of 2017, which will include unparse
>     support at feature parity with parsing.
>     
>     Entities including IBM, MITRE, NATO NCI Agency, Northrop-Grumman, Quark
>     Security, Raytheon, and Tresys Technology have developed DFDL schemas
>     for many data formats from varying technology domains, including PNG,
>     GIF, BMP, PCAP, HL7, EDIFACT, NACHA, vCard, iCalendar, and MIL-STD-2045,
>     many of which are publicly available on the DFDL Schemas GitHub. There
>     are also a number of military-application data formats, the
>     specifications of which are not public, which have historically been
>     very difficult and expensive to process, and for which DFDL schemas have
>     been created or are actively in development; these include
>     MIL-STD-6040/USMTF ATO, MIL-STD-6017/VMF, and MIL-STD-6016/NATO STANAG
>     5516 (aka "Link16").
>     
>     == Rationale ==
>     
>     Numerous software solutions exist that consume, inspect, analyze, and
>     transform data, many of which can be found in the Apache Software
>     Foundation (ASF). In order for tools like these to consume new types of
>     data, custom extensions are usually required, often with high
>     development and testing costs. Daffodil fills a clear gap in many of
>     these solutions, providing a simple and low cost way to transform data
>     to XML or JSON, which many of these tools natively support already. With
>     the upcoming 2.0.0 release, the Daffodil project will have achieved a
>     level of functionality in both parse and unparse that, when integrated
>     into existing solutions, could provide for a new method to quickly
>     enable support for new data formats.
>     
>     == Initial Goals ==
>     
>     * Relicense the existing code from the University of Illinois/NCSA Open
>     Source License to the Apache License version 2.0, working with Apache
>     Legal to ensure correctness, and with Daffodil contributors to get
>     their permission.
>     * Move the existing codebase, documentation, bugs, and mailing lists to
>     the Apache hosted infrastructure
>     * Establish a formal release process and schedule, allowing for
>     dependable release cycles in a manner consistent with the Apache
>     development process.
>     * Build relationships with ASF projects to add Daffodil support where
>     appropriate
>     * Grow the community to establish a diversity of background and expertise.
>     
>     == Current Status ==
>     
>     === Meritocracy ===
>     
>     All initial committers are familiar with the principles of meritocracy.
>     The Daffodil project has followed the model of meritocracy in the past,
>     providing multiple outside entities commit access based on the quality
>     of their contributions. In order to grow the Daffodil user base and
>     development community, we are dedicated to continuing to operate
>     Daffodil as a meritocracy.
>     
>     A key ingredient in a meritocracy of developers is open group code
>     review. The Daffodil project has operated in this mode throughout its
>     existence and this provides a forum to improve the code, verify code
>     quality, and educate new developers on the code base.
>     
>     === Community ===
>     
>     Daffodil has a small community of users and developers. Although primary
>     Daffodil development is done by Tresys Technology, a handful of other
>     contributions have come from other entities including the Navy Research
>     Lab, the Air Force Research Lab, MITRE, and Booz Allen Hamilton. In
>     addition to developers, multiple users of Daffodil have created DFDL
>     schemas, including entities such as MITRE, IBM, Raytheon, Quark
>     Security, and Tresys Technology. The DFDL Schemas GitHub community has
>     been created as a place for DFDL schemas to be published. The Daffodil
>     project also makes use of mailing lists, !HipChat, and Confluence
>     Questions to build a community of users and a system for support.
>     
>     === Core Developers ===
>     
>     The core developers of Daffodil are employed by Tresys Technology. We
>     will work to grow the community among a more diverse set of developers
>     and industries.
>     
>     === Alignment ===
>     
>     Daffodil was created as an open source project with a philosophy
>     consistent with The Apache Way. A strong belief in meritocracy,
>     community involvement in decisions, openness, and ensuring a high level
>     of quality in code, documentation, and testing are some of our shared
>     core beliefs.
>     
>     Further, as mentioned in the Rationale section, Daffodil fills a gap
>     that exists in many ASF projects, including !NiFi, Spark, Storm, Hadoop,
>     Tika, and others. In order for tools like these to consume new types of
>     data, custom extensions are usually required. Rather than create such
>     extensions, Daffodil provides an easy and standards-compliant way to
>     transform data to XML or JSON, which many of these tools already
>     natively support.
>     
>     == Known Risks ==
>     
>     === Orphaned Products ===
>     
>     The current core developers are the leading contributors in the space of
>     DFDL and wish to see it flourish. Though there is some risk that the
>     initial committers all come from the same company, a goal of entering
>     into incubation is to grow the development community to minimize the
>     risk of reliance on a single company.
>     
>     === Inexperience with Open Source ===
>     
>     The Daffodil project began as an open source project and has continued
>     that model throughout development. This includes public bug tracking,
>     git revision control, automated builds and tests, and a public wiki for
>     documentation.
>     
>     Additionally, the current core developers and initial committers all
>     work for a company that relies on, believes in, promotes, and has led or
>     contributed to many open source software projects, including SELinux
>     Userspace, OpenSCAP, CLIP, refpolicy, setools, RPM, and others. As such,
>     there is low risk related to inexperience with open source software and
>     processes.
>     
>     === Homogeneous Developers ===
>     
>     The proposed initial committers come from a single entity, though we are
>     committed to growing the Daffodil development community to include a
>     broad group of additional committers from a wide array of industries.
>     
>     === Reliance on Salaried Developers ===
>     
>     The proposed initial committers are paid by their employer to contribute
>     to the Daffodil project. We expect that Daffodil development will
>     continue with salaried developers, and are committed to growing the
>     community to include non-salaried developers as well.
>     
>     === Relationship with other Apache Projects ===
>     
>     As mentioned in the Alignment section, Daffodil fills a clear gap in
>     numerous other ASF projects that consume and manage large amounts of data.
>     
>     As a specific example, Daffodil developers have created a Daffodil
>     Apache !NiFi Processor, currently in use in data transfer solutions,
>     which allows one to ingest non-native data into an Apache !NiFi pipeline
>     as XML or JSON. This processor was well received by the Apache !NiFi
>     developers, with positive comments about the concise API and how it
>     could handle non-native data. Daffodil developers have also successfully
>     prototyped integration with Apache Spark. We believe Daffodil could
>     provide a strong benefit to many other ASF projects that handle fixed
>     format data. We anticipate working closely with such ASF projects to
>     include Daffodil where applicable to increase their ability to support
>     new data formats with minimal effort.
>     
>     Daffodil also depends on existing ASF projects, including Apache Commons
>     and Apache Xerces.
>     
>     === An Excessive Fascination with the Apache Brand ===
>     
>     Although the Apache brand may certainly help to attract more
>     contributors, publicity is not the reason for this proposal. We believe
>     Daffodil could provide a great benefit to the ASF and the numerous data
>     focused projects that comprise it, as described in the Rationale and
>     Alignment sections. We hope to build a strong and vibrant community
>     built around The Apache Way, and not dependent on a single company.
>     
>     === Documentation ===
>     
>     Daffodil documentation can be found at:
>     
>     * https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Daffodil%3A+Open+Source+DFDL
>     
>     Information about DFDL can be found at:
>     
>     * https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
>     * https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0.0/com.ibm.etools.mft.doc/df20060_.htm
>     
>     Public examples of DFDL Schemas can be found at:
>     
>     * https://github.com/DFDLSchemas
>     
>     == Initial Source ==
>     
>     The Daffodil git repo goes back to mid-2011 with approximately 20
>     different contributors and feedback from many users and developers. The
>     core codebase is written in Scala and includes both a Scala and Java
>     API, along with Javadocs and Scaladocs for API usage. The initial code
>     will come from the git repository currently hosted by NCSA at the
>     University of Illinois:
>     
>     https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/daffodil/
>     
>     == Source and Intellectual Property Submission ==
>     
>     The complete Daffodil code is licensed under the University of
>     Illinois/NCSA Open Source License. Much of the current codebase has been
>     developed by Tresys Technology, which is open to relicensing the code
>     under the Apache License version 2.0 and donating the source to the ASF.
>     Contacts at NCSA are also open to relicensing their contributions to
>     Apache v2. We plan to contact the other contributors and ask for
>     permission to relicense and donate their contributed code. For those
>     that decline or we cannot contact, their code will be removed or
>     replaced. We will work closely with Apache Legal to ensure all issues
>     related to relicensing are acceptable.
>     
>     == External Dependencies ==
>     
>     We believe all current dependencies are compatible with the ASF
>     guidelines. Our dependency licenses come from the following license
>     styles: Apache v2, BSD, MIT, and ICU. The list of current Daffodil
>     dependencies and their licenses is documented here:
>     
>     https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Dependencies+and+Licenses
>     
>     == Cryptography ==
>     
>     None
>     
>     == Required Resources ==
>     
>     === Mailing Lists ===
>     
>     * comm...@daffodil.incubator.apache.org
>     * d...@daffodil.incubator.apache.org
>     * priv...@daffodil.incubator.apache.org
>     * u...@daffodil.incubator.apache.org
>     
>     === Source Control ===
>     
>     git://git.apache.org/incubator-daffodil.git
>     
>     === Issue Tracking ===
>     
>     JIRA Daffodil (DFDL)
>     
>     === Initial Committers ===
>     
>     * Beth Finnegan <efinnegan at tresys dot com>
>     * Dave Thompson <dthompson at tresys dot com>
>     * Josh Adams <jadams at tresys dot com>
>     * Mike Beckerle <mbeckerle at tresys dot com>
>     * Steve Lawrence <slawrence at tresys dot com>
>     * Taylor Wise <twise at tresys dot com>
>     
>     === Affiliations ===
>     
>     * Beth Finnegan (Tresys Technology)
>     * Dave Thompson (Tresys Technology)
>     * Josh Adams (Tresys Technology)
>     * Mike Beckerle (Tresys Technology)
>     * Steve Lawrence (Tresys Technology)
>     * Taylor Wise (Tresys Technology)
>     
>     == Sponsors ==
>     
>     === Champion ===
>     
>     * TBD
>     
>     === Nominated Mentors ===
>     
>     * TBD
>     
>     === Sponsoring Entity ===
>     
>     We request the Apache Incubator to sponsor this project.
>     
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>     For additional commands, e-mail: general-h...@incubator.apache.org
>
