This sounds like a very interesting project. 

I don’t have the time to mentor at the moment but I will keep a close eye on it.

Cheers,
Chris Mattmann




On 7/25/17, 11:53 AM, "McHenry, Kenton Guadron" <mche...@illinois.edu> wrote:

    Hi Dave,
    
    The developers that were at NCSA have moved on to other organizations.
    While we still leverage Daffodil and are very much interested in seeing it
    move forward, development is currently done by the Tresys team.  Agreed on
    the synergy with Tika.
    
    Kenton McHenry, Ph.D.
    Principal Research Scientist, Adjunct Assistant Professor of Computer Science
    Deputy Director of the Scientific Software & Applications Division
    National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
    
    On Jul 24, 2017, at 1:55 PM, Dave Fisher <dave2w...@comcast.net> wrote:
    
    Hi Kenton,
    
    Is there any reason that you and others from the NCSA are not Initial
    Committers? That would make this proposal stronger.
    
    Regarding Apache Tika - it relies on other projects including Apache POI
    and Apache PDFBox. They are pragmatic about what is used. If Daffodil works
    to expand then I think that there would be good synergy between the
    projects. I know as a POI PMC member that the POI community has
    significantly benefited from the Tika community, some of whom are from
    MITRE.
    
    To date Tika has not emphasized structured data, although they do extract
    content from Excel and OpenOffice.
    
    I am intrigued.
    
    Regards,
    Dave
    
    On Jul 24, 2017, at 10:55 AM, McHenry, Kenton Guadron <mche...@illinois.edu> wrote:
    
    Yes, DFDL and its open source implementation Daffodil are more about file
    formats and getting access to the entirety of a file's contents in a
    consistent way through machine-readable specifications.  The work has
    implications in the area of digital preservation, allowing one to preserve
    these machine-readable specifications rather than all the tools needed to
    open/save a file in order to work with it.  Imagine someone developing
    graphics software to work with 3D models and not having to worry about the
    hundreds of formats out there for 3D meshes (whether there are tools for
    opening the files and whether they can get access to those tools, whether
    the spec is available and how complex that spec is to implement, etc.),
    and simply building their code around the contents (e.g. vertices, faces,
    etc.).  One could come up with similar scenarios for other data types
    (documents, images, videos, audio, depth data, numeric data).  Ideally,
    tools built to support DFDL could someday support any format for that type
    without the developer having to worry about the details of how that data
    is represented within a file.
    
    Kenton McHenry, Ph.D.
    Principal Research Scientist, Adjunct Assistant Professor of Computer Science
    Deputy Director of the Scientific Software & Applications Division
    National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
    
    On Jul 24, 2017, at 10:30 AM, Steve Lawrence <stephen.d.lawre...@gmail.com> wrote:
    
    I'll preface this saying that I don't have a ton of experience with
    Apache Tika. But based on my understanding, Tika and Daffodil do have
    somewhat similar goals, but reach them in different ways. For example,
    Tika requires that one writes /code/ to perform data extraction, usually
    relying on existing Java libraries to extract the desired metadata. The
    downside to this is that code can be buggy, and libraries might not even
    exist for formats of interest (especially common with legacy and
    military data).
    
    Daffodil, on the other hand, does not require one to write any code.
    Instead, one writes a DFDL Schema (similar to XML Schema, with DFDL
    annotations) that fully describes the data, which Daffodil then uses to
    convert the data to XML/JSON for extraction. So adding support for a new
    format means writing a new schema rather than new code. And less code
    generally means fewer bugs. Also, for secure systems that require
    certification, generally speaking, it is easier to certify a schema as
    compared to code.
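
To make the schema-versus-code distinction concrete, here is a minimal Python sketch. It is not DFDL and does not use Daffodil: the `RECORD_FORMAT` description, the field names, the byte order, and the `parse_record` helper are all hypothetical and chosen only for illustration. The idea it shows is that the format lives in a data description, and a single generic routine turns bytes into a JSON-friendly structure.

```python
import json
import struct

# Hypothetical declarative description of a fixed-format binary record.
# Each entry is (field name, struct format code). This is NOT DFDL; it only
# illustrates keeping the format in data rather than in parsing code.
RECORD_FORMAT = [
    ("version", "B"),     # 1-byte unsigned integer
    ("length", "H"),      # 2-byte unsigned integer
    ("timestamp", "I"),   # 4-byte unsigned integer
]
BYTE_ORDER = ">"  # big-endian, an assumption for this example


def parse_record(data, fields=RECORD_FORMAT, byte_order=BYTE_ORDER):
    """Generic parser driven entirely by the field description."""
    layout = struct.Struct(byte_order + "".join(code for _, code in fields))
    if len(data) != layout.size:
        raise ValueError(f"expected {layout.size} bytes, got {len(data)}")
    values = layout.unpack(data)
    return {name: value for (name, _), value in zip(fields, values)}


raw = bytes([0x01, 0x00, 0x10, 0x59, 0x76, 0x1A, 0x80])
infoset = parse_record(raw)
print(json.dumps(infoset))
```

Supporting a different record layout in this sketch means editing `RECORD_FORMAT`, not `parse_record`, which mirrors the argument above at a much smaller scale.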
    
    We certainly don't believe that Daffodil could replace Tika, but it does
    have the potential to add new functionality to Tika for formats that do
    not have existing libraries. One of our goals is to look into
    integrating Daffodil support into tools like Tika. We'd love to hear
    from Tika devs if this is something they'd be interested in.
    
    I'll also add that whereas Tika tends to focus primarily on metadata,
    DFDL schemas usually describe an entire file format down to the byte, so
    one can extract more than just metadata, including text and binary
    data. As a further difference, Daffodil supports serializing data
    (called unparse) from the XML/JSON representation, allowing one to
    transform or filter data as well. We don't believe this feature is all
    that applicable to Tika, but it may be useful to other technologies such
    as data filtering or fuzzing tools.
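
The parse/unparse round trip and the filtering use case can be sketched in a few lines of Python. This is an illustration only and does not use Daffodil or DFDL: the `FIELDS` description, the field names, and the `parse`/`unparse` helpers are hypothetical. It shows one format description driving both directions, with a filter applied to the structured form in between.

```python
import struct

# Hypothetical field description shared by both directions (not DFDL).
FIELDS = [("msg_type", "B"), ("priority", "B"), ("payload_len", "H")]
LAYOUT = struct.Struct(">" + "".join(code for _, code in FIELDS))


def parse(data):
    """Bytes -> dict (a stand-in for an infoset)."""
    return dict(zip((name for name, _ in FIELDS), LAYOUT.unpack(data)))


def unparse(infoset):
    """Dict -> bytes, using the same description in reverse."""
    return LAYOUT.pack(*(infoset[name] for name, _ in FIELDS))


original = bytes([0x02, 0x09, 0x00, 0x20])
infoset = parse(original)

# Round trip: unparsing an unmodified infoset reproduces the input bytes.
assert unparse(infoset) == original

# Filter: clamp a field in the structured form, then serialize again.
infoset["priority"] = min(infoset["priority"], 5)
filtered = unparse(infoset)
print(filtered.hex())
```

A fuzzing tool could follow the same pattern, mutating values in the structured form and unparsing them so the output stays well-formed at the byte level.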
    
    - Steve
    
    
    On 07/24/2017 10:59 AM, Mike Drob wrote:
    What is the relationship between Daffodil and something like Apache Tika's
    extraction engine?
    
    On Mon, Jul 24, 2017 at 9:53 AM, Steve Lawrence <stephen.d.lawre...@gmail.com> wrote:
    
    Dear Apache Incubator Community,
    
    We would like to start a discussion around a proposal to bring Daffodil
    into the Apache Incubator. Daffodil is an implementation of the DFDL
    specification used to convert between fixed format data and XML/JSON.
    
    The draft proposal can be found in the wiki at the following URL:
    
    https://wiki.apache.org/incubator/DaffodilProposal
    
    We do not yet have a champion or mentors, but it was recommended that we
    create a proposal and send it to this list to potentially find those
    that might be interested. The text for the draft proposal is found
    below. We look forward to your input.
    
    Thanks,
    -Steve
    
    
    = Daffodil Proposal =
    
    == Abstract ==
    
    Daffodil is an implementation of the Data Format Description Language
    (DFDL) used to convert between fixed format data and XML/JSON.
    
    == Proposal ==
    
    The Data Format Description Language (DFDL) is a specification,
    developed by the Open Grid Forum, capable of describing many data
    formats, including both textual and binary, scientific and numeric,
    legacy and modern, commercial record-oriented, and many industry and
    military standards. It defines a language that is a subset of W3C XML
    schema to describe the logical format of the data, and annotations
    within the schema to describe the physical representation.
    
    Daffodil is an open source implementation of the DFDL specification that
    uses these DFDL schemas to parse fixed format data into an infoset,
    which is most commonly represented as either XML or JSON. This allows
    the use of well-established XML or JSON technologies and libraries to
    consume, inspect, and manipulate fixed format data in existing
    solutions. Daffodil is also capable of the reverse by serializing or
    "unparsing" an XML or JSON infoset back to the original data format.
    
    == Background ==
    
    Many different software solutions need to consume and manage data,
    including data directed routing, databases, data analysis, data
    cleansing, data visualizing, and more. A key aspect of such solutions is
    the need to transform the data into an easily consumable format.
    Usually, this means that for each unique data format, one develops a
    tool that can read and extract the necessary information, often leading
    to ad-hoc and data-format-specific description systems. Such systems are
    often proprietary, not well tested, and incompatible, leading to vendor
    lock-in, flawed software, and increased training costs. DFDL is a new
    standard, with version 1.0 completed in October of 2016, that solves
    these problems by defining an open standard to describe many different
    data formats and how to parse and unparse between the data and XML/JSON.
    
    Two closed source implementations of DFDL currently exist. The first was
    created by IBM and is now part of their IBM® Integration Bus product.
    The second was created by the European Space Agency, called DFDL4S or
    "DFDL for Space" targeted at the challenges of their satellite data
    processing.
    
    Around 2005, Pacific Northwest National Lab created Defuddle, built as
    an open source implementation and proof of concept of the draft DFDL
    specification and a test bed to feed new concepts into specification
    development. Primary development of Defuddle was eventually taken over
    by the National Center for Supercomputing Applications (NCSA). However,
    due to evolution of the DFDL specification and architectural and
    performance issues with Defuddle, around 2009, NCSA restarted the
    project with the new name of Daffodil, with a goal of implementing the
    complete DFDL specification. Daffodil development continued at NCSA
    until around 2012, at which point development slowed due to budget
    limitations. Shortly thereafter, primary development was picked up by
    Tresys Technology where it continues today, with contributions from
    other entities such as the Navy Research Lab, the Air Force Research
    Lab, MITRE, and Booz Allen Hamilton. In February of 2015, Daffodil
    version 1.0.0 was released, including support for the DFDL features
    needed to parse many common file formats. Daffodil version 2.0.0 is
    expected to be released in August of 2017 and will include unparse
    support at one-to-one feature parity with parsing.
    
    Entities including IBM, MITRE, NATO NCI Agency, Northrop-Grumman, Quark
    Security, Raytheon, and Tresys Technology have developed DFDL schemas
    for many data formats from varying technology domains, including PNG,
    GIF, BMP, PCAP, HL7, EDIFACT, NACHA, vCard, iCalendar, and MIL-STD-2045,
    many of which are publicly available on the DFDL Schemas GitHub. There
    are also a number of military-application data formats, the
    specifications of which are not public, which have historically been
    very difficult and expensive to process, and for which DFDL schemas have
    been created or are actively in development; these include
    MIL-STD-6040/USMTF ATO, MIL-STD-6017/VMF, and MIL-STD-6016/NATO STANAG
    5516 (aka "Link16").
    
    == Rationale ==
    
    Numerous software solutions exist that consume, inspect, analyze, and
    transform data, many of which can be found in the Apache Software
    Foundation (ASF). In order for tools like these to consume new types of
    data, custom extensions are usually required, often with high
    development and testing costs. Daffodil fills a clear gap in many of
    these solutions, providing a simple and low cost way to transform data
    to XML or JSON, which many of these tools natively support already. With
    the upcoming 2.0.0 release, the Daffodil project will have achieved a
    level of functionality in both parse and unparse that, when integrated
    into existing solutions, could provide for a new method to quickly
    enable support for new data formats.
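
As a small illustration of why an XML or JSON representation is convenient for downstream tools, the Python sketch below consumes an XML infoset using only the standard library. The XML is a hand-written stand-in: its element names and values are hypothetical and were not produced by Daffodil.

```python
import json
import xml.etree.ElementTree as ET

# Hand-written stand-in for an XML infoset; element names are hypothetical
# and this document was not generated by Daffodil.
INFOSET_XML = """
<packets>
  <packet><src>10.0.0.1</src><dst>10.0.0.9</dst><length>60</length></packet>
  <packet><src>10.0.0.2</src><dst>10.0.0.9</dst><length>1500</length></packet>
</packets>
"""

root = ET.fromstring(INFOSET_XML)

# Inspect: select records with ordinary XML queries.
large = [p.findtext("src") for p in root.iter("packet")
         if int(p.findtext("length")) > 1000]

# Manipulate: re-express the same records as JSON for another tool.
records = [{child.tag: child.text for child in p} for p in root.iter("packet")]
print(large)
print(json.dumps(records[0]))
```

Nothing in this consumer knows how the original bytes were laid out; it only depends on the structured representation.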
    
    == Initial Goals ==
    
    * Relicense the existing code from the University of Illinois/NCSA Open
    Source License to the Apache License version 2.0, working with Apache
    Legal to ensure correctness, and with Daffodil contributors to get
    their permission.
    * Move the existing codebase, documentation, bugs, and mailing lists to
    the Apache hosted infrastructure
    * Establish a formal release process and schedule, allowing for
    dependable release cycles in a manner consistent with the Apache
    development process.
    * Build relationships with ASF projects to add Daffodil support where
    appropriate
    * Grow the community to establish a diversity of background and expertise.
    
    == Current Status ==
    
    === Meritocracy ===
    
    All initial committers are familiar with the principles of meritocracy.
    The Daffodil project has followed the model of meritocracy in the past,
    providing multiple outside entities commit access based on the quality
    of their contributions. In order to grow the Daffodil user base and
    development community, we are dedicated to continuing to operate
    Daffodil as a meritocracy.
    
    A key ingredient in a meritocracy of developers is open group code
    review. The Daffodil project has operated in this mode throughout its
    existence and this provides a forum to improve the code, verify code
    quality, and educate new developers on the code base.
    
    === Community ===
    
    Daffodil has a small community of users and developers. Although primary
    Daffodil development is done by Tresys Technology, a handful of other
    contributions have come from other entities including the Navy Research
    Lab, the Air Force Research Lab, MITRE, and Booz Allen Hamilton. In
    addition to developers, multiple users of Daffodil have created DFDL
    schemas, including entities such as MITRE, IBM, Raytheon, Quark
    Security, and Tresys Technology. The DFDL Schemas GitHub community has
    been created as a place for DFDL schemas to be published. The Daffodil
    project also makes use of mailing lists, !HipChat, and Confluence
    Questions to build a community of users and system for support.
    
    === Core Developers ===
    
    The core developers of Daffodil are employed by Tresys Technology. We
    will work to grow the community among a more diverse set of developers
    and industries.
    
    === Alignment ===
    
    Daffodil was created as an open source project with a philosophy
    consistent with The Apache Way. A strong belief in meritocracy,
    community involvement in decisions, openness, and ensuring a high level
    of quality in code, documentation, and testing are some of our shared
    core beliefs.
    
    Further, as mentioned in the Rationale section, Daffodil fills a gap
    that exists in many ASF projects, including !NiFi, Spark, Storm, Hadoop,
    Tika, and others. In order for tools like these to consume new types of
    data, custom extensions are usually required. Rather than create such
    extensions, Daffodil provides an easy and standards-compliant way to
    transform data to XML or JSON, which many of these tools already
    natively support.
    
    == Known Risks ==
    
    === Orphaned Products ===
    
    The current core developers are the leading contributors in the space of
    DFDL and wish to see it flourish. There is some risk in that the initial
    committers all come from the same company, but a goal of entering
    incubation is to grow the development community and so minimize
    reliance on a single company.
    
    === Inexperience with Open Source ===
    
    The Daffodil project began as an open source project and has continued
    that model throughout development. This includes public bug tracking,
    git revision control, automated builds and tests, and a public wiki for
    documentation.
    
    Additionally, the current core developers and initial committers all
    work for a company that relies on, believes in, promotes, and has led or
    contributed to many open source software projects, including SELinux
    Userspace, OpenSCAP, CLIP, refpolicy, setools, RPM, and others. As such,
    there is low risk related to inexperience with open source software and
    processes.
    
    === Homogeneous Developers ===
    
    The proposed initial committers come from a single entity, though we are
    committed to growing the Daffodil development community to include a
    broad group of additional committers from a wide array of industries.
    
    === Reliance on Salaried Developers ===
    
    The proposed initial committers are paid by their employer to contribute
    to the Daffodil project. We expect that Daffodil development will
    continue with salaried developers, and are committed to growing the
    community to include non-salaried developers as well.
    
    === Relationship with other Apache Projects ===
    
    As mentioned in the Alignment section, Daffodil fills a clear gap in
    numerous other ASF projects that consume and manage large amounts of data.
    
    As a specific example, Daffodil developers have created a Daffodil
    Apache !NiFi Processor, currently in use in data transfer solutions,
    which allows one to ingest non-native data into an Apache !NiFi pipeline
    as XML or JSON. This processor was well received by the Apache !NiFi
    developers, with positive comments about the concise API and how it
    could handle non-native data. Daffodil developers have also successfully
    prototyped integration with Apache Spark. We believe Daffodil could
    provide a strong benefit to many other ASF projects that handle fixed
    format data. We anticipate working closely with such ASF projects to
    include Daffodil where applicable to increase their ability to support
    new data formats with minimal effort.
    
    Daffodil also depends on existing ASF projects, including Apache Commons
    and Apache Xerces.
    
    === An Excessive Fascination with the Apache Brand ===
    
    Although the Apache brand may certainly help to attract more
    contributors, publicity is not the reason for this proposal. We believe
    Daffodil could provide a great benefit to the ASF and the numerous data
    focused projects that comprise it, as described in the Rationale and
    Alignment sections. We hope to build a strong and vibrant community
    built around The Apache Way, and not dependent on a single company.
    
    === Documentation ===
    
    Daffodil documentation can be found at:
    
    * https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Daffodil%3A+Open+Source+DFDL
    
    Information about DFDL can be found at:
    
    * https://www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
    * https://www.ibm.com/support/knowledgecenter/en/SSMKHH_9.0.0/com.ibm.etools.mft.doc/df20060_.htm
    
    Public examples of DFDL Schemas can be found at:
    
    * https://github.com/DFDLSchemas
    
    == Initial Source ==
    
    The Daffodil git repo goes back to mid-2011 with approximately 20
    different contributors and feedback from many users and developers. The
    core codebase is written in Scala and includes both a Scala and Java
    API, along with Javadocs and Scaladocs for API usage. The initial code
    will come from the git repository currently hosted by NCSA at the
    University of Illinois:
    
    https://opensource.ncsa.illinois.edu/bitbucket/projects/DFDL/repos/daffodil/
    
    == Source and Intellectual Property Submission ==
    
    The complete Daffodil code is licensed under the University of
    Illinois/NCSA Open Source License. Much of the current codebase has been
    developed by Tresys Technology, which is open to relicensing the code
    under the Apache License version 2.0 and donating the source to the ASF.
    Contacts at NCSA are also open to relicensing their contributions to
    Apache v2. We plan to contact the other contributors and ask for
    permission to relicense and donate their contributed code. Code from
    contributors who decline or cannot be reached will be removed or
    replaced. We will work closely with Apache Legal to ensure all issues
    related to relicensing are resolved.
    
    == External Dependencies ==
    
    We believe all current dependencies are compatible with the ASF
    guidelines. Our dependency licenses come from the following license
    styles: Apache v2, BSD, MIT, and ICU. The list of current Daffodil
    dependencies and their licenses is documented here:
    
    https://opensource.ncsa.illinois.edu/confluence/display/DFDL/Dependencies+and+Licenses
    
    == Cryptography ==
    
    None
    
    == Required Resources ==
    
    === Mailing Lists ===
    
    * comm...@daffodil.incubator.apache.org
    * d...@daffodil.incubator.apache.org
    * priv...@daffodil.incubator.apache.org
    * u...@daffodil.incubator.apache.org
    
    === Source Control ===
    
    git://git.apache.org/incubator-daffodil.git
    
    === Issue Tracking ===
    
    JIRA Daffodil (DFDL)
    
    === Initial Committers ===
    
    * Beth Finnegan <efinnegan at tresys dot com>
    * Dave Thompson <dthompson at tresys dot com>
    * Josh Adams <jadams at tresys dot com>
    * Mike Beckerle <mbeckerle at tresys dot com>
    * Steve Lawrence <slawrence at tresys dot com>
    * Taylor Wise <twise at tresys dot com>
    
    === Affiliations ===
    
    * Beth Finnegan (Tresys Technology)
    * Dave Thompson (Tresys Technology)
    * Josh Adams (Tresys Technology)
    * Mike Beckerle (Tresys Technology)
    * Steve Lawrence (Tresys Technology)
    * Taylor Wise (Tresys Technology)
    
    == Sponsors ==
    
    === Champion ===
    
    * TBD
    
    === Nominated Mentors ===
    
    * TBD
    
    === Sponsoring Entity ===
    
    We request the Apache Incubator to sponsor this project.
    
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
    For additional commands, e-mail: general-h...@incubator.apache.org
    
    
    
    
    