Hey Hen,

Matt Post, who I believe is monitoring this list and who has been one of
the key Joshua developers, and I have discussed this. We believe the
GPL/LGPL dependencies can potentially be handled in one of two ways:

1. Replace them with category-A or category-B alternatives. Matt has
already mentioned one to me, but it has slipped my mind.

2. Treat them as external tools, with bindings in Joshua that call those
tools (i.e., runtime dependencies, akin to depending on a C compiler).

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Henri Yandell <bay...@apache.org>
Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
Date: Tuesday, January 19, 2016 at 7:38 PM
To: "general@incubator.apache.org" <general@incubator.apache.org>
Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit

>License-wise, any expectation of problems from the GPL and LGPL
>dependencies?
>
>On Mon, Jan 18, 2016 at 9:58 PM, Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Great, Hen, we’d love to have you on board as a mentor! Please
>> add yourself to the proposal on the wiki.
>>
>> Anyone else have interest in Machine Translation? Any OpenNLP folks,
>> Hadoop folks, Tika, or Lucene folks? CC’ing the dev lists for visibility;
>> please feel free to reply to general@i.a.o.
>>
>> I’ll leave the DISCUSS thread open for a few more days.
>>
>> Cheers,
>> Chris
>>
>> -----Original Message-----
>> From: Henri Yandell <bay...@apache.org>
>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Date: Monday, January 18, 2016 at 7:57 PM
>> To: jpluser <chris.a.mattm...@jpl.nasa.gov>,
>> "general@incubator.apache.org" <general@incubator.apache.org>
>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>> Translation Toolkit
>>
>> >Non-binding +1 to Joshua joining the Incubator. I'd be interested in
>> >mentoring.
>> >
>> >> -----Original Message-----
>> >> From: jpluser <chris.a.mattm...@jpl.nasa.gov>
>> >> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>> >> Date: Tuesday, January 12, 2016 at 10:56 PM
>> >> To: "general@incubator.apache.org" <general@incubator.apache.org>
>> >> Cc: "p...@cs.jhu.edu" <p...@cs.jhu.edu>
>> >> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>> >> Translation Toolkit
>> >>
>> >> >Hi Everyone,
>> >> >
>> >> >Please find attached for your viewing pleasure a proposed new project,
>> >> >Apache Joshua, a statistical machine translation toolkit. The proposal
>> >> >is in wiki draft form at:
>> >> >https://wiki.apache.org/incubator/JoshuaProposal
>> >> >
>> >> >Proposal text is copied below. I’ll leave the discussion open for a
>> >> >week, and we are interested in folks who would like to be initial
>> >> >committers and mentors. Please discuss here on the thread.
>> >> >
>> >> >Thanks!
>> >> >
>> >> >Cheers,
>> >> >Chris (Champion)
>> >> >
>> >> >-----
>> >> >
>> >> >= Joshua Proposal =
>> >> >
>> >> >== Abstract ==
>> >> >[[joshua-decoder.org|Joshua]] is an open-source statistical machine
>> >> >translation toolkit. It includes a Java-based decoder for translating
>> >> >with phrase-based, hierarchical, and syntax-based translation models, a
>> >> >Hadoop-based grammar extractor (Thrax), and an extensive set of tools
>> >> >and scripts for training and evaluating new models from parallel text.
>> >> >
>> >> >== Proposal ==
>> >> >Joshua is a state-of-the-art statistical machine translation system
>> >> >that provides a number of features:
>> >> >
>> >> > * Support for the two main paradigms in statistical machine translation: phrase-based and hierarchical / syntactic
>> >> > * A sparse feature API that makes it easy to add new feature templates supporting millions of features
>> >> > * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>> >> > * Support for lattice decoding, allowing upstream NLP tools to expose their hypothesis space to the MT system
>> >> > * An efficient representation for models, allowing for quick loading of multi-gigabyte model files
>> >> > * Fast decoding speed (on par with Moses and mtplz)
>> >> > * Language packs: precompiled models that allow the decoder to be run as a black box
>> >> > * Thrax, a Hadoop-based tool for learning translation models from parallel text
>> >> > * A suite of tools for constructing new models for any language pair for which sufficient training data exists
>> >> >
>> >> >== Background and Rationale ==
>> >> >A number of factors make this a good time for an Apache project focused
>> >> >on machine translation (MT): the quality of MT output (for many language
>> >> >pairs); the average computing resources available on computers, relative
>> >> >to the needs of MT systems; and the availability of a number of
>> >> >high-quality toolkits, together with a large base of researchers working
>> >> >on them.
>> >> >
>> >> >Over the past decade, machine translation (MT; the automatic translation
>> >> >of one human language to another) has become a reality. The research
>> >> >into statistical approaches to translation that began in the early
>> >> >nineties, the availability of large amounts of training data, and
>> >> >better computing infrastructure have all come together to produce
>> >> >translation results that are “good enough” for a large set of language
>> >> >pairs and use cases. Free services like
>> >> >[[https://www.bing.com/translator|Bing Translator]] and
>> >> >[[https://translate.google.com|Google Translate]] have made these
>> >> >services available to the average person through direct interfaces and
>> >> >through tools like browser plugins, and sites across the world with
>> >> >higher translation needs use them to translate their pages
>> >> >automatically.
>> >> >
>> >> >MT does not require the infrastructure of large corporations in order to
>> >> >produce feasible output. Machine translation can be resource-intensive,
>> >> >but need not be prohibitively so. Disk and memory usage are mostly a
>> >> >matter of model size, which for most language pairs is a few gigabytes
>> >> >at most; models of that size can provide coverage on the order of tens
>> >> >or even hundreds of thousands of words in the input and output
>> >> >languages. The computational complexity of the algorithms used to search
>> >> >for translations of new sentences is typically linear in the number of
>> >> >words in the input sentence, making it possible to run a translation
>> >> >engine on a personal computer.
>> >> >
>> >> >The research community has produced many different open source
>> >> >translation projects for a range of programming languages and under a
>> >> >variety of licenses. These projects include the core “decoder”, which
>> >> >takes a model and uses it to translate new sentences between the
>> >> >language pair the model was defined for. They also typically include a
>> >> >large set of tools that enable new models to be built from large sets of
>> >> >example translations (“parallel data”) and monolingual texts. These
>> >> >toolkits are usually built to support the agendas of the (largely)
>> >> >academic researchers that build them: the repeated cycle of building new
>> >> >models, tuning model parameters against development data, and evaluating
>> >> >them against held-out test data, using standard metrics for testing the
>> >> >quality of MT output.
>> >> >
>> >> >Together, these three factors (the quality of machine translation
>> >> >output, the feasibility of translating on standard computers, and the
>> >> >availability of tools to build models) make it reasonable for end users
>> >> >to use MT as a black-box service, and to run it on their personal
>> >> >machines.
>> >> >
>> >> >These factors make it a good time for an organization with the status of
>> >> >the Apache Software Foundation to host a machine translation project.
>> >> >
>> >> >== Current Status ==
>> >> >Joshua was originally ported from David Chiang’s Python implementation
>> >> >of Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>> >> >University. The current version is maintained by Matt Post at Johns
>> >> >Hopkins’ Human Language Technology Center of Excellence. Joshua has made
>> >> >many releases, with a list of over 20 source code tags. The last release
>> >> >of Joshua was 6.0.5 on November 5th, 2015.
>> >> >
>> >> >== Meritocracy ==
>> >> >The current developers are familiar with meritocratic open source
>> >> >development at Apache. Apache was chosen specifically because we want to
>> >> >encourage this style of development for the project.
>> >> >
>> >> >== Community ==
>> >> >Joshua is used widely across the world. Perhaps its biggest (known)
>> >> >research / industrial user is the Amazon research group in Berlin.
>> >> >Another user is the US Army Research Lab. No formal census has been
>> >> >undertaken, but posts to the Joshua technical support mailing list,
>> >> >along with the occasional contributions, suggest small research and
>> >> >academic communities spread across the world, many of them in India.
>> >> >
>> >> >During incubation, we will explicitly seek to increase our usage across
>> >> >the board, including academic research, industry, and other end users
>> >> >interested in statistical machine translation.
>> >> >
>> >> >== Core Developers ==
>> >> >The current set of core developers is fairly small, having shrunk with
>> >> >the graduation from Johns Hopkins of some core student participants.
>> >> >However, Joshua is used fairly widely, as mentioned above, and there
>> >> >remains a commitment from the principal researcher at Johns Hopkins to
>> >> >continue to use and develop it. Joshua has seen a number of new
>> >> >community members become interested recently due to its projected use
>> >> >in a number of ongoing DARPA projects such as XDATA and Memex.
>> >> >
>> >> >== Alignment ==
>> >> >Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
>> >> >rights reserved, and licensed under the BSD 2-clause license. It would
>> >> >of course be the intention to relicense this code under AL2.0, which
>> >> >would permit expanded and increased use of the software within Apache
>> >> >projects.
>> >> >There is currently an ongoing effort within the Apache Tika community to
>> >> >utilize Joshua within Tika’s Translate API; see
>> >> >[[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>> >> >
>> >> >== Known Risks ==
>> >> >
>> >> >=== Orphaned products ===
>> >> >At the moment, regular contributions are made by a single contributor,
>> >> >the lead maintainer. He (Matt Post) plans to continue development for
>> >> >the next few years, but it is still a single point of failure, since the
>> >> >graduate students who worked on the project have moved on to jobs,
>> >> >mostly in industry. However, our goal is to help that process by growing
>> >> >the community in Apache, and at least in growing the community with
>> >> >users and participants from NASA JPL.
>> >> >
>> >> >=== Inexperience with Open Source ===
>> >> >The teams at both Johns Hopkins and NASA JPL have experience with many
>> >> >OSS projects at Apache and elsewhere. We understand "how it works" here
>> >> >at the foundation.
>> >> >
>> >> >== Relationships with Other Apache Products ==
>> >> >Joshua includes dependencies on Hadoop, and is also included as a plugin
>> >> >in Apache Tika. We are also interested in coordinating with other
>> >> >projects, including Spark and other projects needing MT services for
>> >> >language translation.
>> >> >
>> >> >== Developers ==
>> >> >Joshua has only one regular developer, who is employed by Johns Hopkins
>> >> >University. NASA JPL (Mattmann and McGibbney) have been contributing
>> >> >lately, including a Brew formula and other contributions to the project
>> >> >through the DARPA XDATA and Memex programs.
>> >> >
>> >> >== Documentation ==
>> >> >Documentation and publications related to Joshua can be found at
>> >> >joshua-decoder.org.
>> >> >The source for the Joshua documentation is currently hosted on GitHub at
>> >> >https://github.com/joshua-decoder/joshua-decoder.github.com
>> >> >
>> >> >== Initial Source ==
>> >> >Current source resides at GitHub: github.com/joshua-decoder/joshua (the
>> >> >main decoder and toolkit) and github.com/joshua-decoder/thrax (the
>> >> >grammar extraction tool).
>> >> >
>> >> >== External Dependencies ==
>> >> >Joshua has a number of external dependencies. Only BerkeleyLM (Apache
>> >> >2.0) and KenLM (LGPL 2.1) are run-time decoder dependencies (one of
>> >> >which is needed for translating sentences with pre-built models). The
>> >> >rest are dependencies for the build system and pipeline, used for
>> >> >constructing and training new models from parallel text.
>> >> >
>> >> >Apache projects:
>> >> > * Ant
>> >> > * Hadoop
>> >> > * Commons
>> >> > * Maven
>> >> > * Ivy
>> >> >
>> >> >There are also a number of other open-source projects with various
>> >> >licenses that the project depends on, both dynamically (at runtime) and
>> >> >statically.
>> >> >
>> >> >=== GNU GPL 2 ===
>> >> > * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>> >> >
>> >> >=== LGPL 2.1 ===
>> >> > * KenLM: github.com/kpu/kenlm
>> >> >
>> >> >=== Apache 2.0 ===
>> >> > * BerkeleyLM: https://code.google.com/p/berkeleylm/
>> >> >
>> >> >=== GNU GPL ===
>> >> > * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>> >> >
>> >> >== Required Resources ==
>> >> > * Mailing Lists
>> >> >   * priv...@joshua.incubator.apache.org
>> >> >   * d...@joshua.incubator.apache.org
>> >> >   * comm...@joshua.incubator.apache.org
>> >> >
>> >> > * Git Repos
>> >> >   * https://git-wip-us.apache.org/repos/asf/joshua.git
>> >> >
>> >> > * Issue Tracking
>> >> >   * JIRA Joshua (JOSHUA)
>> >> >
>> >> > * Continuous Integration
>> >> >   * Jenkins builds on https://builds.apache.org/
>> >> >
>> >> > * Web
>> >> >   * http://joshua.incubator.apache.org/
>> >> >   * wiki at http://cwiki.apache.org
>> >> >
>> >> >== Initial Committers ==
>> >> >The following is a list of the planned initial Apache committers (the
>> >> >active subset of the committers for the current repository on GitHub).
>> >> >
>> >> > * Matt Post (p...@cs.jhu.edu)
>> >> > * Lewis John McGibbney (lewi...@apache.org)
>> >> > * Chris Mattmann (mattm...@apache.org)
>> >> >
>> >> >== Affiliations ==
>> >> >
>> >> > * Johns Hopkins University
>> >> >   * Matt Post
>> >> >
>> >> > * NASA JPL
>> >> >   * Chris Mattmann
>> >> >   * Lewis John McGibbney
>> >> >
>> >> >== Sponsors ==
>> >> >=== Champion ===
>> >> > * Chris Mattmann (NASA/JPL)
>> >> >
>> >> >=== Nominated Mentors ===
>> >> > * Paul Ramirez
>> >> > * Lewis John McGibbney
>> >> > * Chris Mattmann
>> >> >
>> >> >== Sponsoring Entity ==
>> >> >The Apache Incubator