External is good news. I'm not sure how much leeway there is in the following quote from [1], but what percentage of your users are currently using an all-ASF-compatible set of projects?

The question to ask yourself in this situation is:
* "Will the majority of users want to use my product without adding the optional components?"

-Alex

[1] http://www.apache.org/legal/resolved.html

On 1/20/16, 7:17 AM, "Matt Post" <p...@cs.jhu.edu> wrote:

>The dependencies can be split into two kinds: ones required for building
>new models, and ones needed by the decoder to translate new sentences
>with a pre-built model (i.e., black-box translation with the language
>packs).
>
>1. For building new models, you need a way to align the words between
>sentences in parallel text. Both the aligners used by Joshua (GIZA++ and
>the Berkeley aligner) are GPL of some form. These can be implemented as
>external dependencies, or can be replaced with another aligner, like
>fast_align (https://github.com/clab/fast_align), which is
>Apache-licensed. There are many other options, in fact. So this should
>not be a worry.
>
>2. For doing black-box translation, one needs to represent the language
>model, which is very large. The best tool for this is KenLM
>(github.com/kpu/kenlm), which is LGPL 2.1. There is also BerkeleyLM,
>which is just as good for practical purposes and is Apache-licensed.
>KenLM is C++ and is loaded via the JNI, whereas BerkeleyLM is written in
>Java. I have moved to including BerkeleyLM in language packs, because I
>can then include the Joshua-runtime, and people can translate without
>even having to compile anything.
>
>So in short, there are no hard dependencies on unfavorably-licensed
>external projects.
>
>matt
>
>> On Jan 20, 2016, at 10:08 AM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:
>>
>> Hey Hen,
>>
>> Matt Post who I believe is monitoring this list and who has
>> been one of the key Joshua developers and I have discussed this
>> and we believe that potentially GPL/LGPL dependencies can:
>>
>> 1. be replaced with category-A or category-B alternatives. Matt
>> mentioned one already to me which has slipped my mind.
>> 2. be made in such a way that they are external tools and the
>> bindings exist in Joshua to call those external tools (aka runtime
>> deps akin to depending on a C compiler, etc.)
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattm...@nasa.gov
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> -----Original Message-----
>> From: Henri Yandell <bay...@apache.org>
>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Date: Tuesday, January 19, 2016 at 7:38 PM
>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>>
>>> License-wise, any expectation of problems from the GPL and LGPL
>>> dependencies?
>>>
>>> On Mon, Jan 18, 2016 at 9:58 PM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>
>>>> Great Hen, we’d love to have you on board as a mentor! Please
>>>> add yourself to the proposal on the wiki.
>>>>
>>>> Anyone else have interest in Machine Translation? Any OpenNLP folks,
>>>> Hadoop folks, Tika, or Lucene folks? CC’ing the dev lists for visibility
>>>> please feel free to reply to general@i.a.o.
>>>>
>>>> I’ll leave the DISCUSS thread open for a few more days.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> -----Original Message-----
>>>> From: Henri Yandell <bay...@apache.org>
>>>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>>>> Date: Monday, January 18, 2016 at 7:57 PM
>>>> To: jpluser <chris.a.mattm...@jpl.nasa.gov>, "general@incubator.apache.org" <general@incubator.apache.org>
>>>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>>>>
>>>>> Non-binding +1 to Joshua joining the Incubator. I'd be interested in
>>>>> mentoring.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: jpluser <chris.a.mattm...@jpl.nasa.gov>
>>>>>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>>>>>> Date: Tuesday, January 12, 2016 at 10:56 PM
>>>>>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>>>>>> Cc: "p...@cs.jhu.edu" <p...@cs.jhu.edu>
>>>>>> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> Please find attached for your viewing pleasure a proposed new project,
>>>>>>> Apache Joshua, a statistical machine translation toolkit. The proposal
>>>>>>> is in wiki draft form at:
>>>>>>> https://wiki.apache.org/incubator/JoshuaProposal
>>>>>>>
>>>>>>> Proposal text is copied below.
>>>>>>> I’ll leave the discussion open for a week
>>>>>>> and we are interested in folks who would like to be initial committers
>>>>>>> and mentors. Please discuss here on the thread.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Chris (Champion)
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> = Joshua Proposal =
>>>>>>>
>>>>>>> == Abstract ==
>>>>>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>>>>>>> translation toolkit. It includes a Java-based decoder for translating with
>>>>>>> phrase-based, hierarchical, and syntax-based translation models, a
>>>>>>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
>>>>>>> scripts for training and evaluating new models from parallel text.
>>>>>>>
>>>>>>> == Proposal ==
>>>>>>> Joshua is a state of the art statistical machine translation system that
>>>>>>> provides a number of features:
>>>>>>>
>>>>>>> * Support for the two main paradigms in statistical machine translation:
>>>>>>> phrase-based and hierarchical / syntactic.
>>>>>>> * A sparse feature API that makes it easy to add new feature templates
>>>>>>> supporting millions of features
>>>>>>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>>>>>>> * Support for lattice decoding, allowing upstream NLP tools to expose
>>>>>>> their hypothesis space to the MT system
>>>>>>> * An efficient representation for models, allowing for quick loading of
>>>>>>> multi-gigabyte model files
>>>>>>> * Fast decoding speed (on par with Moses and mtplz)
>>>>>>> * Language packs: precompiled models that allow the decoder to be run as
>>>>>>> a black box
>>>>>>> * Thrax, a Hadoop-based tool for learning translation models from
>>>>>>> parallel text
>>>>>>> * A suite of tools for constructing new models for any language pair for
>>>>>>> which sufficient training data exists
>>>>>>>
>>>>>>> == Background and Rationale ==
>>>>>>> A number of factors make this a good time for an Apache project focused on
>>>>>>> machine translation (MT): the quality of MT output (for many language
>>>>>>> pairs); the average computing resources available on computers, relative
>>>>>>> to the needs of MT systems; and the availability of a number of
>>>>>>> high-quality toolkits, together with a large base of researchers working
>>>>>>> on them.
>>>>>>>
>>>>>>> Over the past decade, machine translation (MT; the automatic translation
>>>>>>> of one human language to another) has become a reality. The research into
>>>>>>> statistical approaches to translation that began in the early nineties,
>>>>>>> together with the availability of large amounts of training data, and
>>>>>>> better computing infrastructure, have all come together to produce
>>>>>>> translation results that are “good enough” for a large set of language
>>>>>>> pairs and use cases.
>>>>>>> Free services like
>>>>>>> [[https://www.bing.com/translator|Bing Translator]] and
>>>>>>> [[https://translate.google.com|Google Translate]] have made these services
>>>>>>> available to the average person through direct interfaces and through
>>>>>>> tools like browser plugins, and sites across the world with higher
>>>>>>> translation needs use them to translate their pages automatically.
>>>>>>>
>>>>>>> MT does not require the infrastructure of large corporations in order to
>>>>>>> produce feasible output. Machine translation can be resource-intensive,
>>>>>>> but need not be prohibitively so. Disk and memory usage are mostly a
>>>>>>> matter of model size, which for most language pairs is a few gigabytes at
>>>>>>> most, at which size models can provide coverage on the order of tens or
>>>>>>> even hundreds of thousands of words in the input and output languages. The
>>>>>>> computational complexity of the algorithms used to search for translations
>>>>>>> of new sentences is typically linear in the number of words in the input
>>>>>>> sentence, making it possible to run a translation engine on a personal
>>>>>>> computer.
>>>>>>>
>>>>>>> The research community has produced many different open source translation
>>>>>>> projects for a range of programming languages and under a variety of
>>>>>>> licenses. These projects include the core “decoder”, which takes a model
>>>>>>> and uses it to translate new sentences between the language pair the model
>>>>>>> was defined for. They also typically include a large set of tools that
>>>>>>> enable new models to be built from large sets of example translations
>>>>>>> (“parallel data”) and monolingual texts.
>>>>>>> These toolkits are usually built
>>>>>>> to support the agendas of the (largely) academic researchers that build
>>>>>>> them: the repeated cycle of building new models, tuning model parameters
>>>>>>> against development data, and evaluating them against held-out test data,
>>>>>>> using standard metrics for testing the quality of MT output.
>>>>>>>
>>>>>>> Together, these three factors (the quality of machine translation output,
>>>>>>> the feasibility of translating on standard computers, and the availability
>>>>>>> of tools to build models) make it reasonable for the end users to use MT as
>>>>>>> a black-box service, and to run it on their personal machine.
>>>>>>>
>>>>>>> These factors make it a good time for an organization with the status of
>>>>>>> the Apache Foundation to host a machine translation project.
>>>>>>>
>>>>>>> == Current Status ==
>>>>>>> Joshua was originally ported from David Chiang’s Python implementation of
>>>>>>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>>>>>>> University. The current version is maintained by Matt Post at Johns
>>>>>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has made
>>>>>>> many releases with a list of over 20 source code tags. The last release of
>>>>>>> Joshua was 6.0.5 on November 5th, 2015.
>>>>>>>
>>>>>>> == Meritocracy ==
>>>>>>> The current developers are familiar with meritocratic open source
>>>>>>> development at Apache. Apache was chosen specifically because we want to
>>>>>>> encourage this style of development for the project.
>>>>>>>
>>>>>>> == Community ==
>>>>>>> Joshua is used widely across the world. Perhaps its biggest (known)
>>>>>>> research / industrial user is the Amazon research group in Berlin. Another
>>>>>>> user is the US Army Research Lab.
>>>>>>> No formal census has been undertaken,
>>>>>>> but posts to the Joshua technical support mailing list, along with the
>>>>>>> occasional contributions, suggest small research and academic communities
>>>>>>> spread across the world, many of them in India.
>>>>>>>
>>>>>>> During incubation, we will explicitly seek to increase our usage across
>>>>>>> the board, including academic research, industry, and other end users
>>>>>>> interested in statistical machine translation.
>>>>>>>
>>>>>>> == Core Developers ==
>>>>>>> The current set of core developers is fairly small, having fallen with the
>>>>>>> graduation from Johns Hopkins of some core student participants. However,
>>>>>>> Joshua is used fairly widely, as mentioned above, and there remains a
>>>>>>> commitment from the principal researcher at Johns Hopkins to continue to
>>>>>>> use and develop it. Joshua has seen a number of new community members
>>>>>>> become interested recently due to a potential for its projected use in a
>>>>>>> number of ongoing DARPA projects such as XDATA and Memex.
>>>>>>>
>>>>>>> == Alignment ==
>>>>>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University All
>>>>>>> rights reserved and licensed under BSD 2-clause license. It would of
>>>>>>> course be the intention to relicense this code under AL2.0 which would
>>>>>>> permit expanded and increased use of the software within Apache projects.
>>>>>>> There is currently an ongoing effort within the Apache Tika community to
>>>>>>> utilize Joshua within Tika’s Translate API, see
>>>>>>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>>>>>>
>>>>>>> == Known Risks ==
>>>>>>>
>>>>>>> === Orphaned products ===
>>>>>>> At the moment, regular contributions are made by a single contributor, the
>>>>>>> lead maintainer.
>>>>>>> He (Matt Post) plans to continue development for the next
>>>>>>> few years, but it is still a single point of failure, since the graduate
>>>>>>> students who worked on the project have moved on to jobs, mostly in
>>>>>>> industry. However, our goal is to help that process by growing the
>>>>>>> community in Apache, and at least in growing the community with users and
>>>>>>> participants from NASA JPL.
>>>>>>>
>>>>>>> === Inexperience with Open Source ===
>>>>>>> The team both at Johns Hopkins and NASA JPL have experience with many OSS
>>>>>>> software projects at Apache and elsewhere. We understand "how it works"
>>>>>>> here at the foundation.
>>>>>>>
>>>>>>> == Relationships with Other Apache Products ==
>>>>>>> Joshua includes dependencies on Hadoop, and also is included as a plugin in
>>>>>>> Apache Tika. We are also interested in coordinating with other projects
>>>>>>> including Spark, and other projects needing MT services for language
>>>>>>> translation.
>>>>>>>
>>>>>>> == Developers ==
>>>>>>> Joshua only has one regular developer who is employed by Johns Hopkins
>>>>>>> University. NASA JPL (Mattmann and McGibbney) have been contributing
>>>>>>> lately including a Brew formula and other contributions to the project
>>>>>>> through the DARPA XDATA and Memex programs.
>>>>>>>
>>>>>>> == Documentation ==
>>>>>>> Documentation and publications related to Joshua can be found at
>>>>>>> joshua-decoder.org. The source for the Joshua documentation is currently
>>>>>>> hosted on Github at
>>>>>>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>>>>>>
>>>>>>> == Initial Source ==
>>>>>>> Current source resides at Github: github.com/joshua-decoder/joshua (the
>>>>>>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
>>>>>>> extraction tool).
>>>>>>>
>>>>>>> == External Dependencies ==
>>>>>>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
>>>>>>> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
>>>>>>> needed for translating sentences with pre-built models). The rest are
>>>>>>> dependencies for the build system and pipeline, used for constructing and
>>>>>>> training new models from parallel text.
>>>>>>>
>>>>>>> Apache projects:
>>>>>>> * Ant
>>>>>>> * Hadoop
>>>>>>> * Commons
>>>>>>> * Maven
>>>>>>> * Ivy
>>>>>>>
>>>>>>> There are also a number of other open-source projects with various
>>>>>>> licenses that the project depends on both dynamically (runtime), and
>>>>>>> statically.
>>>>>>>
>>>>>>> === GNU GPL 2 ===
>>>>>>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>>>>>>
>>>>>>> === LGPL 2.1 ===
>>>>>>> * KenLM: github.com/kpu/kenlm
>>>>>>>
>>>>>>> === Apache 2.0 ===
>>>>>>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>>>>>>
>>>>>>> === GNU GPL ===
>>>>>>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>>>>>>
>>>>>>> == Required Resources ==
>>>>>>> * Mailing Lists
>>>>>>>   * priv...@joshua.incubator.apache.org
>>>>>>>   * d...@joshua.incubator.apache.org
>>>>>>>   * comm...@joshua.incubator.apache.org
>>>>>>>
>>>>>>> * Git Repos
>>>>>>>   * https://git-wip-us.apache.org/repos/asf/joshua.git
>>>>>>>
>>>>>>> * Issue Tracking
>>>>>>>   * JIRA Joshua (JOSHUA)
>>>>>>>
>>>>>>> * Continuous Integration
>>>>>>>   * Jenkins builds on https://builds.apache.org/
>>>>>>>
>>>>>>> * Web
>>>>>>>   * http://joshua.incubator.apache.org/
>>>>>>>   * wiki at http://cwiki.apache.org
>>>>>>>
>>>>>>> == Initial Committers ==
>>>>>>> The following is a list of the planned initial Apache committers (the
>>>>>>> active subset of the committers for the current repository on Github).
>>>>>>>
>>>>>>> * Matt Post (p...@cs.jhu.edu)
>>>>>>> * Lewis John McGibbney (lewi...@apache.org)
>>>>>>> * Chris Mattmann (mattm...@apache.org)
>>>>>>>
>>>>>>> == Affiliations ==
>>>>>>>
>>>>>>> * Johns Hopkins University
>>>>>>>   * Matt Post
>>>>>>>
>>>>>>> * NASA JPL
>>>>>>>   * Chris Mattmann
>>>>>>>   * Lewis John McGibbney
>>>>>>>
>>>>>>> == Sponsors ==
>>>>>>> === Champion ===
>>>>>>> * Chris Mattmann (NASA/JPL)
>>>>>>>
>>>>>>> === Nominated Mentors ===
>>>>>>> * Paul Ramirez
>>>>>>> * Lewis John McGibbney
>>>>>>> * Chris Mattmann
>>>>>>>
>>>>>>> == Sponsoring Entity ==
>>>>>>> The Apache Incubator
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>>> For additional commands, e-mail: general-h...@incubator.apache.org