Thanks JB, no problem. You are welcome to join; I will call a VOTE in a few days, so please add yourself before then. Cheers.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Jean-Baptiste Onofré <j...@nanthrax.net>
Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
Date: Tuesday, January 19, 2016 at 1:46 AM
To: "general@incubator.apache.org" <general@incubator.apache.org>
Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit

>I would be honoured. However, as I'm champion on other upcoming proposals,
>and to keep a good level of help, I prefer to wait a couple of days to see
>if others jump in. If you need an additional mentor, please let me know.
>
>Thanks Chris!
>Regards
>JB
>
>On 01/19/2016 08:11 AM, Mattmann, Chris A (3980) wrote:
>> Thanks JB - if you are interested in mentoring would appreciate
>> the help.
>>
>> -----Original Message-----
>> From: Jean-Baptiste Onofré <j...@nanthrax.net>
>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Date: Monday, January 18, 2016 at 11:01 PM
>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>> Translation Toolkit
>>
>>> Hi Chris,
>>>
>>> it looks interesting. I'm looking forward to the vote.
>>>
>>> Regards
>>> JB
>>>
>>> On 01/13/2016 07:56 AM, Mattmann, Chris A (3980) wrote:
>>>> Hi Everyone,
>>>>
>>>> Please find attached for your viewing pleasure a proposed new project,
>>>> Apache Joshua, a statistical machine translation toolkit. The proposal
>>>> is in wiki draft form at:
>>>> https://wiki.apache.org/incubator/JoshuaProposal
>>>>
>>>> Proposal text is copied below. I’ll leave the discussion open for a
>>>> week and we are interested in folks who would like to be initial
>>>> committers and mentors. Please discuss here on the thread.
>>>>
>>>> Thanks!
>>>>
>>>> Cheers,
>>>> Chris (Champion)
>>>>
>>>> ---
>>>>
>>>> = Joshua Proposal =
>>>>
>>>> == Abstract ==
>>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>>>> translation toolkit.
>>>> It includes a Java-based decoder for translating with phrase-based,
>>>> hierarchical, and syntax-based translation models, a Hadoop-based
>>>> grammar extractor (Thrax), and an extensive set of tools and scripts
>>>> for training and evaluating new models from parallel text.
>>>>
>>>> == Proposal ==
>>>> Joshua is a state of the art statistical machine translation system
>>>> that provides a number of features:
>>>>
>>>> * Support for the two main paradigms in statistical machine
>>>>   translation: phrase-based and hierarchical / syntactic.
>>>> * A sparse feature API that makes it easy to add new feature
>>>>   templates supporting millions of features
>>>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>>>> * Support for lattice decoding, allowing upstream NLP tools to expose
>>>>   their hypothesis space to the MT system
>>>> * An efficient representation for models, allowing for quick loading
>>>>   of multi-gigabyte model files
>>>> * Fast decoding speed (on par with Moses and mtplz)
>>>> * Language packs: precompiled models that allow the decoder to be
>>>>   run as a black box
>>>> * Thrax, a Hadoop-based tool for learning translation models from
>>>>   parallel text
>>>> * A suite of tools for constructing new models for any language pair
>>>>   for which sufficient training data exists
>>>>
>>>> == Background and Rationale ==
>>>> A number of factors make this a good time for an Apache project
>>>> focused on machine translation (MT): the quality of MT output (for
>>>> many language pairs); the average computing resources available on
>>>> computers, relative to the needs of MT systems; and the availability
>>>> of a number of high-quality toolkits, together with a large base of
>>>> researchers working on them.
>>>>
>>>> Over the past decade, machine translation (MT; the automatic
>>>> translation of one human language to another) has become a reality.
>>>> The research into statistical approaches to translation that began in
>>>> the early nineties, together with the availability of large amounts
>>>> of training data and better computing infrastructure, have all come
>>>> together to produce translation results that are “good enough” for a
>>>> large set of language pairs and use cases. Free services like
>>>> [[https://www.bing.com/translator|Bing Translator]] and
>>>> [[https://translate.google.com|Google Translate]] have made machine
>>>> translation available to the average person through direct interfaces
>>>> and through tools like browser plugins, and sites across the world
>>>> with higher translation needs use them to translate their pages
>>>> automatically.
>>>>
>>>> MT does not require the infrastructure of large corporations in order
>>>> to produce feasible output. Machine translation can be
>>>> resource-intensive, but need not be prohibitively so. Disk and memory
>>>> usage are mostly a matter of model size, which for most language
>>>> pairs is a few gigabytes at most, at which size models can provide
>>>> coverage on the order of tens or even hundreds of thousands of words
>>>> in the input and output languages. The computational complexity of
>>>> the algorithms used to search for translations of new sentences is
>>>> typically linear in the number of words in the input sentence, making
>>>> it possible to run a translation engine on a personal computer.
>>>>
>>>> The research community has produced many different open source
>>>> translation projects for a range of programming languages and under a
>>>> variety of licenses. These projects include the core “decoder”, which
>>>> takes a model and uses it to translate new sentences between the
>>>> language pair the model was defined for.
>>>> They also typically include a large set of tools that enable new
>>>> models to be built from large sets of example translations
>>>> (“parallel data”) and monolingual texts. These toolkits are usually
>>>> built to support the agendas of the (largely) academic researchers
>>>> that build them: the repeated cycle of building new models, tuning
>>>> model parameters against development data, and evaluating them
>>>> against held-out test data, using standard metrics for testing the
>>>> quality of MT output.
>>>>
>>>> Together, these three factors (the quality of machine translation
>>>> output, the feasibility of translating on standard computers, and the
>>>> availability of tools to build models) make it reasonable for the end
>>>> users to use MT as a black-box service, and to run it on their
>>>> personal machine.
>>>>
>>>> These factors make it a good time for an organization with the status
>>>> of the Apache Foundation to host a machine translation project.
>>>>
>>>> == Current Status ==
>>>> Joshua was originally ported from David Chiang’s Python implementation
>>>> of Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>>>> University. The current version is maintained by Matt Post at Johns
>>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has
>>>> made many releases with a list of over 20 source code tags. The last
>>>> release of Joshua was 6.0.5 on November 5th, 2015.
>>>>
>>>> == Meritocracy ==
>>>> The current developers are familiar with meritocratic open source
>>>> development at Apache. Apache was chosen specifically because we want
>>>> to encourage this style of development for the project.
>>>>
>>>> == Community ==
>>>> Joshua is used widely across the world. Perhaps its biggest (known)
>>>> research / industrial user is the Amazon research group in Berlin.
>>>> Another user is the US Army Research Lab.
>>>> No formal census has been undertaken, but posts to the Joshua
>>>> technical support mailing list, along with the occasional
>>>> contributions, suggest small research and academic communities spread
>>>> across the world, many of them in India.
>>>>
>>>> During incubation, we will explicitly seek to increase our usage
>>>> across the board, including academic research, industry, and other
>>>> end users interested in statistical machine translation.
>>>>
>>>> == Core Developers ==
>>>> The current set of core developers is fairly small, having fallen
>>>> with the graduation from Johns Hopkins of some core student
>>>> participants. However, Joshua is used fairly widely, as mentioned
>>>> above, and there remains a commitment from the principal researcher
>>>> at Johns Hopkins to continue to use and develop it. Joshua has seen a
>>>> number of new community members become interested recently due to its
>>>> potential use in a number of ongoing DARPA projects such as XDATA and
>>>> Memex.
>>>>
>>>> == Alignment ==
>>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
>>>> rights reserved, and is licensed under the BSD 2-clause license. It
>>>> would of course be the intention to relicense this code under AL2.0,
>>>> which would permit expanded and increased use of the software within
>>>> Apache projects. There is currently an ongoing effort within the
>>>> Apache Tika community to utilize Joshua within Tika’s Translate API;
>>>> see [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>>>
>>>> == Known Risks ==
>>>>
>>>> === Orphaned products ===
>>>> At the moment, regular contributions are made by a single contributor,
>>>> the lead maintainer. He (Matt Post) plans to continue development for
>>>> the next few years, but it is still a single point of failure, since
>>>> the graduate students who worked on the project have moved on to
>>>> jobs, mostly in industry.
>>>> However, our goal is to mitigate that risk by growing the community
>>>> at Apache, starting at least with users and participants from NASA
>>>> JPL.
>>>>
>>>> === Inexperience with Open Source ===
>>>> The teams at both Johns Hopkins and NASA JPL have experience with
>>>> many OSS projects at Apache and elsewhere. We understand "how it
>>>> works" here at the foundation.
>>>>
>>>> == Relationships with Other Apache Products ==
>>>> Joshua has dependencies on Hadoop, and is also included as a plugin
>>>> in Apache Tika. We are also interested in coordinating with other
>>>> projects including Spark, and other projects needing MT services for
>>>> language translation.
>>>>
>>>> == Developers ==
>>>> Joshua only has one regular developer, who is employed by Johns
>>>> Hopkins University. NASA JPL (Mattmann and McGibbney) have been
>>>> contributing lately, including a Brew formula and other contributions
>>>> to the project through the DARPA XDATA and Memex programs.
>>>>
>>>> == Documentation ==
>>>> Documentation and publications related to Joshua can be found at
>>>> joshua-decoder.org. The source for the Joshua documentation is
>>>> currently hosted on GitHub at
>>>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>>>
>>>> == Initial Source ==
>>>> Current source resides at GitHub: github.com/joshua-decoder/joshua
>>>> (the main decoder and toolkit) and github.com/joshua-decoder/thrax
>>>> (the grammar extraction tool).
>>>>
>>>> == External Dependencies ==
>>>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache
>>>> 2.0) and KenLM (LGPL 2.1) are run-time decoder dependencies (one of
>>>> which is needed for translating sentences with pre-built models). The
>>>> rest are dependencies for the build system and pipeline, used for
>>>> constructing and training new models from parallel text.
>>>>
>>>> Apache projects:
>>>> * Ant
>>>> * Hadoop
>>>> * Commons
>>>> * Maven
>>>> * Ivy
>>>>
>>>> There are also a number of other open-source projects with various
>>>> licenses that the project depends on both dynamically (runtime), and
>>>> statically.
>>>>
>>>> === GNU GPL 2 ===
>>>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>>>
>>>> === LGPL 2.1 ===
>>>> * KenLM: github.com/kpu/kenlm
>>>>
>>>> === Apache 2.0 ===
>>>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>>>
>>>> === GNU GPL ===
>>>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>>>
>>>> == Required Resources ==
>>>> * Mailing Lists
>>>>   * priv...@joshua.incubator.apache.org
>>>>   * d...@joshua.incubator.apache.org
>>>>   * comm...@joshua.incubator.apache.org
>>>>
>>>> * Git Repos
>>>>   * https://git-wip-us.apache.org/repos/asf/joshua.git
>>>>
>>>> * Issue Tracking
>>>>   * JIRA Joshua (JOSHUA)
>>>>
>>>> * Continuous Integration
>>>>   * Jenkins builds on https://builds.apache.org/
>>>>
>>>> * Web
>>>>   * http://joshua.incubator.apache.org/
>>>>   * wiki at http://cwiki.apache.org
>>>>
>>>> == Initial Committers ==
>>>> The following is a list of the planned initial Apache committers (the
>>>> active subset of the committers for the current repository on GitHub).
>>>>
>>>> * Matt Post (p...@cs.jhu.edu)
>>>> * Lewis John McGibbney (lewi...@apache.org)
>>>> * Chris Mattmann (mattm...@apache.org)
>>>>
>>>> == Affiliations ==
>>>>
>>>> * Johns Hopkins University
>>>>   * Matt Post
>>>>
>>>> * NASA JPL
>>>>   * Chris Mattmann
>>>>   * Lewis John McGibbney
>>>>
>>>> == Sponsors ==
>>>> === Champion ===
>>>> * Chris Mattmann (NASA/JPL)
>>>>
>>>> === Nominated Mentors ===
>>>> * Paul Ramirez
>>>> * Lewis John McGibbney
>>>> * Chris Mattmann
>>>>
>>>> == Sponsoring Entity ==
>>>> The Apache Incubator
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org