OK, cool... Just thought the topic warranted some level of discussion ;) > On Feb 1, 2016, at 10:31 AM, Tom Barber <tom.bar...@meteorite.bi> wrote: > > Hello! I'm a code-aholic, you'll be getting regular commits from me. > > Regards, > > Tom > > On Mon, Feb 1, 2016 at 3:20 PM, Mattmann, Chris A (3980) < > chris.a.mattm...@jpl.nasa.gov> wrote: > >> Hey Jim, >> >> This is a valid concern, one that I hope is mediated by taking >> however long it takes in Incubation to attract some new committers >> to work on the project. Hopefully too you saw how long I took to >> allow the discussion to occur and so forth. >> >> Lewis has actively contributed to Joshua already - you can see - >> via the HomeBrew package he created, see: >> >> https://github.com/Homebrew/homebrew/pull/45746 >> >> >> You can see too it wasn’t something just recent or something >> super quick it’s something he had to work at. >> >> As for me, my involvement is going to be limited, but I am >> actively pursuing Tika’s integration with Joshua as part of >> TIKA-1343: http://issues.apache.org/jira/browse/TIKA-1343. >> >> Finally my suspicion is that Tom, Henry and Tommaso will >> contribute a lot as well. >> >> Thanks for listening. >> >> Cheers, >> Chris >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Chris Mattmann, Ph.D. >> Chief Architect >> Instrument Software and Science Data Systems Section (398) >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA >> Office: 168-519, Mailstop: 168-527 >> Email: chris.a.mattm...@nasa.gov >> WWW: http://sunset.usc.edu/~mattmann/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Adjunct Associate Professor, Computer Science Department >> University of Southern California, Los Angeles, CA 90089 USA >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> >> >> >> -----Original Message----- >> From: Jim Jagielski <j...@jagunet.com> >> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org> >> Date: Monday, February 1, 2016 at 4:20 AM >> To: "general@incubator.apache.org" <general@incubator.apache.org> >> Cc: "p...@cs.jhu.edu" <p...@cs.jhu.edu> >> Subject: Re: [VOTE] Accept Joshua as an Apache Incubator Podling >> >>> I know this is specifically called-out in the proposal, but it >>> does seem worthy of further discussion. >>> >>> This has a pretty small list of initial committers, esp when one considers >>> how over-booked 2 of them appear to be. >>> >>> So, realistically, how active do both Chris and Lewis expect >>> to be? >>> >>>> On Jan 30, 2016, at 3:00 PM, Mattmann, Chris A (3980) >>>> <chris.a.mattm...@jpl.nasa.gov> wrote: >>>> >>>> Hi Everyone, >>>> >>>> OK the discussion is now completed. Please VOTE to accept Joshua >>>> into the Apache Incubator. I’ll leave the VOTE open for at least >>>> the next 72 hours, with hopes to close it next Friday the 5th of >>>> February, 2016. >>>> >>>> [ ] +1 Accept Joshua as an Apache Incubator podling. >>>> [ ] +0 Abstain. >>>> [ ] -1 Don’t accept Joshua as an Apache Incubator podling because.. >>>> >>>> Of course, I am +1 on this. Please note VOTEs from Incubator PMC >>>> members are binding but all are welcome to VOTE! >>>> >>>> Cheers, >>>> Chris >>>> >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> Chris Mattmann, Ph.D. >>>> Chief Architect >>>> Instrument Software and Science Data Systems Section (398) >>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA >>>> Office: 168-519, Mailstop: 168-527 >>>> Email: chris.a.mattm...@nasa.gov >>>> WWW: http://sunset.usc.edu/~mattmann/ >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> Adjunct Associate Professor, Computer Science Department >>>> University of Southern California, Los Angeles, CA 90089 USA >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> >>>> >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: jpluser <chris.a.mattm...@jpl.nasa.gov> >>>> Date: Tuesday, January 12, 2016 at 10:56 PM >>>> To: "general@incubator.apache.org" <general@incubator.apache.org> >>>> Cc: "p...@cs.jhu.edu" <p...@cs.jhu.edu> >>>> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine >>>> Translation >>>> Toolkit >>>> >>>>> Hi Everyone, >>>>> >>>>> Please find attached for your viewing pleasure a proposed new project, >>>>> Apache Joshua, a statistical machine translation toolkit. The proposal >>>>> is in wiki draft form at: >>>>> https://wiki.apache.org/incubator/JoshuaProposal >>>>> >>>>> Proposal text is copied below. I’ll leave the discussion open for a >>>>> week >>>>> and we are interested in folks who would like to be initial committers >>>>> and mentors. Please discuss here on the thread. >>>>> >>>>> Thanks! >>>>> >>>>> Cheers, >>>>> Chris (Champion) >>>>> >>>>> ——— >>>>> >>>>> = Joshua Proposal = >>>>> >>>>> == Abstract == >>>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine >>>>> translation toolkit. It includes a Java-based decoder for translating >>>>> with >>>>> phrase-based, hierarchical, and syntax-based translation models, a >>>>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools >>>>> and >>>>> scripts for training and evaluating new models from parallel text. >>>>> >>>>> == Proposal == >>>>> Joshua is a state of the art statistical machine translation system >>>>> that >>>>> provides a number of features: >>>>> >>>>> * Support for the two main paradigms in statistical machine >>>>> translation: >>>>> phrase-based and hierarchical / syntactic. >>>>> * A sparse feature API that makes it easy to add new feature templates >>>>> supporting millions of features >>>>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad) >>>>> * Support for lattice decoding, allowing upstream NLP tools to expose >>>>> their hypothesis space to the MT system >>>>> * An efficient representation for models, allowing for quick loading of >>>>> multi-gigabyte model files >>>>> * Fast decoding speed (on par with Moses and mtplz) >>>>> * Language packs — precompiled models that allow the decoder to be run >>>>> as >>>>> a black box >>>>> * Thrax, a Hadoop-based tool for learning translation models from >>>>> parallel text >>>>> * A suite of tools for constructing new models for any language pair >>>>> for >>>>> which sufficient training data exists >>>>> >>>>> == Background and Rationale == >>>>> A number of factors make this a good time for an Apache project >>>>> focused on >>>>> machine translation (MT): the quality of MT output (for many language >>>>> pairs); the average computing resources available on computers, >>>>> relative >>>>> to the needs of MT systems; and the availability of a number of >>>>> high-quality toolkits, together with a large base of researchers >>>>> working >>>>> on them. >>>>> >>>>> Over the past decade, machine translation (MT; the automatic >>>>> translation >>>>> of one human language to another) has become a reality. The research >>>>> into >>>>> statistical approaches to translation that began in the early nineties, >>>>> together with the availability of large amounts of training data, and >>>>> better computing infrastructure, have all come together to produce >>>>> translations results that are “good enough” for a large set of language >>>>> pairs and use cases. Free services like >>>>> [[https://www.bing.com/translator|Bing Translator]] and >>>>> [[https://translate.google.com|Google Translate]] have made these >>>>> services >>>>> available to the average person through direct interfaces and through >>>>> tools like browser plugins, and sites across the world with higher >>>>> translation needs use them to translate their pages through >>>>> automatically. >>>>> >>>>> MT does not require the infrastructure of large corporations in order >>>>> to >>>>> produce feasible output. Machine translation can be resource-intensive, >>>>> but need not be prohibitively so. Disk and memory usage are mostly a >>>>> matter of model size, which for most language pairs is a few gigabytes >>>>> at >>>>> most, at which size models can provide coverage on the order of tens or >>>>> even hundreds of thousands of words in the input and output languages. >>>>> The >>>>> computational complexity of the algorithms used to search for >>>>> translations >>>>> of new sentences are typically linear in the number of words in the >>>>> input >>>>> sentence, making it possible to run a translation engine on a personal >>>>> computer. >>>>> >>>>> The research community has produced many different open source >>>>> translation >>>>> projects for a range of programming languages and under a variety of >>>>> licenses. These projects include the core “decoder”, which takes a >>>>> model >>>>> and uses it to translate new sentences between the language pair the >>>>> model >>>>> was defined for. They also typically include a large set of tools that >>>>> enable new models to be built from large sets of example translations >>>>> (“parallel data”) and monolingual texts. These toolkits are usually >>>>> built >>>>> to support the agendas of the (largely) academic researchers that build >>>>> them: the repeated cycle of building new models, tuning model >>>>> parameters >>>>> against development data, and evaluating them against held-out test >>>>> data, >>>>> using standard metrics for testing the quality of MT output. >>>>> >>>>> Together, these three factors—the quality of machine translation >>>>> output, >>>>> the feasibility of translating on standard computers, and the >>>>> availability >>>>> of tools to build models—make it reasonable for the end users to use >>>>> MT as >>>>> a black-box service, and to run it on their personal machine. >>>>> >>>>> These factors make it a good time for an organization with the status >>>>> of >>>>> the Apache Foundation to host a machine translation project. >>>>> >>>>> == Current Status == >>>>> Joshua was originally ported from David Chiang’s Python implementation >>>>> of >>>>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins >>>>> University. The current version is maintained by Matt Post at Johns >>>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has >>>>> made >>>>> many releases with a list of over 20 source code tags. The last >>>>> release of >>>>> Joshua was 6.0.5 on November 5th, 2015. >>>>> >>>>> == Meritocracy == >>>>> The current developers are familiar with meritocratic open source >>>>> development at Apache. Apache was chosen specifically because we want >>>>> to >>>>> encourage this style of development for the project. >>>>> >>>>> == Community == >>>>> Joshua is used widely across the world. Perhaps its biggest (known) >>>>> research / industrial user is the Amazon research group in Berlin. >>>>> Another >>>>> user is the US Army Research Lab. No formal census has been undertaken, >>>>> but posts to the Joshua technical support mailing list, along with the >>>>> occasional contributions, suggest small research and academic >>>>> communities >>>>> spread across the world, many of them in India. >>>>> >>>>> During incubation, we will explicitly seek to increase our usage across >>>>> the board, including academic research, industry, and other end users >>>>> interested in statistical machine translation. >>>>> >>>>> == Core Developers == >>>>> The current set of core developers is fairly small, having fallen with >>>>> the >>>>> graduation from Johns Hopkins of some core student participants. >>>>> However, >>>>> Joshua is used fairly widely, as mentioned above, and there remains a >>>>> commitment from the principal researcher at Johns Hopkins to continue >>>>> to >>>>> use and develop it. Joshua has seen a number of new community members >>>>> become interested recently due to a potential for its projected use in >>>>> a >>>>> number of ongoing DARPA projects such as XDATA and Memex. >>>>> >>>>> == Alignment == >>>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University All >>>>> rights reserved and licensed under BSD 2-clause license. It would of >>>>> course be the intention to relicense this code under AL2.0 which would >>>>> permit expanded and increased use of the software within Apache >>>>> projects. >>>>> There is currently an ongoing effort within the Apache Tika community >>>>> to >>>>> utilize Joshua within Tika’s Translate API, see >>>>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]]. >>>>> >>>>> == Known Risks == >>>>> >>>>> === Orphaned products === >>>>> At the moment, regular contributions are made by a single contributor, >>>>> the >>>>> lead maintainer. He (Matt Post) plans to continue development for the >>>>> next >>>>> few years, but it is still a single point of failure, since the >>>>> graduate >>>>> students who worked on the project have moved on to jobs, mostly in >>>>> industry. However, our goal is to help that process by growing the >>>>> community in Apache, and at least in growing the community with users >>>>> and >>>>> participants from NASA JPL. >>>>> >>>>> === Inexperience with Open Source === >>>>> The team both at Johns Hopkins and NASA JPL have experience with many >>>>> OSS >>>>> software projects at Apache and elsewhere. We understand "how it works" >>>>> here at the foundation. >>>>> >>>>> >>>>> == Relationships with Other Apache Products == >>>>> Joshua includes dependences on Hadoop, and also is included as a >>>>> plugin in >>>>> Apache Tika. We are also interested in coordinating with other projects >>>>> including Spark, and other projects needing MT services for language >>>>> translation. >>>>> >>>>> == Developers == >>>>> Joshua only has one regular developer who is employed by Johns Hopkins >>>>> University. NASA JPL (Mattmann and McGibbney) have been contributing >>>>> lately including a Brew formula and other contributions to the project >>>>> through the DARPA XDATA and Memex programs. >>>>> >>>>> == Documentation == >>>>> Documentation and publications related to Joshua can be found at >>>>> joshua-decoder.org. The source for the Joshua documentation is >>>>> currently >>>>> hosted on Github at >>>>> https://github.com/joshua-decoder/joshua-decoder.github.com >>>>> >>>>> == Initial Source == >>>>> Current source resides at Github: github.com/joshua-decoder/joshua >> (the >>>>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the >>>>> grammar >>>>> extraction tool). >>>>> >>>>> == External Dependencies == >>>>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache >>>>> 2.0) >>>>> and KenLM (LGPG 2.1) are run-time decoder dependencies (one of which is >>>>> needed for translating sentences with pre-built models). The rest are >>>>> dependencies for the build system and pipeline, used for constructing >>>>> and >>>>> training new models from parallel text. >>>>> >>>>> Apache projects: >>>>> * Ant >>>>> * Hadoop >>>>> * Commons >>>>> * Maven >>>>> * Ivy >>>>> >>>>> There are also a number of other open-source projects with various >>>>> licenses that the project depends on both dynamically (runtime), and >>>>> statically. >>>>> >>>>> === GNU GPL 2 === >>>>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/ >>>>> >>>>> === LGPG 2.1 === >>>>> * KenLM: github.com/kpu/kenlm >>>>> >>>>> === Apache 2.0 === >>>>> * BerkeleyLM: https://code.google.com/p/berkeleylm/ >>>>> >>>>> === GNU GPL === >>>>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html >>>>> >>>>> == Required Resources == >>>>> * Mailing Lists >>>>> * priv...@joshua.incubator.apache.org >>>>> * d...@joshua.incubator.apache.org >>>>> * comm...@joshua.incubator.apache.org >>>>> >>>>> * Git Repos >>>>> * https://git-wip-us.apache.org/repos/asf/joshua.git >>>>> >>>>> * Issue Tracking >>>>> * JIRA Joshua (JOSHUA) >>>>> >>>>> * Continuous Integration >>>>> * Jenkins builds on https://builds.apache.org/ >>>>> >>>>> * Web >>>>> * http://joshua.incubator.apache.org/ >>>>> * wiki at http://cwiki.apache.org >>>>> >>>>> == Initial Committers == >>>>> The following is a list of the planned initial Apache committers (the >>>>> active subset of the committers for the current repository on Github). >>>>> >>>>> * Matt Post (p...@cs.jhu.edu) >>>>> * Lewis John McGibbney (lewi...@apache.org) >>>>> * Chris Mattmann (mattm...@apache.org) >>>>> >>>>> == Affiliations == >>>>> >>>>> * Johns Hopkins University >>>>> * Matt Post >>>>> >>>>> * NASA JPL >>>>> * Chris Mattmann >>>>> * Lewis John McGibbney >>>>> >>>>> >>>>> == Sponsors == >>>>> === Champion === >>>>> * Chris Mattmann (NASA/JPL) >>>>> >>>>> === Nominated Mentors === >>>>> * Paul Ramirez >>>>> * Lewis John McGibbney >>>>> * Chris Mattmann >>>>> >>>>> == Sponsoring Entity == >>>>> The Apache Incubator >>>>> >>>>> >>>>> >>>>> >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> Chris Mattmann, Ph.D. >>>>> Chief Architect >>>>> Instrument Software and Science Data Systems Section (398) >>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA >>>>> Office: 168-519, Mailstop: 168-527 >>>>> Email: chris.a.mattm...@nasa.gov >>>>> WWW: http://sunset.usc.edu/~mattmann/ >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> Adjunct Associate Professor, Computer Science Department >>>>> University of Southern California, Los Angeles, CA 90089 USA >>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> >>>>> >>>>> >>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org >>>> For additional commands, e-mail: general-h...@incubator.apache.org >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org >>> For additional commands, e-mail: general-h...@incubator.apache.org >>> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org >> For additional commands, e-mail: general-h...@incubator.apache.org >>
--------------------------------------------------------------------- To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org