Nice stuff! Should we set up a meeting to talk more in depth about this, as we're about 2 weeks out from the Hackathon right now?
Cheers,

Deb

--
deb tankersley
Program Manager, Engineering
Wikimedia Foundation

On Wed, May 2, 2018 at 8:39 AM, Trey Jones <[email protected]> wrote:

> I've got my own list of more language-focused not-necessarily-great ideas,
> in order of my current desire to work on them:
>
>    - Mirandese (mwl) analysis plugin built from Portuguese and French
>    parts, plus a stop list provided by an mwl editor
>    - plugin to merge high surrogates and low surrogates that get split up
>    by the Chinese analyzer
>    - plugin to do automatic homoglyph corrections
>    - plugin to do transliteration for languages where it is relatively
>    easy (Serbian was on the list, but it’s already done!—and for very simple
>    mappings this is just a char map)
>    - look into ways of automatically generating a stemmer from Wiktionary
>    conjugation/declension data (maybe start with Estonian?)
>    - compare the analyzers for the top 5-10 wiki languages by volume, and
>    look for ways to increase consistency among them
>    - develop a different statistical approach to detect wrong keyboard
>    typing and build a search-only filter to generate alternative tokens—for
>    Russian/English, Hebrew/English, OR one hand on wrong home row
>    - update RelForge with some additional metrics I’ve been collecting
>    - project Wordnet or other thesaurus/ontology onto short strings
>    (e.g., Commons descriptions, Wikipedia titles, etc.) to determine useful
>    thesaurus terms and prune the rest
>    - recheck differences in unpacked vs monolithic analyzers (eliminating
>    our automatic upgrades, which 98% likely to have caused the diffs)
>    - “Bollywood detector”—identify and map Bollywood movie names into
>    multiple scripts
>
> I was planning to work on the Mirandese analysis plugin and maybe one of
> the next three on the list. But if anyone wants to collaborate on any of
> the others, I'm happy to do so.
>
> Trey Jones
> Sr. Software Engineer, Search Platform
> Wikimedia Foundation
>
> On Tue, May 1, 2018 at 6:14 PM, Erik Bernhardson <[email protected]> wrote:
>
>> With the hackathon coming up I thought we could ponder what could be done
>> while there. I've been constructing a list of horrible ideas over the last
>> couple weeks:
>>
>>
> _______________________________________________
> Discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
