Nice stuff!

Should we set up a meeting to talk more in depth about this, as we're about
2 weeks out from the Hackathon right now?

Cheers,

Deb

--

deb tankersley

Program Manager, Engineering

Wikimedia Foundation

On Wed, May 2, 2018 at 8:39 AM, Trey Jones <[email protected]> wrote:

> I've got my own list of more language-focused not-necessarily-great ideas,
> in order of my current desire to work on them:
>
>    - Mirandese (mwl) analysis plugin built from Portuguese and French
>    parts, plus a stop list provided by an mwl editor
>    - plugin to merge high surrogates and low surrogates that get split up
>    by the Chinese analyzer
>    - plugin to do automatic homoglyph corrections
>    - plugin to do transliteration for languages where it is relatively
>    easy (Serbian was on the list, but it’s already done!—and for very simple
>    mappings this is just a char map)
>    - look into ways of automatically generating a stemmer from Wiktionary
>    conjugation/declension data (maybe start with Estonian?)
>    - compare the analyzers for the top 5-10 wiki languages by volume, and
>    look for ways to increase consistency among them
>    - develop a different statistical approach to detect wrong keyboard
>    typing and build a search-only filter to generate alternative tokens—for
>    Russian/English, Hebrew/English, OR one hand on wrong home row
>    - update RelForge with some additional metrics I’ve been collecting
>    - project Wordnet or other thesaurus/ontology onto short strings
>    (e.g., Commons descriptions, Wikipedia titles, etc.) to determine useful
>    thesaurus terms and prune the rest
>    - recheck differences in unpacked vs monolithic analyzers (eliminating
>    our automatic upgrades, which are 98% likely to have caused the diffs)
>    - “Bollywood detector”—identify and map Bollywood movie names into
>    multiple scripts
>
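
For the transliteration idea in the list above, the "just a char map" case can be sketched in a few lines of Python. This is only an illustration under assumptions: Serbian is used because it is the example named in the list (and is already done), the table covers lowercase letters only, and the names CYRILLIC_TO_LATIN and transliterate are hypothetical, not taken from any existing plugin.

```python
# Hypothetical lowercase-only Serbian Cyrillic -> Latin table.
# Most letters map one-to-one; a few expand to two Latin letters.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# str.maketrans accepts a dict whose keys are single characters and
# whose values are replacement strings of any length.
_TABLE = str.maketrans(CYRILLIC_TO_LATIN)


def transliterate(text: str) -> str:
    """Replace each mapped character; leave everything else untouched."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(transliterate("београд"))
    print(transliterate("љубав"))
```

Inside a search analysis chain the same table would more likely be expressed as a character-mapping filter than as application code; the sketch only shows how small the simple case is.
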
> I was planning to work on the Mirandese analysis plugin and maybe one of
> the next three on the list. But if anyone wants to collaborate on any of
> the others, I'm happy to do so.
>
> Trey Jones
> Sr. Software Engineer, Search Platform
> Wikimedia Foundation
>
> On Tue, May 1, 2018 at 6:14 PM, Erik Bernhardson <
> [email protected]> wrote:
>
>> With the hackathon coming up I thought we could ponder what could be done
>> while there. I've been constructing a list of horrible ideas over the last
>> couple weeks:
>>
>>
> _______________________________________________
> Discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>