Trimming down my reply to certain topics...

On Wed, Nov 4, 2015 at 3:20 PM, Erik Bernhardson <[email protected]> wrote:

> On Tue, Nov 3, 2015 at 2:25 PM, Trey Jones <[email protected]> wrote:
>
>> There are several proposals for improving language detection in the
>> etherpad, and we can work on them in parallel
>>
>

> My worry here is we would then need to productionize it. Several of the
> options I see are basically libraries that we would have to build a
> service (or ES plugin) around. I do think we should investigate this and
> decide if the effort to productionize is worth the impact we are able to
> estimate in relevance lab.
>

Yep—I always had language detection and translation as use cases in mind
when thinking about the relevance lab. We can test a lot of stuff without
productionizing it, which means it's less work to try stuff out and we
don't have to commit early.



> We need training and evaluation data.
>
>
> This is probably the biggest sticking point. Another random idea: We have
> speakers of several languages on the team and in the foundation (as in,
> under NDA and able to review queries that are PII). Would it be enough to
> grab example queries from wikis of the correct language and then have
> someone who knows the language filter through them and delete nonsensical /
> wrong-language queries? I'm guessing this would go faster, but not sure
> it's as valuable.


This is a good idea if people are willing to do it, and it's faster and
easier if you have only two buckets ("this language" and "not this
language") because anything you don't recognize automatically goes into
"not this language". You don't have to be a great speaker of the language
to do a good job, either.

We also need to think about whether we want general language
identification, or if we want to tailor it per wiki for better results. At
the most coarse-grained level, think "Romanian" on enwiki vs. rowiki. But
there is also the matter of which languages actually appear in queries on
each wiki.
So, should we limit to the 10 most common non-English query languages on
enwiki? (So we can correctly say "your query is in X but didn't get results
on X wiki"?) Or the 10 most likely to get results on the right wiki? (So we
can give more results.) Limiting the scope limits the data we need to
collect, and increases precision (and probably recall) for enwiki, but the
resulting detector can't be used on other wikis (and probably can't be used
without modification on other wikis that are in English!), though the
training data can be reused.
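To make the tailoring idea concrete, here's a rough sketch of what "limit to the top N query languages per wiki" could look like: a general detector produces ranked (language, score) guesses, and each wiki only trusts its own short candidate list. Everything here (the candidate sets, the `pick_language` helper, the threshold) is hypothetical illustration, not an existing implementation:

```python
# Hypothetical per-wiki candidate sets: the most common (or most
# productive) non-local query languages on each wiki. These lists are
# made up for illustration.
PER_WIKI_CANDIDATES = {
    "enwiki": {"es", "pt", "zh", "ru", "de", "fr", "ja", "ar", "id", "vi"},
    "rowiki": {"en", "hu", "ru", "de", "fr"},
}

def pick_language(ranked_scores, wiki, threshold=0.5):
    """Return the best-scoring candidate language allowed on this wiki,
    or None if no allowed language clears the confidence threshold.
    `ranked_scores` stands in for the output of any general detector."""
    allowed = PER_WIKI_CANDIDATES.get(wiki, set())
    for lang, score in sorted(ranked_scores, key=lambda pair: -pair[1]):
        if lang in allowed and score >= threshold:
            return lang
    return None

# A detector that scores Romanian highest: enwiki falls back to the
# runner-up ("es"), while rowiki rejects both and returns None.
scores = [("ro", 0.9), ("es", 0.6)]
print(pick_language(scores, "enwiki"))  # es
print(pick_language(scores, "rowiki"))  # None
```

The point being: the same training data and the same underlying detector can be reused, and only the per-wiki candidate list (and maybe the threshold) changes.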

We should probably talk through this a bit more.


> I'm somewhat worried about being able to reduce the targeted zero results
>> rate by 10%. In my test, only 12% of non-DOI zero-results queries were "in
>> a language", and only about a third got results when searched in the
>> "correct" (human-determined) wiki. I didn't filter bots other than the DOI
>> bot, and some non-language queries (e.g., names) might get results in
>> another wiki, but there may not be enough wiggle room. There's a lot of
>> junk in other languages, too, but maybe filtering bots will help more than
>> I dare presume.
>>
>
> I'm also worried about that portion, but perhaps a nuanced reading could
> help us? If a 10% increase in satisfaction is 15% -> 16.5%, then a 10%
> reduction in ZRR is 30% -> 27%. We don't yet have the numbers for
> non-automata so it's harder to say what exactly it is, but we finally have
> the data in Hadoop, which should make it possible to determine
> non-automata related issues.


Yeah, we need to be able to effectively sample what we want to affect so we
can gauge how well anything we try actually works.
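For what it's worth, here's the back-of-the-envelope version of the numbers in the two paragraphs above (all figures come from the thread; the script just checks the arithmetic):

```python
# A "10% reduction" in ZRR is relative, not percentage points:
# 30% ZRR -> 27% ZRR.
zrr = 0.30
new_zrr = round(zrr * (1 - 0.10), 2)
print(new_zrr)  # 0.27

# Ceiling estimate for language detection: 12% of non-DOI zero-result
# queries were "in a language", and roughly a third of those got results
# on the "correct" wiki. So at best ~4% of zero-result queries are
# recoverable this way -- short of a 10% relative reduction.
ceiling = round(0.12 * (1 / 3), 3)
print(ceiling)  # 0.04
```

That ~4% ceiling is why the wiggle room is the worry; filtering bots could move both numbers, though.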

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
