Replies inline.

On Tue, Nov 3, 2015 at 2:25 PM, Trey Jones <[email protected]> wrote:
> Sorry I didn't respond to this sooner!
>
> I really like the idea of trying to detect what languages the user can
> read, and searching in (a subset of) those. This wouldn't benefit from
> relevance lab testing, though. It'll need to be measured against the user
> satisfaction metric. (BTW, Do we have a sense of how many users have info
> we can detect for this?)
>
> I think the biggest problem with language detection is the quality of the
> language detector. The Elastic Search plugin we tested has a Romanian
> fetish when run on our queries (Erik got about 38% Romanian on 100K enwiki
> searches, which is crazy, and I got 0% accuracy for Romanian on my much
> smaller tagged corpus of failed (zero results) queries to enwiki). Most of
> the time, I would expect queries sent to the wrong wiki to fail (though
> there are some exceptions)—but a query in English that does get hits in
> rowiki is going to just look wrong most of the time.
>
> There are several proposals for improving language detection in the
> etherpad, and we can work on them in parallel, since any given one could be
> better than any other one. (We don't want to make 100 of them, but a few to
> test and compare would be nice—there may also be reasonable speed/accuracy
> tradeoffs to be made, e.g., 2% decrease in accuracy for 2x speed is a good
> deal.)
>

My worry here is that we would then need to productionize it. Several of
the options I see are basically libraries that we would have to build a
service (or ES plugin) around. I do think we should investigate this and
decide whether the effort to productionize is worth the impact we are able
to estimate in the relevance lab.

> We need training and evaluation data. I see a few ways of getting it. The
> easy, lower-quality way is just take queries from a given wiki and assume
> they are in the language in question (i.e., eswiki queries are in Spanish).
> Easy, not 100% accurate, unlimited supply.
> The hard, higher-quality way is
> to hand annotate a corpus of queries. This is slow, but doable. I can do on
> the order of 1000 queries in a day—more if I were less accurate and more
> willing to toss stuff into the junk pile. I couldn't do it for a week
> straight, though, without going crazy. A possible middle of the road
> approach would be to create a feedback loop and run detectors on our
> training data and review and remove items that are not in the desired
> language (we could also start by filtering things that are not in the right
> character set, like removing all Arabic, Cyrillic, and Chinese from enwiki,
> frwiki, and eswiki queries). If we want thousands of hand-annotated
> queries, we need to get annotating!
>

This is probably the biggest sticking point. Another random idea: we have
speakers of several languages on the team and in the foundation (as in,
under NDA and able to review queries that may contain PII). Would it be
enough to grab example queries from wikis of the correct language and then
have someone who knows the language filter through them and delete
nonsensical / wrong-language queries? I'm guessing this would go faster,
but I'm not sure it's as valuable.

> I think we can use the relevance lab to help evaluate a language detector
> (at least with respect to zero results rate). We could run the detector
> against a pile of zero-results queries, then group the queries by detected
> language, and run them against the relevant wiki (if we have room in labs
> for the indexes, and we update the relevance lab tools to support choosing
> a target wiki to search). We wouldn't be comparing "before" and "after",
> but just measuring the zero results rate against the target wiki. As any
> time we're using zero-results rate, there's no guarantee that we'll be
> giving good results, just results (e.g., "unix time stamp" queries with
> English words fail on enwiki but sometimes work on zhwiki for some reason,
> but that's not really better.)
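
A minimal sketch of the measurement described above (run a detector over
zero-results queries, group by detected language, and measure the zero
results rate against the matching wiki), in Python. Both `detect_language`
and `count_hits` are hypothetical stand-ins for whichever detector and
relevance lab search call end up being used; neither is an existing API.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable


def zero_results_rate_by_language(
    queries: Iterable[str],
    detect_language: Callable[[str], str],   # hypothetical: query -> language code
    count_hits: Callable[[str, str], int],   # hypothetical: (language, query) -> hit count
) -> Dict[str, float]:
    """Return, per detected language, the fraction of queries with zero hits."""
    totals: Dict[str, int] = defaultdict(int)
    zeros: Dict[str, int] = defaultdict(int)
    for query in queries:
        lang = detect_language(query)
        totals[lang] += 1
        # Search the wiki that matches the detected language.
        if count_hits(lang, query) == 0:
            zeros[lang] += 1
    # Every language in totals has at least one query, so no division by zero.
    return {lang: zeros[lang] / totals[lang] for lang in totals}
```

This only reports whether each query got any results on the target wiki,
so it carries the same caveat as above: a lower rate means more results,
not necessarily better ones.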
>
> I'm somewhat worried about being able to reduce the targeted zero results
> rate by 10%. In my test[1], only 12% of non-DOI zero-results queries were
> "in a language", and only about a third got results when searched in the
> "correct" (human-determined) wiki. I didn't filter bots other than the DOI
> bot, and some non-language queries (e.g., names) might get results in
> another wiki, but there may not be enough wiggle room. There's a lot of
> junk in other languages, too, but maybe filtering bots will help more than
> I dare presume.
>

I'm also worried about that portion, but perhaps a nuanced reading could
help us? If a 10% increase in satisfaction is 15% -> 16.5%, i.e. a relative
change, then a 10% reduction in ZRR is likewise relative: 30% -> 27%, not
30% -> 20%. We don't yet have the numbers for non-automata queries, so it's
harder to say exactly what the target is, but we finally have the data in
Hadoop, which should make it possible to work out the non-automata numbers.

> [1]
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Cross_Language_Wiki_Searching#Perfect_identification.2C_ignoring_non-language_queries
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Mon, Nov 2, 2015 at 9:03 PM, Erik Bernhardson <[email protected]> wrote:
>
>> It measures the zero results rate for 1 in 10 search requests via
>> CirrusSearchUserTesting log that we used last quarter.
>>
>> On Mon, Nov 2, 2015 at 6:01 PM, Oliver Keyes <[email protected]> wrote:
>>
>>> Define this "does it do anything?" test?
>>>
>>> On 2 November 2015 at 19:58, Erik Bernhardson
>>> <[email protected]> wrote:
>>> > Now that we have the feature deployed (behind a feature flag), and
>>> > have an initial "does it do anything?" test going out today, along
>>> > with an upcoming integration with our satisfaction metrics, we need
>>> > to come up with how we will try to further move the needle forward.
>>> >
>>> > For reference these are our Q2 goals:
>>> >
>>> > Run A/B test for a feature that:
>>> >
>>> > Uses a library to detect the language of a user's search query.
>>> > Adjusts results to match that language.
>>> >
>>> > Determine from A/B test results whether this feature is fit to push to
>>> > production, with the aim to:
>>> >
>>> > Improve search user satisfaction by 10% (from 15% to 16.5%).
>>> > Reduce zero results rate for non-automata search queries by 10%.
>>> >
>>> > We brainstormed a number of possibilities here:
>>> >
>>> > https://etherpad.wikimedia.org/p/LanguageSupportBrainstorming
>>> >
>>> > We now need to decide which of these ideas we should prioritize. We
>>> > might want to take into consideration which of these can be
>>> > pre-tested with our relevancy lab work, such that we can prefer to
>>> > work on things we think will move the needle the most. I'm really not
>>> > sure which of these to push forward on, so let us know which you
>>> > think can have the most impact, or where the expected impact could be
>>> > measured with relevancy lab with minimal work.
>>> >
>>> > _______________________________________________
>>> > discovery mailing list
>>> > [email protected]
>>> > https://lists.wikimedia.org/mailman/listinfo/discovery
>>> >
>>>
>>> --
>>> Oliver Keyes
>>> Count Logula
>>> Wikimedia Foundation
