David Causse: quite welcome. Thanks for looking into why those queries
weren't doing well and attempting to correct them.
Erik Bernhardson: simply amazing and fast work. Calculating SAT/DSAT from
the logged search sessions and dwell times will give you a cleaner sense of
how well you're satisfying the user's query intent. From there you can
calculate time-to-success/time-to-non-success and session-success-rate.
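
As a rough sketch, assuming a log schema with session_id/ts/action/dwell
fields (my invention, not your actual schema) and the common ~30s-dwell SAT
heuristic:

```python
# Sketch: derive SAT clicks and session-level metrics from logged events.
# Field names (session_id, ts, action, dwell) are assumptions about the schema.
SAT_DWELL_SECS = 30  # common heuristic: a click with >=30s dwell counts as SAT

def session_metrics(events):
    """events: list of dicts, sorted by ts within each session."""
    sessions = {}
    for e in events:
        sessions.setdefault(e["session_id"], []).append(e)
    successes, times_to_success = 0, []
    for evs in sessions.values():
        start = evs[0]["ts"]
        sat = [e for e in evs
               if e["action"] == "click" and e.get("dwell", 0) >= SAT_DWELL_SECS]
        if sat:
            successes += 1
            times_to_success.append(sat[0]["ts"] - start)  # time-to-success
    return successes / len(sessions), times_to_success
```

The 30s threshold is just a starting point; you'd want to tune it against
manual judgements.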

In the *Number of searches per session* table, it's interesting that 69.35%
of users only do a single query. Given that the click rate is low, perhaps
there's a way to get users to do a second query: for example, adding related
queries, loosening the spell correction, or offering search tips. Showing
top/trending/recommended queries, trending pages, etc. won't satisfy the
user's original intent, but may help other long-term user satisfaction
metrics like sessions-per-user-per-month.

Removing ZRPs (zero-results pages) from the click rate may give you
interesting results. Removing the ZRPs will focus more on precision issues;
ZRP is a recall issue (assuming relevant content exists). It will let you
look at, for instance, the top queries which had results but got no clicks.
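
For example (the record shape here is assumed):

```python
# Sketch: click rate with and without zero-results pages (ZRP).
# Each record is (query, num_results, clicked) -- an assumed log shape.
def click_rates(records):
    non_zrp = [r for r in records if r[1] > 0]  # drop ZRPs (recall issue)
    rate = lambda rs: sum(1 for _, _, c in rs if c) / len(rs) if rs else 0.0
    return rate(records), rate(non_zrp)  # (overall, precision-focused)
```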

There can also be good abandonment
<http://research.microsoft.com/en-us/um/people/sdumais/CIKM2012-fp085-diriye.pdf>
[pdf] if the user gets the answer directly, e.g., when a calculator answer
is displayed and the user's query intent is satisfied.

*Recall improvements*
The general way to improve recall is to expand the number of ways to match
the pages (the other main way is to get more documents). I'm assuming you're
already doing stemming/phonetics/fuzzy-matching in ES (which expands the
comparison methods). You can also expand the number of ways to match from
both ends: either by expanding the query (query rewriting/augmentation), or
by augmenting the data in the index for each document.

Query side - Expanding the query can be, for example, turning the query
{automobile} into {automobile OR car OR truck OR motor-vehicle}, which
can be grabbed from WordNet <https://en.wikipedia.org/wiki/WordNet> and
other sources. ES also has built-in support for synonym lists like
WordNet's. You may also expand the query to related terms, like {automobile}
to {automobile OR car-door OR catalytic-converter}, as articles about
automobile components are also about cars.
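
A toy sketch of the query-side rewrite; the synonym table is a hand-typed
sample of WordNet's synset for "automobile", and in production you'd load a
WordNet-format synonym file into an ES synonym token filter instead:

```python
# Sketch: naive query expansion via a synonym table.
# This table is a hand-typed sample of WordNet's synset for "automobile";
# a real system would load the full WordNet data.
SYNONYMS = {"automobile": ["car", "auto", "motorcar"]}

def expand(query):
    # Rewrite a single-term query into an OR over its synonyms.
    terms = [query] + SYNONYMS.get(query, [])
    return " OR ".join(terms)
```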

Index side - Expanding the index side is quite fun too. For example, you
could grab the content of each reference and add that to the index. You
could pull in the title of each wiki page pointing to the current page. You
could pull in the translations of each localized page, e.g. translate
[w:en:car] to Russian and add the machine-generated translation's content
to [w:ru:Автомобиль]. You could pull all the data from
wikidata/dbpedia/freebase/etc. into the index.
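
A minimal sketch of the index-side augmentation, where
`incoming_link_titles` and `translated_text` stand in for the link-graph
and machine-translation outputs (both are assumptions for illustration):

```python
# Sketch: augment a document with extra matchable text before indexing.
# incoming_link_titles: titles of wiki pages linking to this page (assumed input)
# translated_text: MT output from another language's version (assumed input)
def augment_doc(doc, incoming_link_titles, translated_text=None):
    doc = dict(doc)  # don't mutate the caller's copy
    doc["anchor_titles"] = " ".join(incoming_link_titles)
    if translated_text:
        doc["translated_content"] = translated_text
    return doc
```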

Most recall improvements are at the expense of precision, aka added noise.

*Identifying poorly performing queries*
To get insight into underperforming classes of queries, you can look at
users who search on Wikipedia and are shortly re-seen coming back from
Google/Bing/DDG/etc., and compare why the external query string got them to
the article when their original query did not. This set may be too small,
so you could compare all incoming queries for a page from Google et al. to
the Wikipedia queries and look for differences. The Google et al. queries
which aren't seen in the wiki query set leading to the page will be
interesting and highlight deficiencies.
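
The comparison boils down to a set difference over normalized query
strings, something like:

```python
# Sketch: queries seen only in external-referrer traffic for a page hint at
# vocabulary the on-wiki query pipeline fails to match.
def missing_queries(external_queries, wiki_queries):
    norm = lambda qs: {q.strip().lower() for q in qs}
    return norm(external_queries) - norm(wiki_queries)
```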

Taking a look at the top queries with a low click rate (or better, a high
DSAT click rate or a high session abandonment rate) may be interesting.

Good clean logs help investigations and metrics, and it looks like a first
step may be removing spam/bots/etc. With a clean log, you can find
classes of real queries (as opposed to the odd queries
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries>
issued by bots, etc.) that don't receive clicks, or results.
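
A crude sketch of that cleaning step; the user-agent patterns are
illustrative rather than a real bot list, and the log shape is assumed:

```python
import re
from collections import Counter

# Illustrative bot patterns only -- a real deployment needs a proper bot list.
BOT_UA = re.compile(r"bot|crawler|spider|curl|wget", re.IGNORECASE)

def no_click_queries(log, top_n=10):
    """log: iterable of (query, user_agent, clicked) tuples (assumed shape).
    Returns the most frequent non-bot queries that never received a click."""
    clicks, counts = Counter(), Counter()
    for query, ua, clicked in log:
        if BOT_UA.search(ua):
            continue  # drop likely-bot traffic
        counts[query] += 1
        clicks[query] += int(clicked)
    return [(q, n) for q, n in counts.most_common() if clicks[q] == 0][:top_n]
```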

*Measuring current relevance*
Have you folks set up a judgement system? Just a simple way to mark the
quality of a result vs. a query. Generally you'd want to take the results
from a single query, randomly order them, and have a human judge a score
for each result vs. the query. The random order removes positional bias
effects. You may want to pull a set of SERP results from
Google/Bing/DDG/{local search engine for language X} into the same
judgement set, which will let you directly compare the quality of each. I'd
recommend NDCG
<https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>@1,3,10
as an evaluation metric. A side benefit of manual judgements is that they
produce a clean training/evaluation set for testing ranking improvements
offline (before A/B testing). Small test-iterate-fail cycles are great. In
the long term, these can be used to train machine-learned rankers, or can
be augmented/replaced by SAT/DSAT-based training sets (learning to rank
<https://en.wikipedia.org/wiki/Learning_to_rank>).
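
For reference, NDCG@k from graded judgements (0 = bad … 3 = perfect) is
only a few lines:

```python
import math

def dcg(gains, k):
    # Discounted cumulative gain over the top-k judged gains, in ranked order.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k):
    # Normalize by the DCG of the ideal (best-first) ordering.
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal else 0.0
```

An NDCG of 1.0 means the ranking matches the ideal ordering of the judged
results.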

--justin

On Thu, Feb 11, 2016 at 1:53 AM, David Causse <[email protected]> wrote:

> Thank you for all your suggestions, very inspiring!
> [response inline]
>
> On 10/02/2016 at 21:33, Justin Ormont wrote:
>
> Good hits on page two:
>
> There's a few cases where good results could exist only on page two.
>
> One case is when incorrectly searching for a homophone or other
> misspelling. Eg: "their red hot" instead of "they're red hot" (expected
> result <https://en.wikipedia.org/wiki/They%27re_Red_Hot> -- wikipedia
> <https://en.wikipedia.org/w/index.php?search=their+red+hot&title=Special%3ASearch>
>  (pos
> 22), google
> <https://www.google.com/search?q=their+red+hot&oq=their+red+hot> (pos 1),
> bing <https://www.bing.com/search?q=their+red+hot> (pos 2), ddg
> <https://duckduckgo.com/?q=their+red+hot> (pos 2)).
>
>
> Indeed, we do a pretty bad job for this kind of query. But I still don't
> know how to address that correctly. We don't use any synonym resources yet.
> This is usually addressed by the list of curated redirects, in this example
> we're able to catch only "theyre red hot" but we fail for their/there/....
>
>
>
> Another case is when you get an exact string match on incorrect pages, but
> only non-exact string match on the correct page. Eg: "Cities in the San
> Francisco Bay Area" (expected result
> <https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_the_San_Francisco_Bay_Area>
> -- wikipedia
> <https://en.wikipedia.org/w/index.php?title=Special:Search&search=Cities+in+the+San+Francisco+Bay+Area>
>  (pos
> 122), google
> <https://www.google.com/search?q=Cities+in+the+San+Francisco+Bay+Area> (pos
> 1), bing
> <https://www.bing.com/search?q=Cities+in+the+San+Francisco+Bay+Area> (pos
> 1), ddg <https://duckduckgo.com/?q=Cities+in+the+San+Francisco+Bay+Area> (pos
> 1)).
>
> This style occurs mostly for a navigation query (only one correct result).
> For explorative queries, odds are one of the relevant results will be on
> page 1.
>
> There's a couple less direct cases, for instance if/once you integrate a
> popularity score, freshness score, importance score, page query score, or
> personalization (eg. ranking by physical distance from user or user's
> interests), you'll find some examples where incorrect results are
> non-helpfully boosted.
>
>
> You're completely right and this is exactly the case here. We always
> rescore the top 8000 documents (per node) with the number of incoming links
> (which is far from ideal). By disabling all the top-N rescoring features
> the expected result is now #2:
>
>
> https://en.wikipedia.org/w/index.php?search=Cities+in+the+San+Francisco+Bay+Area&title=Special%3ASearch&go=Go&cirrusBoostLinks=no&cirrusPhraseWinwdow=1&cirrusPhraseWindow=1
>
> We don't do anything smart here, it's always the same plan whatever the
> query is...
>
>
> Investigating queries which lead to clicks on page two may find
> interesting things popping out.
>
> --
>
> Knowing the SAT/DSAT-click-rate-vs.-position will tell you if good clicks
> often occur beyond position 10. Then running an experiment of 10 SERP
> results vs. 20 SERP results may give interesting insights when watching a
> session-success-rate metric (and maybe a time-to-success metric). I.e.,
> checking whether a click on position 11+ is ever useful, or just leads to
> a requery or abandonment. If you run result size experiments, you can
> normalize for the query latency effects by generating 20 and displaying 10.
>
> The need for scrolling can cause a faster fall-off of the click rates
> listed. On my web browser, as it's currently sized, there are only three
> results above the fold (my open advanced facet block takes a lot of space;
> scrolling is required for result 4+). Knowing how much/if the click rate
> drops for results below the fold will also help optimize the number of
> results to display, snippet length, and UI design. You could instrument
> the number of results above the fold.
>
> --
>
> Side note: possible bug, I can't find the page "List of New York
> University alumni
> <https://en.wikipedia.org/wiki/List_of_New_York_University_alumni>" when
> querying "New York University alumni
> <https://en.wikipedia.org/w/index.php?search=New+York+University+alumni&title=Special%3ASearch&go=Go>"
> (screenshot <https://imgur.com/SymW9tv>).
>
>
> Yes... I usually find the params to tweak the query and push good results
> near the top but not here...
> I'll have to dig into more details to see what's going on, the best I can
> do is a rank around pos 120 :(
>
> Thank you very much for your help!
>
> --
> David
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>