Re: [discovery] [Wikimedia-search] Page rank

Dan Garry Thu, 24 Sep 2015 20:50:23 -0700

Sorry, got myself confused here. Page rank and page views are different
concepts; I was thinking about a related task
<https://phabricator.wikimedia.org/T113439> to use page views to improve
scoring when writing this.


Thanks,
Dan

On 24 September 2015 at 20:22, Dan Garry <[email protected]> wrote:

> Thanks for the summary, Erik! It sounds very promising to me, and logical
> that we should use page views to affect the weight of the results. But, of
> course, we should be careful that we don't weight the page views so high
> that we end up giving the system criticality and creating a positive
> feedback loop, where random fluctuations in page views push up irrelevant
> results in the scoring, which gets them more page views, which pushes them up
> further, and so on.
>
> Dan
>
> On 21 September 2015 at 08:07, Erik Bernhardson <
> [email protected]> wrote:
>
>> Late last week while looking over our existing scoring methods I was
>> thinking that while counting incoming links is nice, a couple guys
>> dominated search with (among other things) a better way to judge the
>> quality of incoming links, aka PageRank.
>>
>> PageRank takes a very simple input: it just needs a list of all links
>> between pages. We happen to already store all of these in elasticsearch. I
>> wrote a few scripts to suck out the full enwiki graph (~400M edges), ship
>> it over to stat1002, throw it into hadoop, and crunch it with a few hundred
>> cores. The end result is a score for every NS_MAIN page in enwiki based on
>> the quality of incoming links.
>>
>> I've taken these calculated PageRank scores and used them as the scoring
>> method for search-as-you-type for http://en-suggesty.wmflabs.org.
>>
>> Overall this seems promising as another scoring metric to integrate to
>> our search results. I'm not sure yet how to decide things like how much
>> weight PageRank should have in the score. This might be yet another thing
>> where building out our relevance lab would enable us to make more informed
>> decisions.
>>
>> Overall I think some sort of pipeline from hadoop into our scoring system
>> could be quite useful. The initial idea seems to be to crunch data in
>> hadoop, stuff it into a read-only api, and then query it back out at
>> indexing time in elasticsearch to be held within the ES docs. I'm not sure
>> what the best way will be, but having a simple and repeatable way to
>> calculate scoring info in hadoop and ship that into ES will probably become
>> more and more important.
>>
>> _______________________________________________
>> Wikimedia-search mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>>
>>
>
>
> --
> Dan Garry
> Lead Product Manager, Discovery
> Wikimedia Foundation
>



-- 
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation

_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
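
Editorial note: for readers unfamiliar with the computation Erik describes
(a score per page derived from a list of all links between pages), here is a
minimal sketch of PageRank by power iteration. It is not the Hadoop job that
was run over the enwiki graph. The function name `pagerank`, the damping
factor of 0.85, the fixed iteration count, and the toy graph are all
assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Score pages in a directed link graph given as (source, target) pairs.

    Illustrative sketch only: the damping factor and iteration count are
    assumed values, not ones taken from the thread above.
    """
    pages = set()
    out_links = {}  # page -> set of pages it links to
    for src, dst in edges:
        pages.add(src)
        pages.add(dst)
        out_links.setdefault(src, set()).add(dst)

    n = len(pages)
    if n == 0:
        return {}

    # Start from a uniform distribution over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly,
        # so the scores keep summing to 1.
        dangling = sum(rank[p] for p in pages if p not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}

        # Each page passes an equal share of its rank to every page it links to.
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank

    return rank


# Toy graph (hypothetical page names): C receives the most incoming links.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
scores = pagerank(toy_edges)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

In the toy graph, A ranks above B even though each has exactly one incoming
link, because A's link comes from the highly ranked C. That is the "quality
of incoming links" effect that plain link counting misses.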
