Sorry, I got myself confused here. PageRank and page views are different concepts; when I wrote my reply, I was thinking about a related task <https://phabricator.wikimedia.org/T113439> to use page views to improve scoring.

Thanks,
Dan

On 24 September 2015 at 20:22, Dan Garry <[email protected]> wrote:

> Thanks for the summary, Erik! It sounds very promising to me, and logical
> that we should use page views to affect the weight of the results. But, of
> course, we should be careful that we don't weight the page views so high
> that we end up giving the system criticality and creating a positive
> feedback loop, where random fluctuations in page views push up irrelevant
> results in the scoring, which gets them more page views, which pushes them
> up further, and so on.
>
> Dan
>
> On 21 September 2015 at 08:07, Erik Bernhardson <
> [email protected]> wrote:
>
>> Late last week, while looking over our existing scoring methods, I was
>> thinking that while counting incoming links is nice, a couple of guys
>> dominated search with (among other things) a better way to judge the
>> quality of incoming links, aka PageRank.
>>
>> PageRank takes a very simple input: it just needs a list of all links
>> between pages. We happen to already store all of these in elasticsearch. I
>> wrote a few scripts to suck out the full enwiki graph (~400M edges), ship
>> it over to stat1002, throw it into hadoop, and crunch it with a few hundred
>> cores. The end result is a score for every NS_MAIN page in enwiki based on
>> the quality of incoming links.
>>
>> I've taken these calculated PageRank scores and used them as the scoring
>> method for search-as-you-type for http://en-suggesty.wmflabs.org.
>>
>> Overall this seems promising as another scoring metric to integrate into
>> our search results. I'm not sure yet how to figure out things like how much
>> weight PageRank should have in the score. This might be yet another thing
>> where building out our relevance lab would enable us to make more informed
>> decisions.
>>
>> Overall I think some sort of pipeline from hadoop into our scoring system
>> could be quite useful. The initial idea seems to be to crunch data in
>> hadoop, stuff it into a read-only api, and then query it back out at
>> indexing time in elasticsearch to be held within the ES docs. I'm not sure
>> what the best way will be, but having a simple and repeatable way to
>> calculate scoring info in hadoop and ship that into ES will probably become
>> more and more important.
>>
>> _______________________________________________
>> Wikimedia-search mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>>
>>
>
>
> --
> Dan Garry
> Lead Product Manager, Discovery
> Wikimedia Foundation
>

--
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
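
As a footnote to the thread: Erik's point that PageRank needs only a list of links between pages can be illustrated with a small power-iteration sketch. This is not the hadoop job he ran; the function name `pagerank`, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration, and the toy edge list stands in for the ~400M-edge enwiki graph.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) links.

    Illustrative sketch only: damping and iterations are assumed defaults,
    not values taken from the thread.
    """
    nodes = set()
    out_links = {}
    for src, dst in edges:
        nodes.add(src)
        nodes.add(dst)
        out_links.setdefault(src, []).append(dst)

    if not nodes:
        return {}

    n = len(nodes)
    # Start from a uniform distribution over all pages.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each page passes an equal share of its rank along every link.
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank

    return rank


# Hypothetical four-link graph: A -> B, A -> C, B -> C, C -> A.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
toy_ranks = pagerank(toy_edges)
for page in sorted(toy_ranks, key=toy_ranks.get, reverse=True):
    print(page, round(toy_ranks[page], 4))
```

In this toy graph C ends up ranked highest, because it receives links from both A and B, while B, linked only from A, ranks lowest. How much weight such a score should carry in the final search score is the open question raised in the thread.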
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
