> The original request was for suggestions ranked purely by request count.
> You have designed something more complicated that probably works better.
> 
> When I built query completion at Netflix, I used the movie rental rates to
> rank suggestions. That was simple and very effective. We didn't need a
> more complicated system because we started with a good metric.

Good point! I got carried away when a user asked about sorting on request
count. A metric like the one you describe is indeed a lot easier :)
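For the archives, here is a minimal sketch of that kind of single-metric
ranking in Java. The class and field names are mine, and the popularity
map is a stand-in for whatever metric (rental rates, sales, clicks) a
site actually has:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    class SuggestionRanker {
        // Popularity per term from some external metric, e.g. rental
        // rates; the map is assumed to be built elsewhere.
        private final Map<String, Double> popularity;

        SuggestionRanker(Map<String, Double> popularity) {
            this.popularity = popularity;
        }

        // Sort candidate completions by the metric, highest first.
        List<String> rank(List<String> candidates) {
            candidates.sort(Comparator.comparingDouble(
                    (String s) -> popularity.getOrDefault(s, 0.0)).reversed());
            return candidates;
        }
    }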

Cheers

> 
> wunder
> 
> On Sep 20, 2011, at 4:34 PM, Markus Jelsma wrote:
> >> Of course you can fight spam. And the spammers can fight back. I prefer
> >> algorithms that don't require an arms race with spammers.
> >> 
> >> There are other problems with using query frequency. What about all the
> >> legitimate users that type "google" or "facebook" into the query box
> >> instead of into the location bar? What about the frequent queries that
> >> don't match anything on your site?
> > 
> > How would that be a problem if you collect the information? The query
> > logs provide numFound, QTime, and a lot more information, and we
> > collect cookie IDs and (hashed) IP addresses for the same request.
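(Inline sketch: a rough idea of what one parsed log record might hold,
in Java. The field names are illustrative only, and the IP is stored
as a hash rather than raw:)

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // One parsed query-log entry; field names are illustrative only.
    class QueryLogEntry {
        String query;      // the raw search terms
        long numFound;     // result count Solr reported for the request
        long qTime;        // Solr QTime in milliseconds
        String cookieId;   // session/cookie identifier
        String ipHash;     // hashed, never raw, IP address

        // Hash an IP so the logs never hold the raw address.
        static String hashIp(String ip) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(ip.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }
    }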
> > 
> > We also collect the type of query that is issued, so we can identify a
> > _legitimate_ user (something we can reasonably detect) using the same
> > search terms when sorting, paging, faceting, etc. If it is not a
> > legitimate user, we can act accordingly.
> > 
> > This would count as +1 for the search term. The final count can then be
> > passed through a logarithm to flatten it out. If things still get out of
> > control, we are most likely dealing with a DoS attack instead.
> > 
> >> If an algorithm needs that many patches, it is fundamentally a weak
> >> approach.
> > 
> > I do not agree. There are many conditions to consider.
> > 
> >> wunder
> >> 
> >> On Sep 20, 2011, at 4:11 PM, Markus Jelsma wrote:
> >>> A query log parser can be written to detect spam. To start, you can
> >>> use cookies (e.g. sessions) and IP addresses to detect term spam. You
> >>> can also limit a popularity spike to a reasonable mean over a longer
> >>> period, and you can limit rates using logarithms.
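(Inline sketch: one way to read the spike-limiting idea, capping a
term's daily count at a multiple of its trailing mean; the 3x factor is
made up for illustration:)

    // Cap today's count for a term so a sudden spike cannot lift it
    // far beyond the term's longer-term average popularity.
    static long capSpike(long todayCount, double trailingMeanPerDay) {
        double ceiling = Math.max(1.0, trailingMeanPerDay) * 3.0;
        return Math.min(todayCount, (long) Math.ceil(ceiling));
    }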
> >>> 
> >>> There are many ways to deal with spam and maintain decent statistics.
> >>> 
> >>> In practice, it's not a big problem on most sites.
> >>> 
> >>>> Ranking suggestions based on query count would be trivially easy to
> >>>> spam. Have a bot make my preferred queries over and over again, and
> >>>> "boom" they are the most-preferred.
