The original request was for suggestions ranked purely by request count. You 
have designed something more complicated that probably works better.

When I built query completion at Netflix, I used the movie rental rates to rank 
suggestions. That was simple and very effective. We didn't need a more 
complicated system because we started with a good metric.

wunder

On Sep 20, 2011, at 4:34 PM, Markus Jelsma wrote:

> 
>> Of course you can fight spam. And the spammers can fight back. I prefer
>> algorithms that don't require an arms race with spammers.
>> 
>> There are other problems with using query frequency. What about all the
>> legitimate users that type "google" or "facebook" into the query box
>> instead of into the location bar? What about the frequent queries that
>> don't match anything on your site?
> 
> How would that be a problem if you collect the information? The query logs
> provide numFound and QTime and a lot more information, and we collect cookie
> IDs and (hashed) IP addresses for the same request.
> 
> We also collect the type of query that is issued, so we can identify a
> _legitimate_ user (this is something we can reasonably detect) using the same
> search terms when sorting, paging, faceting, etc. If it is not a legitimate
> user, we can act accordingly.
> 
> A legitimate user would count as +1 for the search term. The final count can
> then be passed through a logarithm to flatten it out. If things still get out
> of control, we would most likely be dealing with a DoS attack instead.
> 
>> 
>> If an algorithm needs that many patches, it is fundamentally a weak
>> approach.
> 
> I do not agree. There are many conditions to consider.
> 
>> 
>> wunder
>> 
>> On Sep 20, 2011, at 4:11 PM, Markus Jelsma wrote:
>>> A query log parser can be written to detect spam. At first you can use
>>> cookies (e.g. sessions) and IP-addresses to detect term spam. You can
>>> also limit a popularity spike to a reasonable mean size over a longer
>>> period. And you can limit rates using logarithms.
>>> 
>>> There are many ways to deal with spam and maintain decent statistics.
>>> 
>>> In practice, it's not a big problem on most sites.
>>> 
>>>> Ranking suggestions based on query count would be trivially easy to
>>>> spam. Have a bot make my preferred queries over and over again, and
>>>> "boom" they are the most-preferred.
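
A minimal sketch of the log-parsing approach discussed in this thread, not
anyone's actual implementation. It assumes each parsed log entry is a dict with
the hypothetical keys 'query', 'cookie_id', 'ip_hash', and 'num_found'; the
function name and the scoring formula are illustrative assumptions as well. It
counts each (cookie, hashed IP) pair at most once per query, drops queries that
matched nothing, and flattens the counts with a logarithm.

```python
import math
from collections import defaultdict


def suggestion_scores(log_entries):
    """Score queries from parsed log entries (hypothetical entry format)."""
    users_per_query = defaultdict(set)
    for entry in log_entries:
        # Skip queries that matched nothing on the site (numFound == 0).
        if entry["num_found"] == 0:
            continue
        query = entry["query"].strip().lower()
        # Record each (cookie, hashed IP) pair once per query, so paging,
        # sorting, or a bot repeating itself adds at most +1.
        users_per_query[query].add((entry["cookie_id"], entry["ip_hash"]))
    # Flatten the distinct-user counts with a logarithm to damp spikes.
    return {q: math.log1p(len(users)) for q, users in users_per_query.items()}
```

Note that this only blunts the simplest attack, a single client repeating a
query. A bot that rotates cookies and addresses would still inflate the count,
which is the arms race raised earlier in the thread.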




