> Of course you can fight spam. And the spammers can fight back. I prefer
> algorithms that don't require an arms race with spammers.
>
> There are other problems with using query frequency. What about all the
> legitimate users that type "google" or "facebook" into the query box
> instead of into the location bar? What about the frequent queries that
> don't match anything on your site?
How would that be a problem if you collect the information? The query logs
provide numFound, QTime and a lot more information, and we collect cookie IDs
and (hashed) IP addresses for the same request. We also collect the type of
query that is issued, so we can identify a _legitimate_ user (this is
something we can reasonably detect) reusing the same search terms when
sorting, paging, faceting etc. If it is not a legitimate user, we can act
accordingly. Each legitimate query counts as +1 for the search term. The
final count can then be passed through a logarithm to flatten it out (a rough
sketch of that bookkeeping is at the end of this mail). If things still get
out of control, we are most likely dealing with a DOS attack instead.

> If an algorithm needs that many patches, it is fundamentally a weak
> approach.

I do not agree. There are many conditions to consider.

> wunder
>
> On Sep 20, 2011, at 4:11 PM, Markus Jelsma wrote:
> > A query log parser can be written to detect spam. At first you can use
> > cookies (e.g. sessions) and IP-addresses to detect term spam. You can
> > also limit a popularity spike to a reasonable mean size over a longer
> > period. And you can limit rates using logarithms.
> >
> > There are many ways to deal with spam and maintain decent statistics.
> >
> > In practice, it's not a big problem on most sites.
> >
> >> Ranking suggestions based on query count would be trivially easy to
> >> spam. Have a bot make my preferred queries over and over again, and
> >> "boom" they are the most-preferred.
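PS: to make the counting concrete, here is a rough sketch in Java of the
bookkeeping I have in mind. The class and method names are made up for
illustration (not an existing API); it only shows the two ideas above: one
vote per term per session/hashed IP, and a logarithm over the final count.

  // Illustrative sketch only -- names are invented, not an existing API.
  // Assumes each parsed log record carries the query term, a session or
  // cookie id (or hashed IP), and a flag saying whether the request looked
  // like a legitimate search rather than a paging/sorting/faceting repeat.
  import java.util.*;

  public class SuggestCounter {

      // one vote per (term, session): a bot hammering the same query from
      // one cookie/IP only counts once
      private final Map<String, Set<String>> votes = new HashMap<>();

      public void record(String term, String sessionId, boolean legitimate) {
          if (!legitimate) {
              return; // ignore repeats and flagged clients
          }
          votes.computeIfAbsent(term, t -> new HashSet<>()).add(sessionId);
      }

      // pass the raw count through a log so a spike cannot dominate the ranking
      public double score(String term) {
          int count = votes.getOrDefault(term, Collections.emptySet()).size();
          return Math.log1p(count);
      }
  }

Capping a term's daily count at some multiple of its long-term mean, as in the
earlier mail quoted above, would slot in just before the logarithm.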