Doing this well is harder. Giving a spam score to each page and boosting by a function of that score is probably a stronger tool. I can't remember where I found it, but the paper cited below gives a solid spam-scoring algorithm built from several easy-to-code text analyses, plus a scoring function. This assumes you pre-process the pages.
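To make the boosting half of that concrete, here is a minimal sketch in Java. It assumes the spam score has already been computed per page during the pre-processing step (the cited paper actually feeds its text analyses into a classifier); the boost shape and constants below are my own illustrative choices, not the paper's.

    /**
     * Sketch only: turn a pre-computed spam score in [0, 1] into a document
     * boost. A clean page (score 0) keeps full weight; a near-certain spam
     * page (score 1) is demoted to a quarter of its normal relevancy score.
     * The steepness constant 3f is an arbitrary illustrative choice.
     */
    public class SpamBoost {
        public static float boost(float spamScore) {
            float s = Math.max(0f, Math.min(1f, spamScore)); // clamp bad input
            return 1f / (1f + 3f * s);
        }
    }

The resulting value could be applied as an index-time document boost, or kept in a stored field and folded in at query time with a function query; either way the spam analysis itself happens in the pre-processing step mentioned above.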
Detecting Spam Web Pages through Content Analysis. WWW 2006, May 23-26, 2006, Edinburgh, Scotland. ACM 1-59593-323-9/06/0005. Z. Gyongyi and H. Garcia-Molina also have some interesting papers.

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Friday, April 11, 2008 1:12 PM
To: solr-user@lucene.apache.org
Subject: Re: capping term frequency?

Hi,

Probably by writing your own Similarity (Lucene codebase) and implementing the following method with capping:

    /** Implemented as <code>sqrt(freq)</code>. */
    public float tf(float freq) { return (float)Math.sqrt(freq); }

Then put that custom Similarity in a jar in Solr's lib and specify your Similarity FQCN at the bottom of solrconfig.xml.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: peter360 <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, April 11, 2008 2:16:53 PM
Subject: capping term frequency?

Hi,

How do I cap the term frequency when computing relevancy scores in Solr? The problem is that if a keyword repeats many times in the same document, I don't want it to hijack the relevancy score. Can I tell Solr to cap the term frequency at a certain threshold?

Thanks.

--
View this message in context: http://www.nabble.com/capping-term-frequency--tp16628189p16628189.html
Sent from the Solr - User mailing list archive at Nabble.com.
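For reference, a minimal sketch of the capping Similarity that Otis describes above, assuming the Lucene 2.x package layout that was current at the time; the package, class name, and cap value of 10 are illustrative assumptions, not part of the thread.

    package com.example.search;

    import org.apache.lucene.search.DefaultSimilarity;

    /**
     * Sketch: clamp the raw term frequency before the sqrt, so a keyword
     * repeated hundreds of times scores no higher than MAX_TF occurrences.
     */
    public class CappedTfSimilarity extends DefaultSimilarity {

        private static final float MAX_TF = 10f; // assumed cap, tune per corpus

        @Override
        public float tf(float freq) {
            // Same curve as DefaultSimilarity, but flat beyond the cap.
            return (float) Math.sqrt(Math.min(freq, MAX_TF));
        }
    }

Build that into a jar, drop the jar into Solr's lib directory, and register the fully qualified class name with a <similarity class="com.example.search.CappedTfSimilarity"/> element; in Solr 1.x that element goes at the bottom of schema.xml rather than solrconfig.xml.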