Doing this well is harder. Giving each page a spam score and boosting
by a function of that score is probably a stronger tool. I can't remember
where I found the paper below, but it gives a solid spam-score algorithm
built from several easy-to-code text analyses and a scoring function.
This assumes you pre-process your pages.

Detecting Spam Web Pages through Content Analysis
WWW 2006, May 23-26, 2006, Edinburgh, Scotland.
ACM 1-59593-323-9/06/0005.

Also, Z. Gyongyi and H. Garcia-Molina have some interesting papers.
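
To make the "boost by a function of the score" idea concrete, here is a
minimal sketch. It assumes a pre-computed spam score in [0, 1] (1 = almost
certainly spam); the class name, the MIN_BOOST floor, and the linear mapping
are all hypothetical choices for illustration and are not taken from the
paper. The result could then be used, for example, as an index-time
document boost.

```java
/** Sketch: turn a pre-computed spam score into a ranking boost. */
final class SpamBoost {
    /** Hypothetical floor so that no page is boosted all the way to zero. */
    static final float MIN_BOOST = 0.1f;

    /**
     * Maps a spam score in [0, 1] to a boost in [MIN_BOOST, 1].
     * Scores outside [0, 1] are clamped first. The linear mapping is an
     * assumption; any decreasing function of the score would do.
     */
    static float boostFor(float spamScore) {
        float clamped = Math.max(0.0f, Math.min(1.0f, spamScore));
        return Math.max(MIN_BOOST, 1.0f - clamped);
    }
}
```

A clean page (score 0) keeps its full weight, while a page scored as
certain spam is demoted to the floor rather than removed outright.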



-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 11, 2008 1:12 PM
To: solr-user@lucene.apache.org
Subject: Re: capping term frequency?

Hi,

Probably by writing your own Similarity (Lucene codebase) and
overriding the following method (shown here with its default
implementation) so that it caps the frequency:

  /** Implemented as <code>sqrt(freq)</code>. */
  public float tf(float freq) {
    return (float)Math.sqrt(freq);
  }

Then put that custom Similarity in a jar in Solr's lib directory and
specify your Similarity FQCN in the <similarity class="..."/> element at
the bottom of schema.xml.
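
As a rough illustration of the capping itself, the sketch below clamps the
frequency before taking the square root. It is a standalone class so it
compiles without Lucene on the classpath; in a real Similarity subclass the
same expression would form the body of the tf(float freq) override. The
class name and the MAX_FREQ value are hypothetical and would need tuning
for your corpus.

```java
/** Sketch: term-frequency factor that stops growing past a cap. */
final class CappedTf {
    /** Hypothetical cap: occurrences beyond this count add nothing. */
    static final float MAX_FREQ = 25.0f;

    /**
     * Same as the default sqrt(freq) up to MAX_FREQ, then flat.
     * With MAX_FREQ = 25 the factor never exceeds 5.
     */
    static float tf(float freq) {
        return (float) Math.sqrt(Math.min(freq, MAX_FREQ));
    }
}
```

With this cap, a keyword repeated 1000 times contributes the same tf
factor as one repeated 25 times, so heavy repetition can no longer
dominate the score.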

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: peter360 <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, April 11, 2008 2:16:53 PM
Subject: capping term frequency?


Hi,
How do I cap the term frequency when computing relevancy scores in Solr?

The problem is that if a keyword repeats many times in the same document,
I don't want it to hijack the relevancy score.  Can I tell Solr to cap the
term frequency at a certain threshold?

thanks.
--
View this message in context:
http://www.nabble.com/capping-term-frequency--tp16628189p16628189.html
Sent from the Solr - User mailing list archive at Nabble.com.



