The problem here is defining "irrelevant". There's nothing in Solr that can magically determine "this term is irrelevant in this doc, but this other one isn't"; you have to supply that definition yourself.
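
One way to make "irrelevant" concrete is the recommended keyword list mentioned in the original question: count how much of a record's keyword field falls outside that list before indexing. The following is only a rough sketch of that idea, not anything built into Solr. The function names and both thresholds (`max_keywords`, `max_off_list`) are made up for illustration and would need tuning against real data.

```python
def distinct_keywords(keywords):
    """Normalize a keyword list: trim, lowercase, drop empties and duplicates."""
    return {k.strip().lower() for k in keywords if k and k.strip()}


def off_list_ratio(keywords, recommended):
    """Fraction of distinct keywords that are NOT on the recommended list."""
    distinct = distinct_keywords(keywords)
    if not distinct:
        return 0.0
    allowed = distinct_keywords(recommended)
    return len(distinct - allowed) / len(distinct)


def looks_stuffed(keywords, recommended, max_keywords=25, max_off_list=0.5):
    """Flag a record for review before indexing.

    Both thresholds are hypothetical defaults, not recommendations:
    a record is flagged if it has too many distinct keywords or if
    more than `max_off_list` of them are off the recommended list.
    """
    too_many = len(distinct_keywords(keywords)) > max_keywords
    mostly_off_list = off_list_ratio(keywords, recommended) > max_off_list
    return too_many or mostly_off_list


if __name__ == "__main__":
    recommended = ["drill", "cordless", "power tools"]
    record = ["Drill", "iphone", "free shipping", "cheap"]
    print(off_list_ratio(record, recommended), looks_stuffed(record, recommended))
```

Flagged records could be held back for manual review rather than rejected outright, since an off-list keyword is not necessarily spam.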
Best,
Erick

On Sat, Apr 23, 2016 at 11:08 AM, GW <thegeofo...@gmail.com> wrote:
> No. My project is retail based. I mean people putting in a slew of
> irrelevant keywords in addition to relevant keywords in an attempt to get
> hits on searches and hits outside of context.
>
> I used a filter factory to remove duplicates.
>
> On 23 April 2016 at 11:30, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:
>
>> By keyword spamming, do you mean stuffing the same term over and over to
>> game term frequency?
>>
>> If so, you might want to try tuning BM25 similarity for your needs. It has a
>> saturation point for term frequency.
>>
>> http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
>>
>> You can also write your own similarity that sets a max for term frequency.
>>
>> I'd also consider figuring out if you can build a PageRank-like measure
>> that can signal content trustworthiness. Spammer sites won't be linked to
>> very heavily by trusted sites.
>>
>> If you just mean spamming like lots of unique keywords, length
>> normalization was built just for this reason: to bias relevance toward less
>> verbose and more specific matches.
>>
>> Hope that helps
>>
>> Doug
>>
>> On Sat, Apr 23, 2016 at 10:02 AM GW <thegeofo...@gmail.com> wrote:
>>
>> > Hey all,
>> >
>> > I'm just finishing up a project and I'm hoping for some direction on
>> > dealing with keyword spamming.
>> >
>> > I don't have any urgent issues. I can foresee some bumps in the road.
>> >
>> > I'm using a custom spider that pulls inventory data from several dozen
>> > sources into a single doc schema. 1 record per item per location.
>> >
>> > Data from several sources have an existing keyword field. Some records
>> > coming in have empty or null data for keywords.
>> >
>> > I concatenated my category and keyword data into the keyword field so I
>> > would not have any empty keyword data to satisfy a query builder.
>> >
>> > I have a recommended keyword list I could use to count hits before I index.
>> > It's a painful thought.
>> >
>> > I want to be able to detect people that are trying to do keyword spamming.
>> >
>> > So my question is: Is there some kind of FM that I'm not aware of?
>> >
>> > Thanks in advance,
>> >
>> > GW
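
For reference, the two effects Doug mentions in the quoted message (term-frequency saturation and length normalization) can both be seen in the textbook BM25 term-frequency component. The sketch below is an illustration in plain Python, not Lucene's actual scoring code: it leaves out the IDF factor, and the function name is made up. The defaults `k1=1.2` and `b=0.75` match the commonly cited BM25 defaults.

```python
def bm25_tf_weight(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency component (IDF factor omitted).

    tf          -- occurrences of the term in the field
    doc_len     -- length of this field in tokens
    avg_doc_len -- average field length across the index
    k1          -- controls how quickly term frequency saturates
    b           -- controls how strongly long fields are penalized
    """
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + length_norm)


if __name__ == "__main__":
    # Repeating one term: the weight climbs toward k1 + 1 but never reaches it.
    for tf in (1, 2, 5, 20, 100):
        print("tf=%3d  weight=%.3f" % (tf, bm25_tf_weight(tf, 100, 100)))

    # Stuffing many unique keywords: the same single match counts for less
    # as the field grows past the average length.
    for doc_len in (50, 100, 400, 1600):
        print("len=%4d  weight=%.3f" % (doc_len, bm25_tf_weight(1, doc_len, 100)))
```

Lowering `k1` makes repeated terms saturate sooner, and raising `b` toward 1 penalizes long keyword fields more heavily, so those are the two knobs to experiment with.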