No, my project is retail-based. By keyword spamming I mean people adding a slew of irrelevant keywords alongside the relevant ones, in an attempt to get hits on searches, including hits outside of context.
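To make that concrete, here is a rough Python sketch of one way such records could be flagged before indexing by comparing each record's keywords with a recommended keyword list. The function names, the sample keywords, and both thresholds are hypothetical and would need tuning against real data. It is an illustration only, not something already in the project.

```python
def normalize(keywords):
    """Lowercase and trim keywords, dropping empties and duplicates."""
    return {k.strip().lower() for k in keywords if k and k.strip()}


def off_list_ratio(keywords, recommended):
    """Fraction of a record's distinct keywords that are NOT on the recommended list."""
    distinct = normalize(keywords)
    if not distinct:
        return 0.0
    return len(distinct - normalize(recommended)) / len(distinct)


def looks_stuffed(keywords, recommended, min_keywords=5, max_off_list=0.6):
    """Flag a record when it has many keywords and most are off-list.

    min_keywords and max_off_list are assumed values, not tested defaults.
    """
    distinct = normalize(keywords)
    return (len(distinct) >= min_keywords
            and off_list_ratio(distinct, recommended) > max_off_list)


# Hypothetical sample data for a power-tools category.
recommended = ["drill", "cordless", "power tools", "18v"]
honest = ["Drill", "cordless", "18V"]
spam = ["drill", "iphone", "free", "cheap", "nike", "laptop"]

print(looks_stuffed(honest, recommended))
print(looks_stuffed(spam, recommended))
```

Flagged records could be held back for review rather than rejected outright, since a legitimate item can also carry keywords the list does not cover.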
I used a filter factory to remove duplicates.

On 23 April 2016 at 11:30, Doug Turnbull <dturnb...@opensourceconnections.com> wrote:

> By keyword spamming, do you mean stuffing the same term over and over to
> game term frequency?
>
> If so, you might want to try tuning BM25 similarity for your needs. It has a
> saturation point for term frequency.
>
> http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
>
> You can also write your own similarity that sets a max for term frequency.
>
> I'd also consider figuring out if you can build a page rank like measure
> that can signal content trustworthiness. Spammer sites won't be linked to
> very heavily by trusted sites.
>
> If you just mean spamming like lots of unique keywords, length
> normalization was built just for this reason: to bias relevance toward less
> verbose and more specific matches.
>
> Hope that helps
>
> Doug
>
> On Sat, Apr 23, 2016 at 10:02 AM GW <thegeofo...@gmail.com> wrote:
>
> > Hey all,
> >
> > I'm just finishing up a project and I'm hoping for some direction on
> > dealing with keyword spamming.
> >
> > I don't have any urgent issues. I can foresee some bumps in the road.
> >
> > I'm using a custom spider that pulls inventory data from several dozen
> > sources into a single doc schema. 1 record per item per location.
> >
> > Data from several sources have an existing keyword field. Some records
> > coming in have empty or null data for keywords.
> >
> > I concatenated my category and keyword data into the keyword field so I
> > would not have any empty keyword data to satisfy a query builder.
> >
> > I have a recommended keyword list I could use to count hits before I index.
> > It's a painful thought.
> >
> > I want to be able to detect people that are trying to do keyword spamming.
> >
> > So my question is: Is there some kind of FM that I'm not aware of?
> >
> > Thanks in advance,
> >
> > GW