This filter is primarily meant to be used together with KeywordRepeatFilterFactory and a stemmer. KeywordRepeatFilterFactory emits each token twice, once marked as a keyword and once not, so that both the stemmed and unstemmed terms get indexed at the same position. When the stemmer leaves a token unchanged, the two copies are identical, and this filter removes the keyword duplicate that is no longer needed.
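As a rough illustration of that chain, here is a toy Python simulation. It is only a sketch, not Lucene's implementation: the `Token` type, the helper functions, and especially `toy_stem` (a crude suffix stripper standing in for a real stemmer) are all hypothetical names made up for this example.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    term: str
    position: int
    keyword: bool = False  # keyword tokens are skipped by the stemmer


def keyword_repeat(tokens: List[Token]) -> List[Token]:
    """Emit each token twice at the same position: keyword copy, then plain copy."""
    out = []
    for t in tokens:
        out.append(Token(t.term, t.position, True))
        out.append(Token(t.term, t.position, False))
    return out


def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def stem_filter(tokens: List[Token]) -> List[Token]:
    """Stem only the tokens that are not marked as keywords."""
    return [t if t.keyword else Token(toy_stem(t.term), t.position) for t in tokens]


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Drop a token only if the same term was already seen at the same position."""
    out, seen, last_pos = [], set(), None
    for t in tokens:
        if t.position != last_pos:
            seen, last_pos = set(), t.position
        if t.term in seen:
            continue
        seen.add(t.term)
        out.append(t)
    return out


def analyze(text: str) -> List[Tuple[str, int]]:
    tokens = [Token(w, i) for i, w in enumerate(text.lower().split())]
    chain = remove_duplicates(stem_filter(keyword_repeat(tokens)))
    return [(t.term, t.position) for t in chain]


print(analyze("jumping fast"))
# [('jumping', 0), ('jump', 0), ('fast', 1)]
```

"jumping" keeps both its original and stemmed forms at position 0, while "fast" is unchanged by the toy stemmer, so its second copy is dropped.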

Whether your app is using the filter for that purpose remains to be seen.

Removing duplicates from the raw input token stream would impact the term frequency.
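To make the term-frequency point concrete, here is a small Python sketch of that hypothetical case, where duplicates are dropped across the whole stream rather than only at the same position. The helper names `term_frequencies` and `dedupe_stream` are invented for illustration and are not Solr or Lucene APIs.

```python
from collections import Counter
from typing import Dict, List


def term_frequencies(tokens: List[str]) -> Dict[str, int]:
    """Count how often each term occurs in a field's token stream."""
    return dict(Counter(tokens))


def dedupe_stream(tokens: List[str]) -> List[str]:
    """Hypothetical: drop every repeated term regardless of position."""
    return list(dict.fromkeys(tokens))


tokens = "solr solr solr lucene".split()

print(term_frequencies(tokens))
# {'solr': 3, 'lucene': 1}
print(term_frequencies(dedupe_stream(tokens)))
# {'solr': 1, 'lucene': 1}
```

With the repeats gone, a document that mentions "solr" three times looks the same to frequency-based scoring as one that mentions it once.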

-- Jack Krupansky

-----Original Message-----
From: Dotan Cohen
Sent: Friday, May 24, 2013 3:03 AM
To: solr-user@lucene.apache.org
Subject: Why would one not use RemoveDuplicatesTokenFilterFactory?

I am looking through the schema of a Solr installation that I
inherited last year. The original dev, who is unavailable for comment,
has two types of text fields: one with
RemoveDuplicatesTokenFilterFactory and one without. These fields are
intended for full-text search.

Why would someone _not_ use RemoveDuplicatesTokenFilterFactory on a
field intended for full-text search? What are the drawbacks to using
it? This application is very, very write heavy (hundreds of writes per
minute) if that matters. It was running on websolr.com at the time;
I've now moved it to Amazon Web Services.

Thanks.

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
