My (very limited) understanding of "boilerpipe" in Tika is that it strips out "short text", which works well for menu and navigation text. The typical disclaimer at the bottom of an email, however, is not very short and is frequently longer than the message body itself. You may have to resort to a custom update processor that is programmed with known disclaimer signature strings and removes them from field values.
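As a rough illustration of that idea, here is a minimal sketch of the string-matching core such a processor might call. The class name DisclaimerStripper and the marker phrases are hypothetical, not part of Solr or Tika, and the sketch assumes the disclaimer always runs from a known phrase to the end of the text.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Hypothetical helper: truncates a field value at the first known
 * disclaimer marker phrase. Assumes the disclaimer is a trailing block.
 */
public class DisclaimerStripper {
    private final Pattern markers;

    public DisclaimerStripper(List<String> markerPhrases) {
        if (markerPhrases.isEmpty()) {
            // An empty alternation would match at position 0 and erase everything.
            throw new IllegalArgumentException("at least one marker phrase is required");
        }
        // Quote each phrase so it is matched literally, then OR them together.
        String alternation = markerPhrases.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.markers = Pattern.compile(alternation,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    /** Returns the text before the earliest marker, or the input unchanged if none is found. */
    public String strip(String fieldValue) {
        Matcher m = markers.matcher(fieldValue);
        return m.find() ? fieldValue.substring(0, m.start()).trim() : fieldValue;
    }
}
```

In Solr, something like this could presumably be invoked from the processAdd method of a custom UpdateRequestProcessor, rewriting the relevant field of the incoming document before it is passed down the chain. The marker list would have to be collected from the disclaimers that actually occur in your mail corpus.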
-- Jack Krupansky
-----Original Message----- 
From: Mark , N
Sent: Tuesday, June 05, 2012 8:28 AM
To: solr-user@lucene.apache.org
Subject: filtering number and repeated contents

Is it possible to filter out numbers and disclaimers (repeated content)
while indexing to Solr?
This is all surplus information, and I do not want to index it.

I have also tried the boilerpipe algorithm to remove surplus
information from web pages, such as navigational elements, templates, and
advertisements. I think it works well, but I would like to know whether I
could also filter out "disclaimer" information, mainly in email texts.
--
Thanks,

*Nipen Mark *
