My (very limited) understanding of "boilerpipe" in Tika is that it strips
out "short text", which works well for menu and navigation text. The
typical disclaimer at the bottom of an email, however, is not very short and
is frequently longer than the message body itself, so boilerpipe is unlikely
to remove it. You may have to resort to a custom update processor that is
configured with the known disclaimer text strings and removes them from
field values.
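
As a rough illustration of the string-matching part of that idea, here is a
minimal sketch in plain Java with no Solr dependency. The class name
DisclaimerStripper, the marker phrase in the usage note, and the choice to cut
everything from the first marker onward are all assumptions made for the
example. It presumes the disclaimer sits at the very end of the message.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Hypothetical helper (not part of Solr or Tika): truncates a field value
 * at the first occurrence of any known disclaimer marker phrase.
 */
final class DisclaimerStripper {
    // Null when no usable markers were supplied, so strip() becomes a no-op.
    private final Pattern markerPattern;

    DisclaimerStripper(List<String> markers) {
        // Quote each marker so it is matched literally, then join as alternatives.
        String alternation = markers.stream()
                .filter(m -> m != null && !m.isEmpty())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.markerPattern = alternation.isEmpty()
                ? null
                : Pattern.compile(alternation,
                        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    /** Returns the text before the earliest marker, or the input unchanged if none is found. */
    String strip(String text) {
        if (text == null || markerPattern == null) {
            return text;
        }
        Matcher matcher = markerPattern.matcher(text);
        // find() locates the leftmost match, i.e. the earliest marker in the text.
        return matcher.find() ? text.substring(0, matcher.start()).trim() : text;
    }
}
```

With a marker such as "This e-mail and any attachments are confidential"
(an invented phrase), strip() would return only the text before it. In Solr
this logic would presumably be called from processAdd() in a custom
UpdateRequestProcessor, applied to the relevant field values before the
document is passed down the chain.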
-- Jack Krupansky
-----Original Message-----
From: Mark , N
Sent: Tuesday, June 05, 2012 8:28 AM
To: solr-user@lucene.apache.org
Subject: filtering number and repeated contents
Is it possible to filter out numbers and disclaimers (repeated content)
while indexing to Solr?
This is all surplus information and I do not want to index it.
I have tried using the boilerpipe algorithm to remove surplus
information from web pages, such as navigational elements, templates, and
advertisements. I think it works well, but I would like to know whether I
can also filter out "disclaimer" information, mainly in email text.
--
Thanks,
*Nipen Mark *