Great topic!
1) SignatureUpdateProcessor creates a hash of the exact byte stream of
the document. Often your crawling software can't do an incremental
update of your data, but can only re-index the entire corpus. The SUP
makes the hash, searches for it, and if it is there the document
indexer says
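
To make the hash-then-look-up idea concrete, here is a minimal sketch in Python. It is not Solr code or configuration: `signature` and `ToyIndex` are hypothetical names, the in-memory dict stands in for the index, SHA-256 is an arbitrary choice of hash, and skipping a document whose signature already exists is an assumption about what happens on a match.

```python
import hashlib


def signature(doc_bytes: bytes) -> str:
    """Hash the exact byte stream; changing any byte changes the signature."""
    return hashlib.sha256(doc_bytes).hexdigest()


class ToyIndex:
    """Hypothetical stand-in for an index keyed by signature (not Solr's API)."""

    def __init__(self):
        self.by_signature = {}  # signature -> id of the first document seen

    def add(self, doc_id: str, doc_bytes: bytes) -> bool:
        """Index the document unless an identical byte stream is already there.

        Returns True if the document was indexed, False if it was skipped.
        """
        sig = signature(doc_bytes)
        if sig in self.by_signature:
            return False  # assumed behaviour: identical content is skipped
        self.by_signature[sig] = doc_id
        return True


if __name__ == "__main__":
    index = ToyIndex()
    print(index.add("doc-1", b"same text"))   # first copy is indexed
    print(index.add("doc-2", b"same text"))   # byte-identical copy is skipped
    print(index.add("doc-3", b"same text "))  # one extra byte, so a new signature
```

Because the hash covers the whole byte stream, this only catches documents that are identical end to end. It does not find a block of text shared by two otherwise different documents.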
I am looking for a way to identify blocks of text that occur in several
documents in a corpus for a research project with electronic medical records.
They can be copied and pasted sections inserted into another document, text
from a previous email in the corpus that is repeated in a follow-up em