Re: Identifying common text in documents

2011-12-24 Thread Lance Norskog
Great topic! 1) SignatureUpdateProcessor creates a hash of the exact byte stream of the document. Often your crawling software can't do an incremental update of your data and can only re-index the entire corpus. The SUP makes the hash, searches for it, and if it is there the document indexer says
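
A minimal sketch of the hash-and-look-up idea described in the reply above, not Solr's actual implementation. The class name SignatureIndex, the choice of MD5, and the in-memory dictionary are all assumptions made for illustration. What the indexer does on a match is cut off in the snippet, so this sketch only reports whether the signature was already present.

```python
import hashlib


class SignatureIndex:
    """Hypothetical in-memory stand-in for an index keyed by document signature."""

    def __init__(self):
        # Maps signature (hex digest) -> id of the first document seen with it.
        self._seen = {}

    def signature(self, raw: bytes) -> str:
        # Hash the exact byte stream; any change in the bytes changes the digest.
        # MD5 is an assumption here, chosen only because it is in the stdlib.
        return hashlib.md5(raw).hexdigest()

    def add(self, doc_id: str, raw: bytes) -> bool:
        """Return True if the document is new, False if its signature already exists."""
        sig = self.signature(raw)
        if sig in self._seen:
            return False
        self._seen[sig] = doc_id
        return True


index = SignatureIndex()
first = index.add("doc-1", b"Patient presents with a cough.")
repeat = index.add("doc-2", b"Patient presents with a cough.")
changed = index.add("doc-3", b"Patient presents with a cough. ")
```

Because the signature covers the exact bytes, the third document (one trailing space added) counts as new. An exact hash like this catches whole-document repeats during a full re-index, but it does not find shared blocks of text inside otherwise different documents.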

Identifying common text in documents

2011-12-24 Thread Mike O'Leary
I am looking for a way to identify blocks of text that occur in several documents in a corpus, for a research project with electronic medical records. They can be copied-and-pasted sections inserted into another document, text from a previous email in the corpus that is repeated in a follow-up em