I am looking for a way to identify blocks of text that occur in several documents in a corpus, for a research project with electronic medical records. These blocks can be sections copied and pasted into another document, text from an earlier email in the corpus that is repeated in a follow-up email, text templates inserted into groups of documents, or the same template occurring more than once in a single document. Any of these duplicated blocks may differ slightly from one instance to another.
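As an illustration of the kind of block-level matching described above, here is a minimal sketch in plain Python (not Solr). It assumes that a "block" is a paragraph separated by blank lines, and it compares blocks by the Jaccard overlap of their word 3-shingles. The function names, the shingle size, and the 0.5 threshold are all assumptions chosen for the example, not part of any Solr or Lucene API.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_blocks(doc_a, doc_b, k=3, threshold=0.5):
    """Compare every paragraph of doc_a with every paragraph of doc_b.

    Returns (index_in_a, index_in_b, score) for each pair whose shingle
    overlap meets the threshold. Paragraphs are split on blank lines,
    which is an assumption made for this sketch.
    """
    hits = []
    for i, block_a in enumerate(doc_a.split("\n\n")):
        for j, block_b in enumerate(doc_b.split("\n\n")):
            score = jaccard(shingles(block_a, k), shingles(block_b, k))
            if score >= threshold:
                hits.append((i, j, score))
    return hits


if __name__ == "__main__":
    note_1 = ("Patient seen today for follow up.\n\n"
              "No known drug allergies. Patient denies chest pain "
              "or shortness of breath.")
    note_2 = ("No known drug allergies. Patient denies chest pain "
              "and shortness of breath.\n\n"
              "Plan: return in two weeks.")
    for i, j, score in similar_blocks(note_1, note_2):
        print(f"block {i} of note_1 ~ block {j} of note_2 (score {score:.2f})")
```

The all-pairs comparison is quadratic in the number of blocks, so it only conveys the idea; on a real corpus the shingles would have to be indexed or hashed so that candidate pairs can be looked up rather than enumerated.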
I read in a document called "What's new in Solr 1.4" that Solr has supported duplicate text detection since 1.4 through the SignatureUpdateProcessor and TextProfileSignature classes. Can these be used to detect portions of documents that are alike or nearly alike, or are they intended only to detect entire documents that are alike or nearly alike? Has any additional support for duplicate detection been added to Solr since 1.4?

It seems that some features of Solr and Lucene, such as term positions and shingling, could help in finding sections of matching or nearly matching text in documents. Does anyone have experience in this area that they would be willing to share?

Thanks,
Mike