Identifying common text in documents

Mike O'Leary Sat, 24 Dec 2011 12:44:32 -0800

For a research project with electronic medical records, I am looking for a way 
to identify blocks of text that occur in several documents in a corpus. 
They can be copied and pasted sections inserted into another document, text 
from a previous email in the corpus that is repeated in a follow-up email, text 
templates that get inserted into groups of documents, and occurrences of the 
same template more than once in the same document. Any of these duplicated text 
blocks may contain minor differences from one instance to another.


I read in a document called "What's new in Solr 1.4" that Solr has supported 
duplicate text detection since 1.4 through the SignatureUpdateProcessor and 
TextProfileSignature classes. Can these be used to 
detect portions of documents that are alike or nearly alike, or are they 
intended to detect entire documents that are alike or nearly alike? Has 
additional support for duplicate detection been added to Solr since 1.4? It 
seems like some of the features of Solr and Lucene such as term positions and 
shingling could help in finding sections of matching or nearly matching text in 
documents. Does anyone have any experience in this area that they would be 
willing to share?
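
To make the shingling idea concrete, here is a minimal sketch in plain Python 
of what I have in mind. It is not Solr or Lucene code, and the function names 
and the shingle size k=5 are my own assumptions for illustration. It indexes 
every run of k consecutive words with its position, so a block repeated across 
documents, or twice within one document, shows up as a shingle with more than 
one location.

```python
import re
from collections import defaultdict


def shingles(text, k=5):
    """Yield (position, shingle) pairs, where a shingle is k consecutive
    lowercased word tokens and position is the index of its first word."""
    words = re.findall(r"\w+", text.lower())
    for i in range(len(words) - k + 1):
        yield i, " ".join(words[i:i + k])


def shared_shingles(docs, k=5):
    """Map each shingle to the (doc_id, position) pairs where it occurs,
    keeping only shingles that occur more than once in the corpus
    (including repeats inside a single document)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, sh in shingles(text, k):
            index[sh].append((doc_id, pos))
    return {sh: locs for sh, locs in index.items() if len(locs) > 1}


def jaccard(a, b, k=5):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa = {s for _, s in shingles(a, k)}
    sb = {s for _, s in shingles(b, k)}
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Runs of shared shingles at consecutive positions would mark a copied block. A 
small edit inside a block only breaks the shingles that overlap the edit, so 
the rest of the block should still match, but I do not know how well this 
holds up on a large corpus.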
Thanks,
Mike
