Hello, our current goal is to find a solution for a translation company. Their issue is that they often have to translate documents containing parts that were copied and pasted from another document that was already translated, so they end up doing the same work more than once.
I am a newcomer to Lucene/Solr, so I took the Solr tutorial (kudos to whoever contributed it, very good) and did some reading of Lucene in Action and the Solr 3.1 cookbook. My understanding of the solution is as follows, and I'd appreciate criticism of whatever is wrong or missing, or alternative solutions; thanks in advance. At the end I ask about the potential issues I see right now.

- Add to schema.xml a field for the contents of the document; this must be stored and use termVectors (thanks, oh thou cookbook).
- Import the documents with Solr Cell, taking care to route the document contents (which may arrive in different fields depending on the import tool used by Tika) to the stored termVector field.
- Store each document with a unique id (this is mandatory, as the document is associated with an id in the main system).
- Run "more like this" searches on that unique id via the URL commands.

A rough, untested sketch of these pieces is in the P.S. below.

My concerns about possible issues are:

- Performance: will this work with thousands of documents ranging from one page up to hundreds of pages?
- Correctness: if 10 out of 50 pages are copy & paste, will we get at least 20% similarity? Will that score higher than for documents that merely contain the same words in different positions?

Please note that in the subject I said "similar or copied sections" rather than some other name such as chapters, which would require understanding of the document structure. The sources of the documents are very diverse and there is no easy way to find out any sort of structure.

Thanks for reading, thanks for replying.

Domènec Sos i Vallès
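P.S. In case it helps the discussion, here is a minimal, untested sketch of the pieces I have in mind. The field name doc_content, the id doc123, the file name and the handler paths are just placeholders on top of the stock Solr 3.1 example configuration, so please correct me where my assumptions are off.

The stored field with term vectors in schema.xml:

    <!-- "text" stands for whatever analyzed text field type the schema provides -->
    <field name="doc_content" type="text" indexed="true" stored="true"
           termVectors="true"/>

Importing a document through Solr Cell, mapping Tika's extracted body to that field, prefixing any other extracted metadata with attr_ (assuming the schema has a matching attr_* dynamic field, as the example schema does) and supplying the id from the main system (this assumes the unique key field is called id):

    curl "http://localhost:8983/solr/update/extract?literal.id=doc123&fmap.content=doc_content&uprefix=attr_&commit=true" \
         -F "myfile=@contract.pdf"

A MoreLikeThis handler registered in solrconfig.xml, if it is not there already:

    <requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>

And the query that should return documents similar to an already indexed one, looked up by its id:

    http://localhost:8983/solr/mlt?q=id:doc123&mlt.fl=doc_content&mlt.mintf=1&mlt.mindf=1&mlt.interestingTerms=details

mlt.interestingTerms=details is only there so I can see which terms drive the similarity while testing.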