Hello, our current goal is to find a solution for a translation company. Their issue is that they often have to translate documents containing parts that were copied and pasted from another document that was already translated, so they end up doing the same work more than once.
I am a newcomer to Lucene/Solr, so I took the Solr tutorial (kudos to whoever contributed it, very good) and did some reading of Lucene in Action and the Solr 3.1 cookbook. My understanding of the solution is as follows, and I'd appreciate criticism of whatever is wrong or missing, or alternative solutions; thanks in advance. At the end I ask about the potential issues I see right now.

- Add to schema.xml a field for the contents of the document; this must be stored and use termVectors (thanks, oh thou cookbook).
- Import the documents with Solr Cell, taking care to route the document contents (which may arrive in different fields depending on the import tool used by Tika) to the stored termVector field.
- Store each document with a unique id (this is mandatory, as the document is associated with an id in the main system).
- Run "more like this" searches on that unique id via the URL commands.

A rough, untested sketch of these pieces is in the P.S. below.

My concerns about possible issues are:

- Performance: will this work with thousands of documents ranging from one page up to hundreds of pages?
- Correctness: if 10 out of 50 pages are copy & paste, will we get at least 20% similarity? Will that score higher than for documents that merely contain the same words in different positions?

Please note that in the subject I said "similar or copied sections" rather than some other name such as chapters, which would require understanding of the document structure. The sources of the documents are very diverse and there is no easy way to find out any sort of structure.

Thanks for reading, thanks for replying.

Domènec Sos i Vallès
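P.S. In case it helps the discussion, here is a minimal, untested sketch of the pieces I have in mind. The field name doc_content, the id doc123, the file name and the handler paths are just placeholders on top of the stock Solr 3.1 example configuration, so please correct me where my assumptions are off.

The stored field with term vectors in schema.xml:

    <!-- "text" stands for whatever analyzed text field type the schema provides -->
    <field name="doc_content" type="text" indexed="true" stored="true"
           termVectors="true"/>

Importing a document through Solr Cell, mapping Tika's extracted body to that field, prefixing any other extracted metadata with attr_ (assuming the schema has a matching attr_* dynamic field, as the example schema does) and supplying the id from the main system (this assumes the unique key field is called id):

    curl "http://localhost:8983/solr/update/extract?literal.id=doc123&fmap.content=doc_content&uprefix=attr_&commit=true" \
         -F "myfile=@contract.pdf"

A MoreLikeThis handler registered in solrconfig.xml, if it is not there already:

    <requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>

And the query that should return documents similar to an already indexed one, looked up by its id:

    http://localhost:8983/solr/mlt?q=id:doc123&mlt.fl=doc_content&mlt.mintf=1&mlt.mindf=1&mlt.interestingTerms=details

mlt.interestingTerms=details is only there so I can see which terms drive the similarity while testing.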