Hi: I have the following use case. I have 5 large XML dumps from 5 different databases. All the XML files have the same structure, but there are duplicate entries that appear in different files under different IDs (each doc is a Solr doc; 1 db entry = 1 Solr doc). How can I remove duplicates where 3 or more conditions must match, while exactly 1 of the duplicate documents must NOT be deleted? Example:
file1.xml
=======
<add>
  <doc>
    <field name="id">121</field>
    <field name="name">Acme inc</field>
    <field name="contact">[EMAIL PROTECTED]</field>
  </doc>
  <doc>
    <field name="id">122</field>
    <field name="name">ABC inc</field>
    <field name="contact">[EMAIL PROTECTED]</field>
  </doc>
  <doc>
    <field name="id">123</field>
    <field name="name">XYZ inc</field>
    <field name="contact">[EMAIL PROTECTED]</field>
  </doc>
</add>

file2.xml
======
<add>
  <doc>
    <field name="id">221</field>
    <field name="name">Acme inc</field>
    <field name="contact">[EMAIL PROTECTED]</field>
  </doc>
  <doc>
    <field name="id">222</field>
    <field name="name">BBC inc</field>
    <field name="contact">[EMAIL PROTECTED]</field>
  </doc>
  <doc>
    <field name="id">223</field>
    <field name="name">CNN inc</field>
    <field name="contact">[EMAIL PROTECTED]</field>
  </doc>
</add>

file3.xml
======
<add>
  <doc>
    <field name="id">321</field>
    <field name="name">NBC inc</field>
    <field name="contact">[EMAIL PROTECTED]</field>
  </doc>
  <doc>
    <field name="id">322</field>
    <field name="name">ABC inc</field>
    <field name="contact">[EMAIL PROTECTED]</field>
  </doc>
  <doc>
    <field name="id">323</field>
    <field name="name">BBC inc</field>
    <field name="contact">[EMAIL PROTECTED]</field>
  </doc>
</add>

I also have a field called last modified, and that field determines which of the duplicate records gets to be the one in the Solr index. These files are huge, and I need an automated cleanup on a weekly basis. Yes, I could clean up the files before handing them over to Solr, but I thought there must be some way to do it without writing custom modifications. Any tips/tricks are very much appreciated.

Regards
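In case a pre-indexing cleanup script turns out to be the way to go, here is a minimal sketch of the dedup rule described above. It assumes the duplicate key is `name` + `contact` (swap in whichever 3+ fields you actually match on) and that the last-modified field is named `lastmodified` and sorts lexicographically (e.g. ISO-8601 dates); none of those names appear in the samples, so they are guesses:

```python
import xml.etree.ElementTree as ET

def dedupe(sources):
    """Keep exactly one <doc> per duplicate key, preferring the newest
    lastmodified value. `sources` is an iterable of file paths or
    file-like objects, each containing one <add> root."""
    best = {}  # key -> (lastmodified, doc element)
    for src in sources:
        root = ET.parse(src).getroot()
        for doc in root.findall("doc"):
            fields = {f.get("name"): (f.text or "") for f in doc.findall("field")}
            # Hypothetical match key: extend to your 3-or-more fields.
            key = (fields.get("name"), fields.get("contact"))
            # Assumes lastmodified compares correctly as a string.
            stamp = fields.get("lastmodified", "")
            if key not in best or stamp > best[key][0]:
                best[key] = (stamp, doc)
    # Reassemble a single <add> holding only the surviving docs,
    # so at least one copy of every entry is always kept.
    merged = ET.Element("add")
    for _, doc in best.values():
        merged.append(doc)
    return merged
```

The merged tree can then be written out with `ET.ElementTree(merged).write("merged.xml")` and posted to Solr; a weekly cron job around this would cover the scheduling part.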