: I have a field called last modified and that field will determine
: which record will get to be the one in Solr index.
:
: These files are huge and I need an automatic way of cleaning them up
: on a weekly basis. Yes, I can clean up the files before handing them
: over to Solr, but I thought there must be some way to do it without
: writing custom modifications.

First off: keep in mind that you don't *need* to create files on disk ...
you said all of this data was being dumped from a database ... instead of
dumping to files, stream the data out of your DB and send it to Solr
directly -- now your "deduping" problem becomes something you can manage
in your database (or in the code streaming the data out of the DB).
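a minimal sketch of what that streaming could look like, assuming a
hypothetical table/columns (docs, id, title, last_modified) and the stock
Solr JSON update handler URL -- adjust all of those to your setup:

```python
# Sketch: stream rows straight from the DB into Solr, no files on disk.
# Table name, column names, and the Solr URL are placeholder assumptions.
import json
import sqlite3
import urllib.request


def rows_to_solr_docs(conn):
    """Yield Solr-ready dicts directly from a DB cursor."""
    cur = conn.execute(
        "SELECT id, title, last_modified FROM docs ORDER BY last_modified"
    )
    for id_, title, last_modified in cur:
        yield {"id": id_, "title": title, "last_modified": last_modified}


def post_to_solr(docs, url="http://localhost:8983/solr/update?commit=true"):
    """POST a batch of docs to Solr's JSON update handler."""
    body = json.dumps(list(docs)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

batching a few thousand docs per POST (rather than one request per row)
keeps the overhead down on big dumps.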

that said, if you'd still like Solr to handle the "deduping" for you, then
the 3 fields that define a duplicate record need to be combined into a
single field which you define as your uniqueKey ... you could concat them
if they are simple enough, or you could use something like an md5sum ...
if the docs are added in order of lastModifiedDate, Solr will ensure the
most recent one "sticks"
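the md5sum idea could look something like this (field names are made up,
and the separator byte is just there so that different field splits of the
same concatenated text can't collide):

```python
# Sketch: collapse the three "duplicate-defining" fields into one
# uniqueKey value by hashing them.  Add docs in last-modified order and
# Solr overwrites older docs that share this key with newer ones.
import hashlib


def dedup_key(field_a, field_b, field_c):
    """md5 of the three fields, joined with a separator so that
    ('ab', 'c') and ('a', 'bc') don't produce the same key."""
    raw = "\x1f".join((field_a, field_b, field_c))
    return hashlib.md5(raw.encode("utf-8")).hexdigest()
```

you'd compute this in the code streaming the data out and put the result
in the field your schema declares as uniqueKey.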


...this doesn't help you clean up the files on disk though, but like i
said: you don't need the files to be on disk.


-Hoss
