Chris: Thanks for the reply.
On 11/27/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: I have a field called last modified and that field will determine
: which record will get to be the one in the Solr index.
:
: These files are huge and I need an automatic way of cleanup on a
: weekly basis. Yes, I can clean up the files before handing over to
: Solr, but I thought there must be some way to do it without writing
: custom modifications.

First off: keep in mind that you don't *need* to create files on disk ... you said all of this data was being dumped from a database ... instead of dumping to files, stream the data out of your DB and send it to Solr directly.
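A rough sketch of that direct-streaming idea, for illustration only -- the table name vendor_docs, the column names, the JDBC URL, and the Solr update URL below are all assumptions, not anything from this thread:

    import java.io.OutputStreamWriter;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.sql.*;

    // Sketch: read rows straight out of the DB and post them to Solr's
    // XML update handler, skipping the intermediate files entirely.
    public class StreamToSolr {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/vendors", "user", "pass"); // hypothetical DSN
            Statement stmt = conn.createStatement();
            // order by last_modified so newer versions of a record are
            // sent last (relevant once a uniqueKey-based dedupe is in place)
            ResultSet rs = stmt.executeQuery(
                "SELECT id, title, body FROM vendor_docs ORDER BY last_modified ASC");
            StringBuilder xml = new StringBuilder("<add>");
            while (rs.next()) {
                xml.append("<doc>")
                   .append(field("id", rs.getString("id")))
                   .append(field("title", rs.getString("title")))
                   .append(field("body", rs.getString("body")))
                   .append("</doc>");
            }
            xml.append("</add>");
            post("http://localhost:8983/solr/update", xml.toString());
            post("http://localhost:8983/solr/update", "<commit/>");
            conn.close();
        }

        static String field(String name, String value) {
            // escape the XML-significant characters before embedding the value
            return "<field name=\"" + name + "\">"
                 + value.replace("&", "&amp;").replace("<", "&lt;") + "</field>";
        }

        static void post(String url, String body) throws Exception {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            OutputStreamWriter w = new OutputStreamWriter(con.getOutputStream(), "UTF-8");
            w.write(body);
            w.close();
            con.getResponseCode(); // fire the request; a real version would check for 200
        }
    }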
Eventually we will be able to do that. Currently this data is coming from external parties (four different external vendors), so I have no control over it. We have discussed the approach you point out, e.g. running a web service out from the vendor DB to Solr, but security is an issue from various corporate standpoints.

-- now your "deduping" problem becomes something you can manage in your
database (or in the code streaming the data out of the DB).
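For illustration, the in-database flavor might look like the query below: keep only the newest row per duplicate-defining group before anything is sent to Solr. The table name vendor_docs and the columns f1/f2/f3/last_modified are placeholders:

    import java.sql.*;

    // Sketch: dedupe inside the database itself.  The inner query finds the
    // newest last_modified per (f1, f2, f3) group; the join keeps only those
    // rows, so each duplicate group contributes exactly one (latest) record.
    public class DedupeInDb {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/vendors", "user", "pass"); // hypothetical DSN
            ResultSet rs = conn.createStatement().executeQuery(
                "SELECT t.id, t.last_modified FROM vendor_docs t "
              + "JOIN (SELECT f1, f2, f3, MAX(last_modified) AS newest "
              + "        FROM vendor_docs GROUP BY f1, f2, f3) latest "
              + "  ON t.f1 = latest.f1 AND t.f2 = latest.f2 "
              + " AND t.f3 = latest.f3 AND t.last_modified = latest.newest");
            while (rs.next()) {
                // each row is the single most recent version of a record;
                // these are the documents to hand to Solr
                System.out.println(rs.getString("id") + "  " + rs.getTimestamp("last_modified"));
            }
            conn.close();
        }
    }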
Same as above. I don't know what's in the database (i.e. the content) in advance.
that said, if you'd still like Solr to handle the "deduping" for you, then the 3 fields that define a duplicate record need to be combined into a single field which you define as your uniqueKey ... you could concat them if they are simple enough, or you could use something like an md5sum ... if the docs are added in order of lastModifiedDate, Solr will ensure the most recent one "sticks"
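A sketch of that combined-key idea using md5 -- the field names and the dedupe_id name are placeholders; in schema.xml the field would be declared and referenced as <uniqueKey>dedupe_id</uniqueKey>:

    import java.math.BigInteger;
    import java.security.MessageDigest;

    // Sketch: collapse the 3 duplicate-defining fields into one fixed-length
    // key.  Identical field values always hash to the same uniqueKey, so
    // re-adding a document replaces the older copy that carried that key.
    public class DedupeKey {
        public static String dedupeId(String f1, String f2, String f3) throws Exception {
            // a separator keeps ("ab", "c") from colliding with ("a", "bc");
            // assumes the raw values never contain "|"
            String combined = f1 + "|" + f2 + "|" + f3;
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(combined.getBytes("UTF-8"));
            return String.format("%032x", new BigInteger(1, digest)); // 32 hex chars
        }

        public static void main(String[] args) throws Exception {
            // add docs in ascending lastModifiedDate order with this as the
            // uniqueKey field value, and the most recent one "sticks"
            System.out.println(dedupeId("acme", "widget", "en"));
        }
    }

Plain concatenation works just as well when the raw values are short and can't contain the separator; hashing just keeps the key a predictable size.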
Thanks. This is what I wanted to know. Excellent. I will give this a try. Cheers.