Chris: Thanks for the reply.
On 11/27/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: I have a field called last modified and that field will determine
: which record will get to be the one in the Solr index.
:
: These files are huge and I need an automatic way of cleanup on a
: weekly basis. Yes, I can clean up the files before handing over to
: Solr, but I thought there must be some way to do it without writing
: custom modifications.

First off: keep in mind that you don't *need* to create files on disk ... you said all of this data was being dumped from a database ... instead of dumping to files, stream the data out of your DB and send it to Solr directly.
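A rough sketch of that direct-streaming idea, for illustration only -- the table name vendor_docs, the column names, the JDBC URL, and the Solr update URL below are all assumptions, not anything from this thread:

    import java.io.OutputStreamWriter;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.sql.*;

    // Sketch: read rows straight out of the DB and post them to Solr's
    // XML update handler, skipping the intermediate files entirely.
    public class StreamToSolr {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/vendors", "user", "pass"); // hypothetical DSN
            Statement stmt = conn.createStatement();
            // order by last_modified so newer versions of a record are
            // sent last (relevant once a uniqueKey-based dedupe is in place)
            ResultSet rs = stmt.executeQuery(
                "SELECT id, title, body FROM vendor_docs ORDER BY last_modified ASC");
            StringBuilder xml = new StringBuilder("<add>");
            while (rs.next()) {
                xml.append("<doc>")
                   .append(field("id", rs.getString("id")))
                   .append(field("title", rs.getString("title")))
                   .append(field("body", rs.getString("body")))
                   .append("</doc>");
            }
            xml.append("</add>");
            post("http://localhost:8983/solr/update", xml.toString());
            post("http://localhost:8983/solr/update", "<commit/>");
            conn.close();
        }

        static String field(String name, String value) {
            // escape the XML-significant characters before embedding the value
            return "<field name=\"" + name + "\">"
                 + value.replace("&", "&amp;").replace("<", "&lt;") + "</field>";
        }

        static void post(String url, String body) throws Exception {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            OutputStreamWriter w = new OutputStreamWriter(con.getOutputStream(), "UTF-8");
            w.write(body);
            w.close();
            con.getResponseCode(); // fire the request; a real version would check for 200
        }
    }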
Eventually we will be able to do that. Currently this data is coming from external parties (four different external vendors), so I have no control over it. We have discussed the approach you point out, e.g. running a web service out from the vendor DB to Solr, but security is an issue from various corporate standpoints.

-- now your "deduping" problem becomes something you can manage in your
database (or in the code streaming the data out of the DB).
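For illustration, the in-database flavor might look like the query below: keep only the newest row per duplicate-defining group before anything is sent to Solr. The table name vendor_docs and the columns f1/f2/f3/last_modified are placeholders:

    import java.sql.*;

    // Sketch: dedupe inside the database itself.  The inner query finds the
    // newest last_modified per (f1, f2, f3) group; the join keeps only those
    // rows, so each duplicate group contributes exactly one (latest) record.
    public class DedupeInDb {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/vendors", "user", "pass"); // hypothetical DSN
            ResultSet rs = conn.createStatement().executeQuery(
                "SELECT t.id, t.last_modified FROM vendor_docs t "
              + "JOIN (SELECT f1, f2, f3, MAX(last_modified) AS newest "
              + "        FROM vendor_docs GROUP BY f1, f2, f3) latest "
              + "  ON t.f1 = latest.f1 AND t.f2 = latest.f2 "
              + " AND t.f3 = latest.f3 AND t.last_modified = latest.newest");
            while (rs.next()) {
                // each row is the single most recent version of a record;
                // these are the documents to hand to Solr
                System.out.println(rs.getString("id") + "  " + rs.getTimestamp("last_modified"));
            }
            conn.close();
        }
    }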
Same as above. I don't know what's in the database (i.e. the content) in advance.
that said, if you'd still like Solr to handle the "deduping" for you, then the 3 fields that define a duplicate record need to be combined into a single field which you define as your uniqueKey ... you could concat them if they are simple enough, or you could use something like an md5sum ... if the docs are added in order of lastModifiedDate, Solr will ensure the most recent one "sticks"
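A sketch of that combined-key idea using md5 -- the field names and the dedupe_id name are placeholders; in schema.xml the field would be declared and referenced as <uniqueKey>dedupe_id</uniqueKey>:

    import java.math.BigInteger;
    import java.security.MessageDigest;

    // Sketch: collapse the 3 duplicate-defining fields into one fixed-length
    // key.  Identical field values always hash to the same uniqueKey, so
    // re-adding a document replaces the older copy that carried that key.
    public class DedupeKey {
        public static String dedupeId(String f1, String f2, String f3) throws Exception {
            // a separator keeps ("ab", "c") from colliding with ("a", "bc");
            // assumes the raw values never contain "|"
            String combined = f1 + "|" + f2 + "|" + f3;
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(combined.getBytes("UTF-8"));
            return String.format("%032x", new BigInteger(1, digest)); // 32 hex chars
        }

        public static void main(String[] args) throws Exception {
            // add docs in ascending lastModifiedDate order with this as the
            // uniqueKey field value, and the most recent one "sticks"
            System.out.println(dedupeId("acme", "widget", "en"));
        }
    }

Plain concatenation works just as well when the raw values are short and can't contain the separator; hashing just keeps the key a predictable size.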
Thanks. This is what I wanted to know. Excellent. I will give this a try. Cheers.