Hi,

Currently we import data sets from various sources (CSV, XML, JSON, etc.), pre-process them into a consistent format, apply some other transformations, and POST the result to Solr.

We currently dump the documents out to a JSON file in batches of 1,000 and POST that file to Solr.
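For context, the batching step is roughly along these lines (a simplified Python sketch; the Solr URL, collection name, and commitWithin value are just placeholders):

import json
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"  # placeholder host/collection
BATCH_SIZE = 1000

def post_batch(docs):
    # POST one batch of documents to Solr's JSON update endpoint.
    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"commitWithin": 60000},  # let Solr commit within 60s rather than per batch
        headers={"Content-Type": "application/json"},
        data=json.dumps(docs),
    )
    resp.raise_for_status()

def index_all(doc_iter):
    # Accumulate documents into batches of 1,000 and send each full batch.
    batch = []
    for doc in doc_iter:
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            post_batch(batch)
            batch = []
    if batch:
        post_batch(batch)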

Roughly 50m documents come in throughout the day and are fully re-indexed. After the update calls, we then delete any documents related to that run whose last_seen datetime is earlier than the most recent run, so stale documents are removed.
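That clean-up step amounts to a delete-by-query on the last_seen field, something like this (again a sketch; the URL and the timestamp handling are placeholders):

import json
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"  # placeholder host/collection

def delete_stale(run_started_at_iso):
    # Remove documents whose last_seen predates the current run
    # (exclusive upper bound on the range query).
    body = {"delete": {"query": "last_seen:[* TO %s}" % run_started_at_iso}}
    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"commit": "true"},
        headers={"Content-Type": "application/json"},
        data=json.dumps(body),
    )
    resp.raise_for_status()

# e.g. delete_stale("2024-05-01T00:00:00Z")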

I'm now importing our raw data into MongoDB first, in its raw format. The data will then be translated and stored in another Mongo collection. These two steps are for business reasons.

That final Mongo collection then needs to be sent to Solr.

My question is whether sending batches of 1,000 documents to Solr is still beneficial (thinking about docs that may not change), or whether, given the volume of incoming data we see, I should look at the MongoDB connector for Solr.

Would the connector still see every doc as updated if I re-insert them blindly, and thus still send all 50m documents back to Solr every day anyway?
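To make the question concrete, the alternative to blind re-inserts that I can see would be skipping writes for unchanged documents, roughly like this (pymongo sketch; the connection string, db/collection names, and the content_hash field are placeholders, not something we have today):

import hashlib
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
coll = client["etl"]["translated"]                 # placeholder db/collection names

def content_hash(doc):
    # Stable hash of the document body, ignoring _id and bookkeeping fields.
    body = {k: v for k, v in doc.items() if k not in ("_id", "content_hash", "last_seen")}
    return hashlib.sha256(json.dumps(body, sort_keys=True, default=str).encode()).hexdigest()

def upsert_if_changed(doc):
    # Write the doc only when its content differs from what is already stored.
    # Unchanged documents produce no write, so an oplog/change-stream based
    # connector should not see them as updates.
    h = content_hash(doc)
    existing = coll.find_one({"_id": doc["_id"]}, {"content_hash": 1})
    if existing and existing.get("content_hash") == h:
        return False  # nothing changed; skip the write
    doc["content_hash"] = h
    coll.replace_one({"_id": doc["_id"]}, doc, upsert=True)
    return True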

Is my setup quite typical for the MongoDB connector?

Thanks,
Rob


