On 10/6/2010 10:49 AM, Allistair Crossley wrote:
> Hi,
>
> I was interested in gaining some insight into how you guys schedule
> updates for your Solr index (I have a single index).
>
> Right now during development I have added deltaQuery specifications to
> data import entities to control the number of rows being queried on
> re-indexes.
>
> However, in terms of *when* to reindex, we have a lot going on in the
> system - there are 4 sub-systems: custom application data, a CMS, a
> forum and a blog. It's all being indexed, and at any given time there
> will be users and administrators updating various parts of the
> sub-systems.
>
> For the time being during development I have been issuing reindexes to
> the data import handler on each CRUD operation on any given sub-system.
> This has been working fine, to be honest. It does need to be as
> immediate as possible - a scheduled update won't work for us. Even
> every 10 minutes is probably not fast enough.
>
> So I wonder what others do. Is anyone else in a similar situation?
>
> And what happens if 4 users generate 4 different requests to the data
> import handler to update different types of data? The DIH will already
> be running, let's say for request 1, then request 2 comes in - is it
> rejected? Or is it queued?
>
> I need it to be queued and serviced, because the request 1 re-index may
> have already run its queries but missed the data added by the user for
> request 2. The same then goes for requests 3 and 4.
I can't say whether the DIH will properly handle concurrent requests or
not. I figure it's always best to assume that things like this won't
work and find an elegant way to design around it.
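As one example of designing around it: the DIH status response reports
whether an import is running, so a wrapper script can check before it
fires and let the next scheduled run pick up anything it skipped. A
minimal sketch in Perl (host, port, and core name are placeholders, not
my actual setup):

    #!/usr/bin/perl
    # Sketch: only start a delta-import if DIH reports it is idle.
    # Host, port, and core name are placeholders.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $dih = 'http://localhost:8983/solr/live/dataimport';

    # With no command parameter, the handler returns its status document.
    my $status = get($dih) or die "could not reach DIH\n";

    if ($status =~ m{<str name="status">idle</str>}) {
        # Safe to start; commit=true makes the changes visible when done.
        get("$dih?command=delta-import&commit=true");
    } else {
        warn "DIH busy; the next cron run will catch up\n";
    }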
I wrote my build system in Perl (using LWP and LWP::Simple), and assumed
that the DIH would not let me run concurrent delta-imports. We settled
on every two minutes for our update frequency, and use cron for
scheduling. Two of my servers (VMs, actually) are a heartbeat cluster
running HAProxy for load balancing, which I implemented purely for
redundancy, not for scalability. Whichever host in the heartbeat
cluster is online is the one that runs the cronjobs.
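For reference, the crontab on the active host looks roughly like this
(script names and paths are illustrative, not the literal ones on my
systems):

    # Illustrative crontab entries; paths and names are placeholders.
    */2 * * * *      /usr/local/bin/idxUpdate.pl
    1-51/10 * * * *  /usr/local/bin/idxDelete.pl
    0 * * * *        /usr/local/bin/idxRrdUpdate.pl
    15 2 * * *       /usr/local/bin/idxDistribute.pl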
I have the following processes and schedules:
idxUpdate: Runs every two minutes. This script imports new data based
on an autoincrement primary key in the database, a field called DID.
From the database's perspective, changed data looks like new data: the
record gets a new DID, but another unique field (TAG_ID) stays the
same, and Solr uses TAG_ID as its uniqueKey. Updates go into an
incremental shard that is relatively small - usually less than 1GB and
500,000 documents. At the top of the hour, the update also includes a
call to optimize.
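Boiled down, the import itself is just an HTTP request. A sketch of
what that looks like (the URL, core name, and minDid parameter are
illustrative, and assume a DIH config that reads the value via
${dataimporter.request.minDid}):

    # Sketch: import rows with DID above the last recorded value into
    # the incremental shard; fold in an optimize on the top-of-hour run.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $min_did = 1_500_000;   # normally read from stored state
    my $url = 'http://localhost:8983/solr/inc/dataimport'
            . '?command=full-import&clean=false&commit=true'
            . "&minDid=$min_did";

    # DIH of this era may optimize by default on full-import, so be
    # explicit: optimize only on the run that lands at the top of the hour.
    my $minute = (localtime)[1];
    my $opt = ($minute == 0) ? 'true' : 'false';
    $url .= "&optimize=$opt";

    get($url) or die "import request failed\n";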
idxDelete: Runs every ten minutes starting at xx:01. This script gets
the list of newly deleted documents by DID. Then, 1024 at a time, it
queries every shard for that list and issues a delete wherever they are
found. After the entire list is processed, it issues a commit to any
shard that actually changed, which increases the lifespan of
indexSearchers and Solr caches on the untouched shards. At the top of
each hour, it reads the entire delete list instead of just the new
entries, and trims the list to the last 48 hours.
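Stripped to its core, the delete step looks something like this in Perl
(shard URLs, the did field name, and the load_deleted_dids helper are
all placeholders; 1024 per batch conveniently stays within Lucene's
default maxBooleanClauses limit):

    # Sketch: delete by DID in batches of 1024, per shard, committing
    # only on shards that actually changed.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape);

    my $ua     = LWP::UserAgent->new;
    my @shards = ((map { "http://localhost:8983/solr/s$_" } 0 .. 5),
                  'http://localhost:8983/solr/inc');

    # Hypothetical helper: in reality this reads the delete table.
    sub load_deleted_dids { return (1001, 1002, 1003) }
    my @dids = load_deleted_dids();

    my %changed;
    while (my @batch = splice(@dids, 0, 1024)) {
        my $raw = 'did:(' . join(' OR ', @batch) . ')';
        my $q   = uri_escape($raw);
        for my $shard (@shards) {
            # Check for hits first so untouched shards keep their caches.
            my $res = $ua->get("$shard/select?q=$q&rows=0");
            next unless $res->is_success
                && $res->decoded_content =~ /numFound="[1-9]/;
            $ua->post("$shard/update", 'Content-Type' => 'text/xml',
                      Content => "<delete><query>$raw</query></delete>");
            $changed{$shard} = 1;
        }
    }
    # One commit per changed shard, after the whole list is processed.
    $ua->post("$_/update", 'Content-Type' => 'text/xml',
              Content => '<commit/>') for keys %changed;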
idxRrdUpdate: Runs once an hour. This simply records the current
MAX(DID) from the database into an RRD database. I keep it in both a
counter and a gauge. One day I will track other statistical data about
my system and make it all into pretty graphs.
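With the RRDs module that ships with rrdtool, that job is only a few
lines (the RRD path, DS layout, and database credentials are all made
up here):

    # Sketch: record MAX(DID) into an RRD with a COUNTER and a GAUGE DS.
    use strict;
    use warnings;
    use DBI;
    use RRDs;

    my $dbh = DBI->connect('dbi:mysql:docs', 'user', 'pass',
                           { RaiseError => 1 });
    my ($max_did) = $dbh->selectrow_array('SELECT MAX(DID) FROM docs');

    # "N" means now; the two values feed the counter and gauge in DS order.
    RRDs::update('/var/lib/rrd/did.rrd', "N:$max_did:$max_did");
    if (my $err = RRDs::error) { warn "RRD update failed: $err\n"; }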
idxDistribute: Runs once a day. This uses the historical data in the
RRD database to decide which incremental data is older than one week.
Once it has that information (a DID range), it distributes those
records among the six static index shards and deletes them from the
incremental shard. If that process succeeds, it updates the stored
minimum DID value for the incremental. Each day, one of the static
indexes (each currently around 13GB and 7.6 million records) is also
optimized.
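A very rough outline of that daily job, with the caveat that the range
values are placeholders and the shard-assignment rule is my assumption:
each static shard's DIH config would filter on
${dataimporter.request.minDid}/${dataimporter.request.maxDid} plus its
own deterministic slice of the DID space.

    # Sketch: move week-old records from the incremental to the statics.
    # DIH runs asynchronously, so a real script must poll the status
    # handler until each import finishes; that polling is omitted here.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my ($min_did, $max_did) = (1_400_000, 1_450_000);  # week-old range
    my @static = map { "http://localhost:8983/solr/s$_" } 0 .. 5;

    for my $shard (@static) {
        my $res = $ua->get("$shard/dataimport?command=full-import"
                         . "&clean=false&commit=true"
                         . "&minDid=$min_did&maxDid=$max_did");
        die "import kickoff failed on $shard\n" unless $res->is_success;
    }

    # Only after all six imports complete, purge the range from the
    # incremental shard; the stored minimum DID gets updated afterward.
    my $inc = 'http://localhost:8983/solr/inc';
    $ua->post("$inc/update", 'Content-Type' => 'text/xml',
        Content => "<delete><query>did:[$min_did TO $max_did]</query></delete>");
    $ua->post("$inc/update", 'Content-Type' => 'text/xml',
        Content => '<commit/>');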
You might wonder how we deal with the fact that when a record is
changed, the old one might remain in the index for as long as 11 minutes
before the delete process finally removes it. We assume that the
incremental index, being less than 10% of the size of the static
indexes, will always respond faster. Since the updated copy of the
record will always be in the incremental, it should respond first to the
distributed query and therefore be the one that is included in the
results. That assumption seems to be correct so far.
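For completeness, the queries that rely on this first-responder
behavior are ordinary distributed requests; the host names and the
query field here are placeholders:

    # Sketch: a distributed query across the incremental and static
    # shards. When the same TAG_ID shows up in two shards, Solr keeps
    # one copy; the assumption is the faster incremental shard wins.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI::Escape qw(uri_escape);

    my $shards = join(',', 'idx1:8983/solr/inc',
                           map { "idx1:8983/solr/s$_" } 0 .. 5);
    my $url = 'http://idx1:8983/solr/inc/select'
            . '?q=' . uri_escape('catchall:example')
            . '&shards=' . uri_escape($shards);
    print get($url);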
Shawn