On 10/6/2010 10:49 AM, Allistair Crossley wrote:
Hi,

I was interested in gaining some insight into how you guys schedule updates for 
your Solr index (I have a single index).

Right now during development I have added deltaQuery specifications to data 
import entities to control the number of rows being queried on re-indexes.

However in terms of *when* to reindex we have a lot going on in the system - 
there are 4 sub-systems: custom application data, a CMS, a forum and a blog. 
It's all being indexed and at any given time there will be users and 
administrators all updating various parts of the sub-systems.

For the time being during development I have been issuing reindexes to the data 
import handler on each CRUD operation in any given sub-system. This has been working fine 
to be honest. It does need to be as immediate as possible - a scheduled update 
won't work for us. Even every 10 minutes is probably not fast enough.

So I wonder what others do. Is anyone else in a similar situation?

And what happens if 4 users generate 4 different requests to the data import 
handler to update different types of data? The DIH will already be running, 
let's say for request 1, and then request 2 comes in - is it rejected, or is 
it queued?

I need it to be queued and serviced because the request 1 re-index may have 
already run its queries but missed the data added by the user for request 2. 
Same then goes for the requests 3 and 4.

I can't say whether the DIH will properly handle concurrent requests or not. I figure it's always best to assume that things like this won't work and find an elegant way to design around it.

I wrote my build system in perl (using LWP and LWP::Simple), and assumed that the DIH would not let me run concurrent delta-imports. We settled on every two minutes for our update frequency, and use cron for scheduling. Two of my servers (VMs, actually) are a heartbeat cluster running HAProxy for load balancing, which I implemented purely for redundancy, not for scalability. Whichever host in the heartbeat cluster is online is the one that runs the cronjobs.
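
For what it's worth, one way to get the queued behavior Allistair asked about 
is to poll the DIH status endpoint and only send the next command once the 
handler reports idle. A rough sketch, not my production code - the host, port, 
and handler path are made up, and the status match assumes the stock XML 
response:

    #!/usr/bin/perl
    # Sketch: serialize DIH requests by waiting for the handler to go
    # idle before issuing the next delta-import. Host, port, and core
    # path are assumptions - adjust for your installation.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $dih = 'http://localhost:8983/solr/dataimport';

    sub wait_for_idle {
        while (1) {
            my $status = get("$dih?command=status");
            die "no response from Solr\n" unless defined $status;
            # The stock XML status response contains
            # <str name="status">idle</str> when nothing is running.
            return if $status =~ /"status">idle</;
            sleep 5;
        }
    }

    wait_for_idle();
    defined get("$dih?command=delta-import&commit=true")
        or die "delta-import request failed\n";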

I have the following processes and schedules:

idxUpdate: Runs every two minutes. This script imports new data based on an autoincrement primary key in the database, a field called DID. From the database perspective, changed data looks like new data - it gets its DID updated but another unique field (TAG_ID) stays the same. Solr uses TAG_ID as its uniqueKey. Updates go into an incremental shard that is relatively small - usually less than 1GB and 500,000 documents. At the top of the hour, the update includes a call to optimize.
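
The cron side of that is nothing fancy. A stripped-down sketch of what such a 
script might look like (the URL and core layout are placeholders, and you 
should verify that your DIH version accepts optimize=true as a request 
parameter):

    #!/usr/bin/perl
    # Sketch of an idxUpdate-style cron script: issue a delta-import
    # every run, and add an optimize at the top of the hour. The URL
    # and core name are placeholders, not my actual setup.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $inc = 'http://localhost:8983/solr/incremental';

    # The deltaQuery on the Solr side picks up rows with DID greater
    # than the last value it imported; commit=true makes them visible.
    my $cmd = "$inc/dataimport?command=delta-import&commit=true";

    # localtime returns (sec, min, hour, ...) - at minute zero of the
    # hour, also ask for an optimize.
    my (undef, $min) = localtime(time);
    $cmd .= '&optimize=true' if $min == 0;

    defined get($cmd) or die "update request failed\n";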

idxDelete: Runs every ten minutes starting at xx:01. This script gets the list of newly deleted documents by DID. Then, 1024 of them at a time, it queries every shard for the IDs in that list and issues a delete wherever they are found. After the entire list is complete, it issues a commit to any shard that was actually changed, which increases the lifespan of indexSearchers and Solr caches. At the top of each hour, it reads the entire list of deletes instead of just the new ones, and trims the delete list to the last 48 hours.
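
In sketch form, the batching logic looks something like this. The shard URLs, 
the did field name, and the deletes-table lookup are all simplified 
placeholders, not my real script:

    #!/usr/bin/perl
    # Sketch of the batched delete: take deleted DIDs 1024 at a time,
    # check each shard for them, and delete/commit only on shards that
    # actually matched, so untouched searchers and caches stay warm.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape);

    my $ua     = LWP::UserAgent->new;
    my @shards = map { "http://localhost:8983/solr/shard$_" } 1 .. 6;
    my @dids   = read_deleted_dids();

    my %changed;
    while (my @batch = splice @dids, 0, 1024) {
        my $q = 'did:(' . join(' OR ', @batch) . ')';
        for my $shard (@shards) {
            # rows=0 because we only care whether anything matches.
            my $res = $ua->get("$shard/select?rows=0&q=" . uri_escape($q));
            next unless $res->is_success
                and $res->decoded_content =~ /numFound="[1-9]/;
            $ua->post("$shard/update", Content_Type => 'text/xml',
                      Content => "<delete><query>$q</query></delete>");
            $changed{$shard} = 1;
        }
    }

    # One commit per shard that was actually modified.
    $ua->post("$_/update", Content_Type => 'text/xml',
              Content => '<commit/>') for keys %changed;

    # Placeholder - in the real script this comes from the database
    # table that records deletions.
    sub read_deleted_dids { return () }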

idxRrdUpdate: Runs once an hour. This simply records the current MAX(DID) from the database into an RRD database. I keep it in both a counter and a gauge. One day I will track other statistical data about my system and make it all into pretty graphs.
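
That one is trivial - something like the following, with the DSN, credentials, 
table name, and RRD layout made up for illustration:

    #!/usr/bin/perl
    # Sketch of idxRrdUpdate: record MAX(DID) into an RRD file. The
    # DSN, credentials, table name, and RRD layout (one COUNTER and
    # one GAUGE data source) are illustrative, not my real config.
    use strict;
    use warnings;
    use DBI;
    use RRDs;

    my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'pass',
                           { RaiseError => 1 });
    my ($max_did) = $dbh->selectrow_array('SELECT MAX(did) FROM docs');

    # "N" means now; the two values feed the counter and the gauge
    # data sources, in the order defined at rrdtool create time.
    RRDs::update('/var/lib/rrd/did.rrd', "N:$max_did:$max_did");
    die 'RRD update failed: ' . RRDs::error . "\n" if RRDs::error;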

idxDistribute: Runs once a day. This uses the historical data in the RRD database to decide which incremental data is older than one week. Once it has that information (a DID range), it distributes those records to each of the six static index shards and deletes them from the incremental shard. If that process is successful, it updates the stored minimum DID value for the incremental. Each day, one of the static indexes (currently 13GB and 7.6 million records) is optimized.
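
The move itself boils down to two HTTP calls with a safety check in between. 
Roughly like this - the range lookup, shard URLs, and the parameterized DIH 
query on the static shard are all assumptions for the sake of the sketch:

    #!/usr/bin/perl
    # Sketch of the daily distribution: copy an aged DID range into a
    # static shard, wait for the import to finish, then delete that
    # range from the incremental. All names here are placeholders, and
    # it assumes the static shard's DIH query is parameterized with
    # ${dataimporter.request.minDid} / ${dataimporter.request.maxDid}.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my ($min_did, $max_did) = aged_did_range();
    my $static = 'http://localhost:8983/solr/static3';
    my $inc    = 'http://localhost:8983/solr/incremental';

    # clean=false appends to the static shard instead of wiping it.
    my $r = $ua->get("$static/dataimport?command=full-import&clean=false"
                   . "&commit=true&minDid=$min_did&maxDid=$max_did");
    die "import request failed\n" unless $r->is_success;

    # DIH runs asynchronously, so poll until it reports idle before
    # touching the incremental shard.
    sleep 30
        while ($ua->get("$static/dataimport?command=status")
                  ->decoded_content || '') =~ /busy/;

    # Only after the copy succeeds, remove the range from the
    # incremental; commit=true on the URL commits in the same request.
    $ua->post("$inc/update?commit=true", Content_Type => 'text/xml',
        Content => "<delete><query>did:[$min_did TO $max_did]</query></delete>");

    # Placeholder - the real script derives this from the RRD history.
    sub aged_did_range { return (0, 0) }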

You might wonder how we deal with the fact that when a record is changed, the old one might remain in the index for as long as 11 minutes before the delete process finally removes it. We assume that the incremental index, being less than 10% of the size of the static indexes, will always respond faster. Since the updated copy of the record will always be in the incremental, it should respond first to the distributed query and therefore be the one that is included in the results. That assumption seems to be correct so far.

Shawn
