On 10/6/2010 10:49 AM, Allistair Crossley wrote:
> Hi,
>
> I was interested in gaining some insight into how you guys schedule
> updates for your Solr index (I have a single index).
>
> Right now during development I have added deltaQuery specifications to
> data import entities to control the number of rows being queried on
> re-indexes.
>
> However, in terms of *when* to reindex, we have a lot going on in the
> system - there are 4 sub-systems: custom application data, a CMS, a
> forum and a blog. It's all being indexed, and at any given time there
> will be users and administrators updating various parts of the
> sub-systems.
>
> For the time being during development I have been issuing reindexes to
> the data import handler on each CRUD operation on any given sub-system.
> This has been working fine, to be honest. It does need to be as
> immediate as possible - a scheduled update won't work for us. Even
> every 10 minutes is probably not fast enough.
>
> So I wonder what others do. Is anyone else in a similar situation?
>
> And what happens if 4 users generate 4 different requests to the data
> import handler to update different types of data? The DIH will already
> be running, let's say for request 1, then request 2 comes in - is it
> rejected? Or is it queued?
>
> I need it to be queued and serviced, because the request 1 re-index may
> have already run its queries but missed the data added by the user for
> request 2. The same then goes for requests 3 and 4.
I can't say whether the DIH will properly handle concurrent requests or
not. I figure it's always best to assume that things like this won't
work and find an elegant way to design around it.
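As one example of designing around it: the DIH status response reports
whether an import is running, so a wrapper script can check before it
fires and let the next scheduled run pick up anything it skipped. A
minimal sketch in Perl (host, port, and core name are placeholders, not
my actual setup):

    #!/usr/bin/perl
    # Sketch: only start a delta-import if DIH reports it is idle.
    # Host, port, and core name are placeholders.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $dih = 'http://localhost:8983/solr/live/dataimport';

    # With no command parameter, the handler returns its status document.
    my $status = get($dih) or die "could not reach DIH\n";

    if ($status =~ m{<str name="status">idle</str>}) {
        # Safe to start; commit=true makes the changes visible when done.
        get("$dih?command=delta-import&commit=true");
    } else {
        warn "DIH busy; the next cron run will catch up\n";
    }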
I wrote my build system in Perl (using LWP and LWP::Simple), and assumed
that the DIH would not let me run concurrent delta-imports. We settled
on every two minutes for our update frequency, and use cron for
scheduling. Two of my servers (VMs, actually) are a heartbeat cluster
running HAProxy for load balancing, which I implemented purely for
redundancy, not for scalability. Whichever host in the heartbeat
cluster is online is the one that runs the cronjobs.
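For reference, the crontab on the active host looks roughly like this
(script names and paths are illustrative, not the literal ones on my
systems):

    # Illustrative crontab entries; paths and names are placeholders.
    */2 * * * *      /usr/local/bin/idxUpdate.pl
    1-51/10 * * * *  /usr/local/bin/idxDelete.pl
    0 * * * *        /usr/local/bin/idxRrdUpdate.pl
    15 2 * * *       /usr/local/bin/idxDistribute.pl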
I have the following processes and schedules:
idxUpdate: Runs every two minutes. This script imports new data based
on an autoincrement primary key in the database, a field called DID.
From the database's perspective, changed data looks like new data: the
record gets a new DID, but another unique field (TAG_ID) stays the
same, and Solr uses TAG_ID as its uniqueKey. Updates go into an
incremental shard that is relatively small - usually less than 1GB and
500,000 documents. At the top of the hour, the update also includes a
call to optimize.
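Boiled down, the import itself is just an HTTP request. A sketch of
what that looks like (the URL, core name, and minDid parameter are
illustrative, and assume a DIH config that reads the value via
${dataimporter.request.minDid}):

    # Sketch: import rows with DID above the last recorded value into
    # the incremental shard; fold in an optimize on the top-of-hour run.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $min_did = 1_500_000;   # normally read from stored state
    my $url = 'http://localhost:8983/solr/inc/dataimport'
            . '?command=full-import&clean=false&commit=true'
            . "&minDid=$min_did";

    # DIH of this era may optimize by default on full-import, so be
    # explicit: optimize only on the run that lands at the top of the hour.
    my $minute = (localtime)[1];
    my $opt = ($minute == 0) ? 'true' : 'false';
    $url .= "&optimize=$opt";

    get($url) or die "import request failed\n";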
idxDelete: Runs every ten minutes starting at xx:01. This script gets
the list of newly deleted documents by DID. Then, 1024 at a time, it
queries every shard for that list and issues a delete wherever they are
found. After the entire list is processed, it issues a commit to any
shard that actually changed, which increases the lifespan of
indexSearchers and Solr caches on the untouched shards. At the top of
each hour, it reads the entire delete list instead of just the new
entries, and trims the list to the last 48 hours.
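Stripped to its core, the delete step looks something like this in Perl
(shard URLs, the did field name, and the load_deleted_dids helper are
all placeholders; 1024 per batch conveniently stays within Lucene's
default maxBooleanClauses limit):

    # Sketch: delete by DID in batches of 1024, per shard, committing
    # only on shards that actually changed.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape);

    my $ua     = LWP::UserAgent->new;
    my @shards = ((map { "http://localhost:8983/solr/s$_" } 0 .. 5),
                  'http://localhost:8983/solr/inc');

    # Hypothetical helper: in reality this reads the delete table.
    sub load_deleted_dids { return (1001, 1002, 1003) }
    my @dids = load_deleted_dids();

    my %changed;
    while (my @batch = splice(@dids, 0, 1024)) {
        my $raw = 'did:(' . join(' OR ', @batch) . ')';
        my $q   = uri_escape($raw);
        for my $shard (@shards) {
            # Check for hits first so untouched shards keep their caches.
            my $res = $ua->get("$shard/select?q=$q&rows=0");
            next unless $res->is_success
                && $res->decoded_content =~ /numFound="[1-9]/;
            $ua->post("$shard/update", 'Content-Type' => 'text/xml',
                      Content => "<delete><query>$raw</query></delete>");
            $changed{$shard} = 1;
        }
    }
    # One commit per changed shard, after the whole list is processed.
    $ua->post("$_/update", 'Content-Type' => 'text/xml',
              Content => '<commit/>') for keys %changed;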
idxRrdUpdate: Runs once an hour. This simply records the current
MAX(DID) from the database into an RRD database. I keep it in both a
counter and a gauge. One day I will track other statistical data about
my system and make it all into pretty graphs.
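With the RRDs module that ships with rrdtool, that job is only a few
lines (the RRD path, DS layout, and database credentials are all made
up here):

    # Sketch: record MAX(DID) into an RRD with a COUNTER and a GAUGE DS.
    use strict;
    use warnings;
    use DBI;
    use RRDs;

    my $dbh = DBI->connect('dbi:mysql:docs', 'user', 'pass',
                           { RaiseError => 1 });
    my ($max_did) = $dbh->selectrow_array('SELECT MAX(DID) FROM docs');

    # "N" means now; the two values feed the counter and gauge in DS order.
    RRDs::update('/var/lib/rrd/did.rrd', "N:$max_did:$max_did");
    if (my $err = RRDs::error) { warn "RRD update failed: $err\n"; }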
idxDistribute: Runs once a day. This uses the historical data in the
RRD database to decide which incremental data is older than one week.
Once it has that information (a DID range), it distributes those
records among the six static index shards and deletes them from the
incremental shard. If that process succeeds, it updates the stored
minimum DID value for the incremental. Each day, one of the static
indexes (each currently around 13GB and 7.6 million records) is also
optimized.
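A very rough outline of that daily job, with the caveat that the range
values are placeholders and the shard-assignment rule is my assumption:
each static shard's DIH config would filter on
${dataimporter.request.minDid}/${dataimporter.request.maxDid} plus its
own deterministic slice of the DID space.

    # Sketch: move week-old records from the incremental to the statics.
    # DIH runs asynchronously, so a real script must poll the status
    # handler until each import finishes; that polling is omitted here.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my ($min_did, $max_did) = (1_400_000, 1_450_000);  # week-old range
    my @static = map { "http://localhost:8983/solr/s$_" } 0 .. 5;

    for my $shard (@static) {
        my $res = $ua->get("$shard/dataimport?command=full-import"
                         . "&clean=false&commit=true"
                         . "&minDid=$min_did&maxDid=$max_did");
        die "import kickoff failed on $shard\n" unless $res->is_success;
    }

    # Only after all six imports complete, purge the range from the
    # incremental shard; the stored minimum DID gets updated afterward.
    my $inc = 'http://localhost:8983/solr/inc';
    $ua->post("$inc/update", 'Content-Type' => 'text/xml',
        Content => "<delete><query>did:[$min_did TO $max_did]</query></delete>");
    $ua->post("$inc/update", 'Content-Type' => 'text/xml',
        Content => '<commit/>');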
You might wonder how we deal with the fact that when a record is
changed, the old one might remain in the index for as long as 11 minutes
before the delete process finally removes it. We assume that the
incremental index, being less than 10% of the size of the static
indexes, will always respond faster. Since the updated copy of the
record will always be in the incremental, it should respond first to the
distributed query and therefore be the one that is included in the
results. That assumption seems to be correct so far.
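For completeness, the queries that rely on this first-responder
behavior are ordinary distributed requests; the host names and the
query field here are placeholders:

    # Sketch: a distributed query across the incremental and static
    # shards. When the same TAG_ID shows up in two shards, Solr keeps
    # one copy; the assumption is the faster incremental shard wins.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI::Escape qw(uri_escape);

    my $shards = join(',', 'idx1:8983/solr/inc',
                           map { "idx1:8983/solr/s$_" } 0 .. 5);
    my $url = 'http://idx1:8983/solr/inc/select'
            . '?q=' . uri_escape('catchall:example')
            . '&shards=' . uri_escape($shards);
    print get($url);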
Shawn