: I have a few scaling questions I hope you might be able to answer for me.
: I'm keen to understand how solr performs under production loads with
: significant real-time update traffic.  Specifically,

These are all really good questions ... unfortunately I'm not sure that
I'm permitted to give out specific answers to some of them.  As far as
understanding Solr's ability to stand up under load, I'll see if I can get
some time/permission to run some benchmarks and publish the numbers (or
perhaps Yonik can do this as part of his prep for presenting at
ApacheConEU ... what do you think, Yonik?)

: 1. How many searches per second are you currently handling?
: 2. How big is the solr fleet?

I'm going to have to put Q1 and Q2 in the "decline to state" category.

: 3. What is your update rate?

Hard to say ... I can tell you that our index contains roughly N
documents, and from some greps of our logs I can see that on a typical
day our "master" server receives about N/2 "<add>" commands ... but that
doesn't mean half of our logical data is changing every day; most of
those updates are to the same logical documents over and over.  It does,
however, give you an idea of the amount of churn in Lucene documents
taking place in a 24-hour period.
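
In case it helps to picture what those "<add>" commands are: each one is
just an XML message POSTed to Solr's update handler, something along
these lines (the field names here are made up for illustration):

    <add>
      <doc>
        <field name="id">doc-12345</field>
        <field name="popularity">42</field>
      </doc>
    </add>

A separate "<commit/>" message is what makes a batch of adds visible to
searchers, which matters for the commit/optimize story below.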

I should also point out that most of these updates come in big spurts,
but we've never encountered a situation where waiting for Solr to index a
document was a bottleneck -- pulling document data from our primary
datastore always takes longer than indexing the docs.

: 4. What is the propagation delay from master to slave, i.e. how often do you
: propagate and how long does it take per box?
: 5. What is your optimization schedule and how does it affect overall
: performance of the system?

The answers to Q4 and Q5 are related, and involve telling a much longer
story...

A year ago when we first started using Solr for faceted browsing, it had
Lucene 1.4.3 under the hood.  Our updating strategy involved issuing
commit commands after every batch of updates (where a single batch was
never bigger than 2000 documents), with snapshooter configured in a
postCommit listener, and snappuller on the slaves running every 10
minutes.  We optimized twice a day, but while optimizing we disabled the
processes that sent updates, because optimizing could easily take 20-30
minutes.  The index had thousands of indexed fields to support the
faceting we wanted, and this was the cause of our biggest performance
issue: the space needed for all of those field norms.  (Yonik implemented
the OMIT_NORMS option in Lucene 1.9 to deal with this.)
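
For anyone who hasn't set up the distribution scripts: the snapshooter
hook is just a listener in solrconfig.xml.  Ours was along these lines
(the dir value is a placeholder, not our real path):

    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>

The wait flag just means the commit call doesn't return until the
snapshot has been taken.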

When we upgraded to Lucene 1.9 and started adding support for text
searching, our index got significantly smaller (even though we were
adding a lot of new tokenized fields), thanks to being able to turn off
norms for all of those existing faceting fields.  The other great thing
about 1.9 was that optimizing got a lot faster (I'm not certain if that's
just because of the reduced number of fields with norms, or if some other
improvement was made to how optimize works in Lucene 1.9).  Optimizing
our index now typically takes only ~1 minute; the longest I've seen it
take is 5 minutes.
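
In case you're wondering what "turning off norms" looks like on the Solr
side: it's just an attribute on the field (or fieldType) declarations in
schema.xml, something like this (hypothetical field name):

    <field name="facet_color" type="string" indexed="true" stored="false"
           omitNorms="true"/>

Norms cost a byte per document for every field that has them, which is
why thousands of faceting fields added up so fast.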

While doing a lot of prelaunch profiling, we discovered that under
extreme loads there was a huge difference in the outliers between an
optimized and a non-optimized index -- we always knew querying an
optimized index was faster on average than querying an unoptimized index,
we just didn't realize how big the gap got when you looked at the
non-average cases.

Sooo... since optimize times got so much shorter, and the benefits of
always querying an optimized index were so easy to see, we changed the
solrconfig.xml for our master to only snapshoot on postOptimize, modified
our optimize cron to run every 30 minutes, and modified the snappuller
crons on the slaves to check for new snapshots more often (every 5
minutes, I think).
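
Concretely, the solrconfig.xml change is just moving that same
snapshooter listener from the postCommit event to postOptimize (dir is
again a placeholder):

    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>

The optimize cron itself doesn't need anything fancy -- sending an
"<optimize/>" message to the master's update handler every 30 minutes
does the job.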

This means we are only ever snappulling complete copies of our index,
twice an hour.  So the typical maximum delay between an update on the
master and it showing up on a slave is ~35 minutes (up to 30 minutes
waiting for the next optimize, plus up to 5 minutes before a slave's
snappuller notices the new snapshot), with the average delay being 15-20
minutes.

If we were concerned about reducing this delay we could (even with our
current strategy of only pulling optimized indexes to the slaves), but
this is fast enough for our purposes, and it allows us to really take
advantage of the filterCaches on the slaves.


-Hoss
