Thanks Hoss. This is really useful information. I understand you may not be able to answer 1 and 2 directly, so how about if I combine them into one question that doesn't require you to release quite as much information: could you tell me how many tps you do per box, and a rough spec of what the boxes are? I.e. the ratio of the answers to questions 1 and 2.
Thanks,
-D

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Saturday, May 27, 2006 1:23 PM
To: solr-user@lucene.apache.org
Subject: RE: Two Solr Announcements: CNET Product Search and DisMax

: I have a few scaling questions I hope you might be able to answer for me.
: I'm keen to understand how solr performs under production loads with
: significant real-time update traffic. Specifically,

These are all really good questions ... unfortunately I'm not sure that I'm permitted to give out specific answers to some of them. As far as understanding Solr's ability to stand up under load, I'll see if I can get some time/permission to run some benchmarks and publish the numbers (or perhaps Yonik can do this as part of his prep for presenting at ApacheConEU ... what do you think Yonik?)

: 1. How many searches per second are you currently handling?
: 2. How big is the solr fleet?

I'm going to have to put Q1 and Q2 in the "decline to state" category.

: 3. What is your update rate?

Hard to say ... I can tell you that our index contains roughly N documents, and doing some greps of our logs I can see that on a typical day our "master" server receives about N/2 "<add>" commands ... but this doesn't mean that half of our logical data is changing every day; most of those updates are to the same logical documents over and over, but it does give you an idea of the amount of churn in lucene documents that's taking place in a 24 hour period. I should also point out that most of these updates come in big spurts, but we've never encountered a situation where waiting for Solr to index a document was a bottleneck -- pulling document data from our primary datastore always takes longer than indexing the docs.

: 4. What is the propagation delay from master to slave, i.e. how often do you
: propagate and how long does it take per box?
: 5. What is your optimization schedule and how does it affect overall
: performance of the system?

The answers to Q4 and Q5 are related, and involve telling a much longer story...

A year ago when we first started using Solr for faceted browsing, it had Lucene 1.4.3 under the hood. Our updating strategy involved issuing commit commands after every batch of updates (where a single batch was never bigger than 2000 documents), with snapshooter configured in a postCommit listener, and snappuller on the slaves running every 10 minutes. We optimized twice a day, but while optimizing we disabled the processes that sent updates because optimizing could easily take 20-30 minutes. The index had thousands of indexed fields to support the faceting we wanted, and this was the cause of our biggest performance issue: the space needed for all those field norms. (Yonik implemented the OMIT_NORMS option in Lucene 1.9 to deal with this.)

When we upgraded to Lucene 1.9 and started adding support for text searching, our index got significantly smaller (even though we were adding a lot of new tokenized fields) thanks to being able to turn off norms for all of those existing faceting fields. The other great thing about using 1.9 was that optimizing got a lot faster (I'm not certain if it's just because of the reduced number of fields with norms, or if some other improvement was made to how optimize works in Lucene 1.9). Optimizing our index now typically takes only ~1 minute; the longest I've seen it take is 5 minutes.
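For anyone who hasn't wired up snapshooter before, the postCommit listener mentioned above is configured in solrconfig.xml with something roughly like the following -- the exe/dir values here are illustrative placeholders, not our actual paths:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- run snapshooter after every commit so the slaves have a fresh snapshot to pull -->
      <listener event="postCommit" class="solr.RunExecutableListener">
        <str name="exe">snapshooter</str>
        <str name="dir">solr/bin</str>
        <bool name="wait">true</bool>
      </listener>
    </updateHandler>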
While doing a lot of prelaunch profiling, we discovered that under extreme loads there was a huge difference in the outliers between an optimized and a non-optimized index -- we always knew querying an optimized index was faster on average than querying an unoptimized index, we just didn't realize how big the gap got when you looked at the non-average cases.

Sooo... since optimize times got so much shorter, and the benefits of always querying an optimized index were so easy to see, we changed the solrconfig.xml for our master to only snapshoot on postOptimize, modified our optimize cron to run every 30 minutes, and modified the snappuller crons on the slaves to check for new snapshots more often (5 minutes, I think). This means we are only ever snappulling complete copies of our index, twice an hour. So the typical max delay in how long it takes for an update on the master to show up on the slave is ~35 minutes -- the average delay being 15-20 minutes. If we were concerned about reducing this delay we could (even with our current strategy of only pulling optimized indexes to the slaves), but this is fast enough for our purposes, and allows us to really take advantage of the filterCaches on the slaves. (A rough sketch of that listener and those cron entries is below.)

-Hoss
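P.S. For reference, here's roughly what the current setup looks like. The paths, host, and port are illustrative guesses rather than our actual production config:

    <!-- solrconfig.xml on the master: snapshoot only after a successful optimize -->
    <listener event="postOptimize" class="solr.RunExecutableListener">
      <str name="exe">snapshooter</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
    </listener>

    # master crontab: send an <optimize/> every 30 minutes, which triggers the snapshot
    */30 * * * *  curl -s http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<optimize/>'

    # slave crontab: check for a new snapshot every 5 minutes and install it if found
    */5 * * * *  solr/bin/snappuller && solr/bin/snapinstaller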