RE: Two Solr Announcements: CNET Product Search and DisMax
Chris,

Cool stuff. Congratulations on launching. I have a few scaling questions I hope you might be able to answer for me. I'm keen to understand how Solr performs under production loads with significant real-time update traffic. Specifically:

1. How many searches per second are you currently handling?
2. How big is the Solr fleet?
3. What is your update rate?
4. What is the propagation delay from master to slave, i.e. how often do you propagate and how long does it take per box?
5. What is your optimization schedule and how does it affect overall performance of the system?

If anyone else out there has similar data from large-scale experiences with Solr, I'd love to hear those too.

Thanks,

-D

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Saturday, May 20, 2006 3:18 PM
To: solr-user@lucene.apache.org
Subject: Two Solr Announcements: CNET Product Search and DisMax

I've got two related announcements to make, which I think are pretty cool...

The first is that the search result pages for CNET Shopper.com are now powered by Solr. You may be thinking "Didn't he announce that last year?" ... not quite. CNET's faceted product listing pages for browsing products by category have been powered by Solr for about a year now, but up until a few weeks ago, searching for products by keywords was still powered by a legacy system. I was working hard to come up with a good mechanism for building Lucene queries based on user input that would allow us to leverage our "domain expertise" about consumer technology products to ensure that users got the best matches.

Which brings me to my second announcement: I've just committed a new SolrRequestHandler called the "DisMaxRequestHandler" into the Solr subversion repository. This query handler supports a simplified version of the Lucene QueryParser syntax: quotes can be used to group phrases, and +/- can be used to denote mandatory and prohibited clauses ... but all other Lucene query parser special characters are escaped to simplify the user experience. The handler takes responsibility for building a good query from the user's input using BooleanQueries containing DisjunctionMaxQueries across the fields and boosts you specify. It also allows you to provide additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches. These options can all be specified as init parameters for the handler in your solrconfig.xml, or overridden in the Solr query URL.

The code in this plugin is what is now powering CNET product search. I've updated the "example" solrconfig.xml to take advantage of it; you can take it for a spin right now if you build from scratch using subversion, otherwise you'll have to wait for the solr-2006-05-21.zip nightly release due out in a few hours. Once you've got it, the javadocs for DisMaxRequestHandler contain the details about all of the options it supports, and here are a few URLs you can try out using the product data in the exampledocs directory...

Normal results for the word "video" using the StandardRequestHandler with the default search field...

http://localhost:8983/solr/select/?q=video&fl=name+score&qt=standard

The "dismax" handler is configured to search across the text, features, name, sku, id, manu, and cat fields, all with varying boosts designed to ensure that "better" matches appear first; specifically, documents which match on the name and cat fields get higher scores...

http://localhost:8983/solr/select/?q=video&qt=dismax

...note that this instance is also configured with a default field list, which can be overridden in the URL...

http://localhost:8983/solr/select/?q=video&qt=dismax&fl=*,score

You can also override which fields are searched on, and how much boost each field gets...

http://localhost:8983/solr/select/?q=video&qt=dismax&qf=features^20.0+text^0.3

Another instance of the handler is registered using the qt "instock" and has slightly different configuration options, notably a filter for (you guessed it) inStock:true...

http://localhost:8983/solr/select/?q=video&qt=dismax&fl=name,score,inStock
http://localhost:8983/solr/select/?q=video&qt=instock&fl=name,score,inStock

One of the other really cool features in this handler is robust support for specifying the "BooleanQuery.minimumNumberShouldMatch" you want used based on how many terms are in your user's query. This allows flexibility for typos and partial matches. For the dismax handler, 1 and 2 word queries require that all of the optional clauses match, but for 3-5 word queries one missing word is allowed...

http://localhost:8983/solr/select/?q=belkin+ipod&qt=dismax
http://localhost:8983/solr/select/?q=belkin+ipod+gibberish&qt=dismax
http://localhost:8983/solr/select/?q=belkin+ipod+apple&qt=dismax

Just like the StandardRequestHandler, it supports the debugQuery option to view the parsed query and the score explanations for each doc...

http://localhost:8
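For reference, registering a handler along these lines in solrconfig.xml looks roughly like the sketch below. The boost values and mm spec are illustrative example defaults, not CNET's production settings.

<!-- a sketch, not CNET's actual configuration -->
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <!-- fields to search, each with its own boost -->
    <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4</str>
    <!-- minimum-should-match: all clauses required for 1-2 term queries,
         one clause may be missing for 3-5 terms ('<' is escaped for XML) -->
    <str name="mm">2&lt;-1 5&lt;-2</str>
    <!-- default field list, overridable with fl= in the URL -->
    <str name="fl">*,score</str>
  </lst>
</requestHandler>

An "instock"-style instance would be registered the same way under a different name, with a filtering query such as inStock:true added to its defaults.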
RE: Two Solr Announcements: CNET Product Search and DisMax
Thanks Hoss. This is really useful information.

I understand you may not be able to answer 1 and 2 directly, so how about if I combine them into one question that doesn't require you to release quite as much information. Could you tell me how many tps you do per box, and a rough spec of what the boxes are? I.e. the ratio of the answers to questions 1 and 2.

Thanks,

-D

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Saturday, May 27, 2006 1:23 PM
To: solr-user@lucene.apache.org
Subject: RE: Two Solr Announcements: CNET Product Search and DisMax

: I have a few scaling questions I hope you might be able to answer for me.
: I'm keen to understand how solr performs under production loads with
: significant real-time update traffic. Specifically,

These are all really good questions ... unfortunately I'm not sure that I'm permitted to give out specific answers to some of them. As far as understanding Solr's ability to stand up under load, I'll see if I can get some time/permission to run some benchmarks and publish the numbers (or perhaps Yonik can do this as part of his prep for presenting at ApacheConEU ... what do you think Yonik?)

: 1. How many searches per second are you currently handling?
: 2. How big is the solr fleet?

I'm going to have to put Q1 and Q2 in the "decline to state" category.

: 3. What is your update rate?

Hard to say ... I can tell you that our index contains roughly N documents, and doing some greps of our logs I can see that on a typical day our "master" server receives about N/2 "<add>" commands ... but this doesn't mean that half of our logical data is changing every day; most of those updates are to the same logical documents over and over, but it does give you an idea of the amount of churn in Lucene documents that's taking place in a 24 hour period. I should also point out that most of these updates are coming in big spurts, but we've never encountered a situation where waiting for Solr to index a document was a bottleneck -- pulling document data from our primary datastore always takes longer than indexing the docs.

: 4. What is the propagation delay from master to slave, i.e. how often do you
: propagate and how long does it take per box?
: 5. What is your optimization schedule and how does it affect overall
: performance of the system?

The answers to Q4 and Q5 are related, and involve telling a much longer story...

A year ago when we first started using Solr for faceted browsing, it had Lucene 1.4.3 under the hood. Our updating strategy involved issuing commit commands after every batch of updates (where a single batch was never bigger than 2000 documents) with snapshooter configured in a postCommit listener, and snappuller on the slaves running every 10 minutes. We optimized twice a day, but while optimizing we disabled the processes that sent updates because optimizing could easily take 20-30 minutes.

The index had thousands of indexed fields to support the faceting we wanted, and this was the cause of our biggest performance issue: the space needed for all those field norms. (Yonik implemented the OMIT_NORMS option in Lucene 1.9 to deal with this.) When we upgraded to Lucene 1.9 and started adding support for text searching, our index got significantly smaller (even though we were adding a lot of new tokenized fields) thanks to being able to turn off norms for all of those existing faceting fields.
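(For anyone wondering what that looks like on the Solr side: norms can be turned off per field in schema.xml. An illustrative declaration, with a made-up field name, would be:

  <!-- a faceting field that carries no norms -->
  <field name="cat_facet" type="string" indexed="true" stored="false" omitNorms="true"/>

With thousands of such fields, dropping the byte-per-document norms array for each one adds up quickly.)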
The other great thing about using 1.9 was that optimizing got a lot faster (I'm not certain if it's just because of the reduced number of fields with norms, or if some other improvement was made to how optimize works in Lucene 1.9). Optimizing our index now typically takes only ~1 minute; the longest I've seen it take is 5 minutes.

While doing a lot of prelaunch profiling, we discovered that under extreme loads there was a huge difference in the outliers between an optimized and a non-optimized index -- we always knew querying an optimized index was faster on average than querying an unoptimized index, we just didn't realize how big the gap got when you looked at the non-average cases.

Sooo... since optimize times got so much shorter, and the benefits of always querying an optimized index were so easy to see, we changed the solrconfig.xml for our master to only snapshoot on postOptimize, modified our optimize cron to run every 30 minutes, and modified the snappuller crons on the slaves to check for new snapshots more often (every 5 minutes, I think). This means we are only ever snappulling complete copies of our index, twice an hour. So the typical max delay in how long it takes for an update on the master to show up on the slave is ~35 minutes -- the average delay being 15-20 minutes. We could reduce this delay if we were concerned about it (even with our current strategy of only pulling optimized indexes to the slaves), but this is fast enough for our purposes, and allows us to really take advantage of the filterCaches on the slaves.
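For concreteness, the solrconfig.xml side of that strategy can be sketched with the standard RunExecutableListener; the paths below are the example defaults, not our production layout.

<!-- illustrative: snapshot only after an optimize, so slaves only
     ever pull optimized copies of the index -->
<listener event="postOptimize" class="solr.RunExecutableListener">
  <str name="exe">snapshooter</str>
  <str name="dir">solr/bin</str>
  <bool name="wait">true</bool>
</listener>

The optimize itself and the snappuller runs on the slaves are driven by cron, outside of Solr.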
RE: solr newbie
You can download curl from http://curl.haxx.se/ if you don't have it on your machine.

-D

-----Original Message-----
From: Tim Archambault [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 01, 2006 9:42 AM
To: solr-user@lucene.apache.org
Subject: solr newbie

Trying to run the test tutorial to index an xml file and keep getting an error message: curl: command not found? Any help is greatly appreciated.
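Once curl is installed, the tutorial's indexing step boils down to POSTing the example XML files and then a commit to Solr's update URL, roughly like this (the URL is the tutorial default, and monitor.xml is one of the bundled exampledocs):

curl http://localhost:8983/solr/update --data-binary @monitor.xml -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'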
RE: solr newbie
I wrote just such a client within the last 24h to support load-testing Solr for my application. The client stub is simple and independent of my particular application, so it would be easy for me to contribute it if there is interest. It has methods to add() a document or collection of documents, and commit() and optimize().

-D

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 01, 2006 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: solr newbie

We don't have it yet, but there really should be a simple Java client library that creates the XML add commands and handles sending them to the server.
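To give a rough idea of the shape of such a client, here is a minimal from-scratch sketch (hypothetical names, not the actual code being contributed): it builds the XML update commands by hand and POSTs them to the update URL.

import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;

public class SimpleSolrClient {
    private final URL updateUrl;

    public SimpleSolrClient(String solrBase) throws Exception {
        // e.g. solrBase = "http://localhost:8983/solr"
        this.updateUrl = new URL(solrBase + "/update");
    }

    /** Adds a single document, given as a field-name -> value map. */
    public void add(Map<String, String> doc) throws Exception {
        StringBuffer xml = new StringBuffer("<add><doc>");
        for (Map.Entry<String, String> e : doc.entrySet()) {
            xml.append("<field name=\"").append(e.getKey()).append("\">")
               .append(escape(e.getValue())).append("</field>");
        }
        xml.append("</doc></add>");
        post(xml.toString());
    }

    public void commit() throws Exception { post("<commit/>"); }

    public void optimize() throws Exception { post("<optimize/>"); }

    private void post(String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) updateUrl.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        Writer w = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        w.write(body);
        w.close();
        // drain the response so the connection can be reused
        InputStream in = conn.getInputStream();
        while (in.read() != -1) {}
        in.close();
    }

    private static String escape(String s) {
        // minimal XML escaping for field values
        return s.replaceAll("&", "&amp;").replaceAll("<", "&lt;");
    }
}

Batching many <doc> elements inside a single <add> before POSTing cuts down on HTTP round trips considerably.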
RE: solr newbie
See http://issues.apache.org/jira/browse/SOLR-20.

-D

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 01, 2006 10:15 PM
To: solr-user@lucene.apache.org
Subject: Re: solr newbie

On 6/2/06, Darren Vengroff <[EMAIL PROTECTED]> wrote:
> I wrote just such a client within the last 24h to support load-testing Solr
> for my application. The client stub is simple and independent of my
> particular application, so it would be easy for me to contribute it if there
> is interest. It has methods to add() a document or collection of documents,
> and commit() and optimize().

It would be great to see what you have! If you would like to contribute it, or get feedback on the API, please open a new JIRA bug (feature) and add the code there.

-Yonik
RE: client code for searching?
I've been meaning to write some companion code to do searching. I haven't needed to search from Java yet, so I haven't written it. Expect it in a few weeks given my current schedule.

-D

-----Original Message-----
From: Brian Lucas [mailto:[EMAIL PROTECTED]
Sent: Friday, July 14, 2006 12:10 PM
To: solr-user@lucene.apache.org
Subject: RE: client code for searching?

http://wiki.apache.org/solr/SolJava

-----Original Message-----
From: WHIRLYCOTT [mailto:[EMAIL PROTECTED]
Sent: Friday, July 14, 2006 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: client code for searching?

I did before I sent the email to the list, actually. Is there something specific on the wiki that you're able to point me at?

phil.

On Jul 14, 2006, at 3:00 PM, Brian Lucas wrote:

> Check the wiki, my friend.
> http://wiki.apache.org/solr
>
> -----Original Message-----
> From: WHIRLYCOTT [mailto:[EMAIL PROTECTED]
> Sent: Friday, July 14, 2006 12:35 PM
> To: solr-user@lucene.apache.org
> Subject: Re: client code for searching?
>
> Yes, I need java, but I would be eager to read your python code to
> get some design ideas from it.
>
> phil.
>
> On Jul 14, 2006, at 2:32 PM, Mike Klaas wrote:
>
>> On 7/14/06, WHIRLYCOTT <[EMAIL PROTECTED]> wrote:
>>> Does anybody have some client code for performing searches against a
>>> Solr installation? I've seen the DocumentManagerClient for adding/
>>> dropping/etc docs from the index, but I don't see any client code in
>>> svn anywhere.
>>
>> I've written some client code for doing such in python--I assume
>> you're looking for java?
>>
>> Did the list reach a consensus on where clients for various languages
>> fit into the grand scheme of things?
>>
>> -Mike
>
> --
> Whirlycott
> Philip Jacob
> [EMAIL PROTECTED]
> http://www.whirlycott.com/phil/

--
Whirlycott
Philip Jacob
[EMAIL PROTECTED]
http://www.whirlycott.com/phil/
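In the meantime, a bare-bones search helper needs little more than an HTTP GET against the select URL. A sketch (hypothetical names; it returns the raw response, where a real client would parse the XML):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SimpleSolrSearcher {
    private final String solrBase;

    public SimpleSolrSearcher(String solrBase) {
        // e.g. solrBase = "http://localhost:8983/solr"
        this.solrBase = solrBase;
    }

    /** Runs q against the request handler named by qt; returns raw XML. */
    public String search(String q, String qt) throws Exception {
        URL url = new URL(solrBase + "/select/?q=" + URLEncoder.encode(q, "UTF-8")
                + "&qt=" + URLEncoder.encode(qt, "UTF-8"));
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        StringBuffer response = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null) {
            response.append(line).append('\n');
        }
        in.close();
        return response.toString();
    }

    public static void main(String[] args) throws Exception {
        SimpleSolrSearcher s = new SimpleSolrSearcher("http://localhost:8983/solr");
        System.out.println(s.search("belkin ipod", "dismax"));
    }
}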
RE: Lucene versioning policy
Any update on possible graduation from the incubator? Any chance it could coincide with Hoss's presentation at ApacheCon?

-D

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 08, 2006 11:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Lucene versioning policy

: Also, are there any plans to split solr into a release/development mode?
:
: I'd really like to use solr in a commercial setting, but having nothing but
: nightly builds available makes me uneasy.

I believe that as long as Solr is in the incubator, nightly builds are the only releases we are allowed to have. This is a side note in the incubation policy about exiting incubation...

Note: incubator projects are not permitted to issue an official Release. Test snapshots (however good the quality) and Release plans are OK.

...of course, there is some conflicting info higher up in the same doc that suggests they are allowed, but they require jumping through some hoops...

http://incubator.apache.org/incubation/Incubation_Policy.html#Releases

-Hoss