RE: Two Solr Announcements: CNET Product Search and DisMax

Darren Vengroff Fri, 26 May 2006 17:15:31 -0700

Chris,

Cool stuff.  Congratulations on launching.


I have a few scaling questions I hope you might be able to answer for me.
I'm keen to understand how solr performs under production loads with
significant real-time update traffic.  Specifically,

1. How many searches per second are you currently handling?
2. How big is the solr fleet?
3. What is your update rate?
4. What is the propagation delay from master to slave, i.e. how often do you
propagate and how long does it take per box?
5. What is your optimization schedule and how does it affect overall
performance of the system?

If anyone else out there has similar data from large-scale experiences with
solr, I'd love to hear those too.

Thanks,

-D

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Saturday, May 20, 2006 3:18 PM
To: solr-user@lucene.apache.org
Subject: Two Solr Announcements: CNET Product Search and DisMax


I've got two related announcements to make, which I think are pretty
cool...

The first is that the Search result pages for CNET Shopper.com are now
powered by Solr.  You may be thinking "Didn't he announce that last year?"
... not quite.  CNET's faceted product listing pages for browsing products
by category have been powered by Solr for about a year now, but up
until a few weeks ago, searching for products by keywords was still
powered by a legacy system.  I was working hard to come up with a good
mechanism for building Lucene queries based on user input, that would
allow us to leverage our "domain expertise" about consumer technology
products to ensure that users got the best matches.

Which brings me to my second announcement:  I've just committed a new
SolrRequestHandler called the "DisMaxRequestHandler" into the Solr subversion
repository.

This query handler supports a simplified version of the Lucene QueryParser
syntax.  Quotes can be used to group phrases, and +/- can be used to
denote mandatory and prohibited clauses ... but all other Lucene query
parser special characters are escaped to simplify the user experience.
The handler takes responsibility for building a good query from the user's
input using BooleanQueries containing DisjunctionMaxQueries across fields
and boosts you specify.  It also allows you to provide additional boosting
queries, boosting functions, and filtering queries to artificially affect
the outcome of all searches. These options can all be specified as init
parameters for the handler in your solrconfig.xml or overridden in the Solr
query URL.
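
As a rough illustration of the "DisjunctionMax" part: Lucene's
DisjunctionMaxQuery scores a document by its best-matching field, and the
other matching fields contribute only through a tie-breaker multiplier.
The sketch below is not the handler's code. The per-field scores are made
up, and the function and parameter names (dismax_score, tie_breaker) are
hypothetical names chosen for this example.

```python
def dismax_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores the way a disjunction-max query does.

    The best-scoring field wins; the remaining matching fields are added
    in only after being scaled by the tie-breaker multiplier.
    """
    matching = [score for score in field_scores.values() if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Hypothetical scores for one query term against three fields of one document.
scores = {"name": 2.0, "cat": 1.5, "text": 0.3}

print(dismax_score(scores))                   # best field only
print(dismax_score(scores, tie_breaker=0.1))  # best field plus 10% of the rest
```

With a tie-breaker of zero, a word that matches in several fields scores no
higher than one that matches only in the best field, which is the usual
reason to prefer a max over a plain sum across fields.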

The code in this plugin is what is now powering CNET product search.

I've updated the "example" solrconfig.xml to take advantage of it; you can
take it for a spin right now if you build from scratch using subversion,
otherwise you'll have to wait for the solr-2006-05-21.zip nightly release
due out in a few hours.  Once you've got it, the javadocs for
DisMaxRequestHandler contain the details about all of the options it
supports, and here are a few URLs you can try out using the product data
in the exampledocs directory...

Normal results for the word "video" using the StandardRequestHandler with
the default search field...
  http://localhost:8983/solr/select/?q=video&fl=name+score&qt=standard

The "dismax" handler is configured to search across the text, features,
name, sku, id, manu, and cat fields all with varying boosts designed to
ensure that "better" matches appear first, specifically: documents which
match on the name and cat fields get higher scores...
  http://localhost:8983/solr/select/?q=video&qt=dismax

...note that this instance is also configured with a default field list,
which can be overridden in the URL...
  http://localhost:8983/solr/select/?q=video&qt=dismax&fl=*,score

You can also override which fields are searched on, and how much boost
each field gets...
  http://localhost:8983/solr/select/?q=video&qt=dismax&qf=features^20.0+text^0.3
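
If you would rather build these URLs from a client than type them by hand,
something like the following sketch may help. It assumes the example server
on localhost:8983 from the URLs above; the helper name dismax_url is
hypothetical. Note that urlencode percent-encodes "^" as %5E and turns the
space between qf entries into "+", so the result is equivalent to the
hand-written URL above rather than byte-for-byte identical.

```python
from urllib.parse import urlencode

# Base URL of the example Solr instance used throughout this message.
BASE = "http://localhost:8983/solr/select/"


def dismax_url(query, **params):
    """Build a select URL for the handler registered as qt=dismax.

    Extra keyword arguments (qf, fl, debugQuery, ...) are passed through
    as URL parameters and override the handler's configured defaults.
    """
    all_params = {"q": query, "qt": "dismax"}
    all_params.update(params)
    return BASE + "?" + urlencode(all_params)


# Search only features and text, with per-field boosts.
url = dismax_url("video", qf="features^20.0 text^0.3")
print(url)
```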

Another instance of the handler is registered using the qt "instock" and
has slightly different configuration options, notably: a filter for (you
guessed it) inStock:true...
  http://localhost:8983/solr/select/?q=video&qt=dismax&fl=name,score,inStock
  http://localhost:8983/solr/select/?q=video&qt=instock&fl=name,score,inStock

One of the other really cool features in this handler is robust
support for specifying the "BooleanQuery.minimumNumberShouldMatch" you
want to be used based on how many terms are in your user's query.
This allows flexibility for typos and partial matches.  For the
dismax handler, 1 and 2 word queries require that all of the optional
clauses match, but for 3-5 word queries one missing word is allowed...
  http://localhost:8983/solr/select/?q=belkin+ipod&qt=dismax
  http://localhost:8983/solr/select/?q=belkin+ipod+gibberish&qt=dismax
  http://localhost:8983/solr/select/?q=belkin+ipod+apple&qt=dismax
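
The rule described above can be sketched in a few lines. This is only an
illustration of the stated behaviour (1 and 2 word queries need every
optional clause, 3-5 word queries may miss one), not the handler's
implementation. The function names and the sample document terms are
hypothetical, and since nothing above says what happens beyond 5 words,
the sketch refuses to guess.

```python
def min_should_match(num_optional_clauses):
    """Return how many optional clauses must match for a query of this size."""
    if num_optional_clauses < 1:
        raise ValueError("need at least one optional clause")
    if num_optional_clauses <= 2:
        return num_optional_clauses      # 1-2 words: all must match
    if num_optional_clauses <= 5:
        return num_optional_clauses - 1  # 3-5 words: one may be missing
    raise ValueError("behaviour beyond 5 clauses is not described here")


def is_match(query_terms, doc_terms):
    """True if enough of the query's optional terms occur in the document."""
    required = min_should_match(len(query_terms))
    hits = sum(1 for term in query_terms if term in doc_terms)
    return hits >= required


# Hypothetical document terms, not taken from the exampledocs data.
doc = {"belkin", "ipod", "apple"}

print(is_match(["belkin", "ipod"], doc))               # both words present
print(is_match(["belkin", "ipod", "gibberish"], doc))  # one miss out of three
print(is_match(["belkin", "gibberish"], doc))          # one miss out of two
```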

Just like the StandardRequestHandler, it supports the debugQuery
option for viewing the parsed query, and the score explanations for each
doc...
  http://localhost:8983/solr/select/?q=belkin+ipod+gibberish&qt=dismax&debugQuery=1
  http://localhost:8983/solr/select/?q=video+card&qt=dismax&debugQuery=1


...That's the overall gist of it.  I hope other people find it useful
out of the box -- and even if it doesn't meet your needs, hopefully it
gives you some good ideas of the types of things that can be done in a
SolrRequestHandler that aren't supported natively with the Lucene
QueryParser.  If you do decide to write your own handler, make sure to
take a look at the new SolrPluginUtils class as well -- it provides
some nice reusable methods that came in handy when writing the
DisMaxRequestHandler.




-Hoss
