RE: Two Solr Announcements: CNET Product Search and DisMax

2006-05-26 Thread Darren Vengroff
Chris,

Cool stuff.  Congratulations on launching.

I have a few scaling questions I hope you might be able to answer for me.
I'm keen to understand how solr performs under production loads with
significant real-time update traffic.  Specifically,

1. How many searches per second are you currently handling?
2. How big is the solr fleet?
3. What is your update rate?
4. What is the propagation delay from master to slave, i.e. how often do you
propagate and how long does it take per box?
5. What is your optimization schedule and how does it affect overall
performance of the system?

If anyone else out there has similar data from large-scale experiences with
solr, I'd love to hear those too.

Thanks,

-D

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Saturday, May 20, 2006 3:18 PM
To: solr-user@lucene.apache.org
Subject: Two Solr Announcements: CNET Product Search and DisMax


I've got two related announcements to make, which I think are pretty
cool...

The first is that the Search result pages for CNET Shopper.com are now
powered by Solr.  You may be thinking "Didn't he announce that last year?"
... not quite.  CNET's faceted product listing pages for browsing products
by category have been powered by Solr for about a year now, but up
until a few weeks ago, searching for products by keywords was still
powered by a legacy system.  I was working hard to come up with a good
mechanism for building Lucene queries based on user input, that would
allow us to leverage our "domain expertise" about consumer technology
products to ensure that users got the best matches.

Which brings me to my second announcement:  I've just committed a new
SolrQueryHandler called the "DisMaxQueryHandler" into the Solr subversion
repository.

This query handler supports a simplified version of the Lucene QueryParser
syntax.  Quotes can be used to group phrases, and +/- can be used to
denote mandatory and optional clauses ... but all other Lucene query
parser special characters are escaped to simplify the user experience.
The handler takes responsibility for building a good query from the user's
input using BooleanQueries containing DisjunctionMaxQueries across fields
and boosts you specify.  It also allows you to provide additional boosting
queries, boosting functions, and filtering queries to artificially affect
the outcome of all searches. These options can all be specified as init
parameters for the handler in your solrconfig.xml or overridden in the Solr
query URL.
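
As a rough illustration, a registration for such a handler in solrconfig.xml
might look like the following (the field names, boosts, and boost query here
are illustrative guesses based on the example schema, not CNET's actual
configuration):

```xml
<!-- sketch of registering the DisMax handler with default params;
     qf lists the fields to search with per-field boosts, and bq is
     an optional boosting query applied to every search -->
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0</str>
    <str name="bq">cat:electronics^5.0</str>
    <str name="fl">name,score</str>
  </lst>
</requestHandler>
```

Any of these defaults can then be overridden per-request on the query URL,
as the examples below show.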

The code in this plugin is what is now powering CNET product search.

I've updated the "example" solrconfig.xml to take advantage of it, you can
take it for a spin right now if you build from scratch using subversion,
otherwise you'll have to wait for the solr-2006-05-21.zip nightly release
due out in a few hours.  Once you've got it, the javadocs for
DisMaxRequestHandler contain the details about all of the options it
supports, and here are a few URLs you can try out using the product data
in the exampledocs directory...

Normal results for the word "video" using the StandardRequestHandler with
the default search field...
  http://localhost:8983/solr/select/?q=video&fl=name+score&qt=standard

The "dismax" handler is configured to search across the text, features,
name, sku, id, manu, and cat fields all with varying boosts designed to
ensure that "better" matches appear first, specifically: documents which
match on the name and cat fields get higher scores...
  http://localhost:8983/solr/select/?q=video&qt=dismax

...note that this instance is also configured with a default field list,
which can be overridden in the URL...
  http://localhost:8983/solr/select/?q=video&qt=dismax&fl=*,score

You can also override which fields are searched on, and how much boost
each field gets...
 
  http://localhost:8983/solr/select/?q=video&qt=dismax&qf=features^20.0+text^0.3

Another instance of the handler is registered using the qt "instock" and
has slightly different configuration options, notably: a filter for (you
guessed it) inStock:true...
  http://localhost:8983/solr/select/?q=video&qt=dismax&fl=name,score,inStock
 
  http://localhost:8983/solr/select/?q=video&qt=instock&fl=name,score,inStock

One of the other really cool features in this handler is robust
support for specifying the "BooleanQuery.minimumNumberShouldMatch" you
want to be used based on how many terms are in your user's query.
This allows flexibility for typos and partial matches.  For the
dismax handler, 1 and 2 word queries require that all of the optional
clauses match, but for 3-5 word queries one missing word is allowed...
  http://localhost:8983/solr/select/?q=belkin+ipod&qt=dismax
  http://localhost:8983/solr/select/?q=belkin+ipod+gibberish&qt=dismax
  http://localhost:8983/solr/select/?q=belkin+ipod+apple&qt=dismax
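
The rule described above can be sketched as a small function (a hypothetical
illustration of the min-should-match behavior as configured in this example,
not the actual DisMax implementation, and the behavior beyond 5 terms is a
guess):

```python
def min_should_match(num_terms: int) -> int:
    """Sketch of the minimum-should-match rule described above:
    1-2 word queries require every optional clause to match, while
    3-5 word queries tolerate one missing word."""
    if num_terms <= 2:
        return num_terms        # all clauses must match
    if num_terms <= 5:
        return num_terms - 1    # one missing word is allowed
    return num_terms - 1        # longer queries: configurable; assumed here
```

So "belkin ipod gibberish" (3 terms) can still match documents containing
only "belkin" and "ipod", as the URLs above demonstrate.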

Just like the StandardRequestHandler, it supports the debugQuery
option for viewing the parsed query and the score explanations for each
doc...

http://localhost:8

RE: Two Solr Announcements: CNET Product Search and DisMax

2006-05-28 Thread Darren Vengroff
Thanks Hoss.  This is really useful information.

I understand you may not be able to answer 1 and 2 directly, so how about if
I combine them into one question that doesn't require you to release quite
as much information.  Could you tell me how many tps you do per box, and a
rough spec of what the boxes are?  I.e. the ratio of the answers to
questions 1 and 2.

Thanks,

-D

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Saturday, May 27, 2006 1:23 PM
To: solr-user@lucene.apache.org
Subject: RE: Two Solr Announcements: CNET Product Search and DisMax


: I have a few scaling questions I hope you might be able to answer for me.
: I'm keen to understand how solr performs under production loads with
: significant real-time update traffic.  Specifically,

These are all really good questions ... unfortunately I'm not sure that
I'm permitted to give out specific answers to some of them.  As far as
understanding Solr's ability to stand up under load, I'll see if I can get
some time/permission to run some benchmarks and publish the numbers (or
perhaps Yonik can do this as part of his prep for presenting at
ApacheConEU ... what do you think Yonik?)

: 1. How many searches per second are you currently handling?
: 2. How big is the solr fleet?

I'm going to have to put Q1 and Q2 in the "decline to state" category.

: 3. What is your update rate?

Hard to say ... I can tell you that our index contains roughly N
documents, and doing some greps of our logs I can see that on a typical
day our "master" server receives about N/2 "<add>" commands ... but this
doesn't mean that half of our logical data is changing every day; most of
those updates are to the same logical documents over and over, but it does
give you an idea of the amount of churn in lucene documents that's taking
place in a 24 hour period.

I should also point out that most of these updates are coming in big
spurts, but we've never encountered a situation where waiting for Solr to
index a document was a bottleneck -- pulling document data from our
primary datastore always takes longer than indexing the docs.

: 4. What is the propagation delay from master to slave, i.e. how often do
: you propagate and how long does it take per box?
: 5. What is your optimization schedule and how does it affect overall
: performance of the system?

The answers to Q4 and Q5 are related, and involve telling a much longer
story...

A year ago when we first started using Solr for faceted browsing, it had
Lucene 1.4.3 under the hood.  Our updating strategy involved issuing
commit commands after every batch of updates (where a single batch was
never bigger than 2000 documents) with snapshooter configured in a
postCommit listener, and snappuller on the slaves running every 10
minutes.  We optimized twice a day, but while optimizing we disabled the
processes that sent updates because optimizing could easily take 20-30
minutes.  The index had thousands of indexed fields to support the
faceting we wanted and this was the cause of our biggest performance
issue: the space needed for all those field norms.  (Yonik implemented the
OMIT_NORMS option in Lucene 1.9 to deal with this.)

When we upgraded to Lucene 1.9 and started adding support for text
searching, our index got significantly smaller (even though we were adding
a lot of new tokenized fields) thanks to being able to turn off norms for
all of those existing faceting fields.  The other great thing about using
1.9 was that optimizing got a lot faster (I'm not certain if it's just
because of the reduced number of fields with norms, or if some other
improvement was made to how optimize works in Lucene 1.9).  Optimizing our
index now typically takes only ~1 minute; the longest I've seen it take is
5 minutes.

While doing a lot of prelaunch profiling, we discovered that under extreme
loads, there was a huge difference in the outliers between an optimized and
a non-optimized index -- we always knew querying an optimized index was
faster on average than querying an unoptimized index, we just didn't
realize how big the gap got when you looked at the non-average cases.

Sooo... since optimize times got so much shorter, and the benefits of
always querying an optimized index were so easy to see, we changed the
solrconfig.xml for our master to only snapshoot on postOptimize, modified
our optimize cron to run every 30 minutes, and modified the snappuller
crons on the slaves to check for new snapshots more often (5 minutes, I
think).

This means we are only ever snappulling complete copies of our index,
twice an hour.  So the typical max delay in how long it takes for an
update on the master to show up on the slave is ~35 minutes -- the average
delay being 15-20 minutes.

If we were concerned about reducing this delay we could (even with our
current strategy of only pulling optimized indexes to the slaves), but this
is fast enough for our purposes, and allows us to really take advantage
of the filterCaches on the slaves.
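
[For readers following along, that schedule boils down to crontab entries
roughly like these -- the script paths are hypothetical, though snappuller
and snapinstaller are the standard Solr collection-distribution scripts:]

```
# master: optimize every 30 minutes; the postOptimize listener
# runs snapshooter to publish a snapshot of the optimized index
*/30 * * * *  /opt/solr/bin/optimize-index.sh

# each slave: check for, pull, and install new snapshots every 5 minutes
*/5 * * * *   /opt/solr/bin/snappuller && /opt/solr/bin/snapinstaller
```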

RE: solr newbie

2006-06-01 Thread Darren Vengroff
You can download curl from http://curl.haxx.se/ if you don't have it on your
machine.

-D

-Original Message-
From: Tim Archambault [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 01, 2006 9:42 AM
To: solr-user@lucene.apache.org
Subject: solr newbie

Trying to run the test tutorial to index an xml file and keep getting an
error message: curl: command not found?

Any help is greatly appreciated.



RE: solr newbie

2006-06-01 Thread Darren Vengroff
I wrote just such a client within the last 24h to support load-testing Solr
for my application.  The client stub is simple and independent of my
particular application, so it would be easy for me to contribute it if there
is interest.  It has methods to add() a document or collection of documents,
and commit() and optimize().
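
(For the curious, the core of such a client is just building the update XML
and POSTing it.  Here is a minimal sketch in Python -- function names are
made up for illustration, and the update URL assumes the tutorial's default
port; the client described above was for Java:)

```python
import urllib.request
from xml.sax.saxutils import escape

def add_doc_xml(fields: dict) -> str:
    """Build the XML <add> command Solr's update handler expects.
    Assumes simple field names; real code should also quote attributes."""
    body = "".join(
        f'<field name="{escape(name)}">{escape(str(value))}</field>'
        for name, value in fields.items()
    )
    return f"<add><doc>{body}</doc></add>"

def post_update(xml: str,
                url: str = "http://localhost:8983/solr/update") -> bytes:
    """POST any update command (add/commit/optimize) to Solr."""
    req = urllib.request.Request(
        url, data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# commit() and optimize() are just two more update commands:
#   post_update("<commit/>")
#   post_update("<optimize/>")
```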

-D

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 01, 2006 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: solr newbie

We don't have it yet, but there really should be a simple Java client
library that creates the XML add commands and handles sending them to
the server.



RE: solr newbie

2006-06-01 Thread Darren Vengroff
See http://issues.apache.org/jira/browse/SOLR-20.

-D

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 01, 2006 10:15 PM
To: solr-user@lucene.apache.org
Subject: Re: solr newbie

On 6/2/06, Darren Vengroff <[EMAIL PROTECTED]> wrote:
> I wrote just such a client within the last 24h to support load-testing
> Solr for my application.  The client stub is simple and independent of my
> particular application, so it would be easy for me to contribute it if
> there is interest.  It has methods to add() a document or collection of
> documents, and commit() and optimize().

It would be great to see what you have!
If you would like to contribute it, or get feedback on the API, please
open a new JIRA bug (feature) and add the code there.

-Yonik



RE: client code for searching?

2006-07-15 Thread Darren Vengroff
I've been meaning to write some companion code to do searching.  I haven't
needed to search from Java yet, so I haven't written it.  Expect it in a few
weeks given my current schedule.

-D

-Original Message-
From: Brian Lucas [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 14, 2006 12:10 PM
To: solr-user@lucene.apache.org
Subject: RE: client code for searching?

http://wiki.apache.org/solr/SolJava


-Original Message-
From: WHIRLYCOTT [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 14, 2006 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: client code for searching?

I did before I sent the email to the list, actually.  Is there  
something specific on the wiki that you're able to point me at?

phil.

On Jul 14, 2006, at 3:00 PM, Brian Lucas wrote:

> Check the wiki, my friend.
> http://wiki.apache.org/solr
>
>
>
> -Original Message-
> From: WHIRLYCOTT [mailto:[EMAIL PROTECTED]
> Sent: Friday, July 14, 2006 12:35 PM
> To: solr-user@lucene.apache.org
> Subject: Re: client code for searching?
>
> Yes, I need java, but I would be eager to read your python code to
> get some design ideas from it.
>
> phil.
>
> On Jul 14, 2006, at 2:32 PM, Mike Klaas wrote:
>
>> On 7/14/06, WHIRLYCOTT <[EMAIL PROTECTED]> wrote:
>>> Does anybody have some client code for performing searches against a
>>> Solr installation?  I've seen the DocumentManagerClient for adding/
>>> dropping/etc docs from the index, but I don't see any client code in
>>> svn anywhere.
>>
>> I've written some client code for doing such in python--I assume
>> you're looking for java?
>>
>> Did the list reach a consensus on where client for various languages
>> fit into the grand scheme of things?
>>
>> -Mike
>
>
> --
> Whirlycott
> Philip Jacob
> [EMAIL PROTECTED]
> http://www.whirlycott.com/phil/
>
>


--
Whirlycott
Philip Jacob
[EMAIL PROTECTED]
http://www.whirlycott.com/phil/




RE: Lucene versioning policy

2006-08-16 Thread Darren Vengroff
Any update on possible graduation from the incubator?  Any chance it could
coincide with Hoss's presentation at ApacheCon?

-D

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 08, 2006 11:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Lucene versioning policy


: Also, are there any plans to split solr into a release/development mode?
:
: I'd really like to use solr in a commercial setting, but having nothing
: but nightly builds available makes me uneasy.

I believe that as long as Solr is in the incubator, nightly builds are the
only releases we are allowed to have.  This is a side note in the
incubation policy about exiting incubation...

   Note: incubator projects are not permitted to issue an official
   Release. Test snapshots (however good the quality) and Release
   plans are OK.

...of course, there is some conflicting info higher up in the same doc
that suggests they are allowed, but they require jumping through some
hoops...

http://incubator.apache.org/incubation/Incubation_Policy.html#Releases


-Hoss