Re: stylesheet issue

2006-06-02 Thread Yonik Seeley

On 6/2/06, Tim Archambault <[EMAIL PROTECTED]> wrote:

I've got solr installed and running, with only one failure left to date.
Whenever I try to select a stylesheet for my search, I get an error message
such as this:


Hi Tim,

There is no stylesheet :-)

It's a hold-over from an old XML format that Solr used to support
before it was open-sourced.  That old XML format was for compatibility
with another internal product.  It turned out that it wasn't flexible
enough to add extra info like multiple result sets, or faceted
browsing info, so we came up with v2 of the XML (but no new stylesheet
to go with it).

The XML is fairly readable though, so it hasn't been much of a problem
in practice.

-Yonik


Re: stylesheet issue

2006-06-02 Thread Tim Archambault

That'll be fine. As you can probably tell, I'm not a programmer. I am just a
dangerous end-user with expertise in marketing & online operations trying to
save a buck. I am going to try to learn XSL or if that doesn't work, I'll
bastardize the results into a coldfusion recordset.
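
For anyone in the same position, here is a minimal sketch of flattening an XML search response into rows, roughly what a recordset would hold. The sample response is an illustrative assumption about the shape of the XML (a result element containing doc elements whose children carry a name attribute), not output captured from a real server, and the field names are made up.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: the structure is assumed, the field names are hypothetical.
SAMPLE_RESPONSE = """<response>
  <result numFound="2" start="0">
    <doc>
      <str name="id">job-101</str>
      <str name="title">Marketing Manager</str>
    </doc>
    <doc>
      <str name="id">auto-202</str>
      <str name="title">2004 Subaru Outback</str>
    </doc>
  </result>
</response>"""


def response_to_rows(xml_text):
    """Return one dict per <doc>, keyed by each child's name attribute."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.iter("doc"):
        rows.append({field.get("name"): field.text for field in doc})
    return rows


rows = response_to_rows(SAMPLE_RESPONSE)
for row in rows:
    print(row["id"], "-", row["title"])
```

Each row is a plain dictionary, so handing the list to another templating or database layer needs no XSL at all.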

I know I shouldn't ask you questions directly, but I have to ask you.

How many queries per minute can Solr handle in a high use situation? Our
website gets about 4 million page views a month and about 40,000 daily
visitors, which is about an hour for CNET probably. I am envisioning Solr
being the search engine for our jobs, autos, classifieds, and as a "global"
search experience that includes them all. I really want to greatly limit the
use of database connections on our site. Do you think Solr can be a "global"
solution for search on our site. It's one thing to test, yet another in a
production environment.

Which java-based web server component do you recommend for a windows
platform? Tomcat? Another? I know nothing about these tools. I am using
Jetty for testing.

Thank you for all your help.

Tim






Re: stylesheet issue

2006-06-02 Thread Yonik Seeley

On 6/2/06, Tim Archambault <[EMAIL PROTECTED]> wrote:

That'll be fine. As you can probably tell, I'm not a programmer. I am just a
dangerous end-user with expertise in marketing & online operations trying to
save a buck. I am going to try to learn XSL or if that doesn't work, I'll
bastardize the results into a coldfusion recordset.

I know I shouldn't ask you questions directly, but I have to ask you.

How many queries per minute can Solr handle in a high use situation?


It depends on how many documents are in the collection, the nature of
the documents (unique terms, size of fields, etc), and heavily depends
on the nature of the queries, and the CPU and memory of your hardware.

I've seen up to 1000 queries/sec for very simple queries on a 1M doc index.


Our
website gets about 4 million page views a month and about 40,000 daily
visitors,


That shouldn't be a problem unless the collection is just too big.
It's pretty easy to scale Solr to higher query traffic by putting more
query servers behind a load balancer, *provided* that the latency of a
single query is acceptable.  If the collection is too big (too many
documents, or documents that are too big), then you need to split up the
collection and use federated search (Solr doesn't have it yet, but it
will in the future).
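
To put the traffic figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: the 30-day month, the one-search-per-page-view ratio, and the 10x peak factor are assumptions chosen for illustration, not measurements from the thread.

```python
# Figures quoted in the thread.
PAGE_VIEWS_PER_MONTH = 4_000_000
OBSERVED_QPS = 1000  # "very simple queries on a 1M doc index"

# Assumptions for illustration only.
DAYS_PER_MONTH = 30
SEARCHES_PER_PAGE_VIEW = 1   # pessimistic: every page view runs a search
PEAK_TO_AVERAGE_RATIO = 10   # hypothetical burst factor

seconds_per_month = DAYS_PER_MONTH * 24 * 60 * 60
average_qps = PAGE_VIEWS_PER_MONTH * SEARCHES_PER_PAGE_VIEW / seconds_per_month
peak_qps = average_qps * PEAK_TO_AVERAGE_RATIO

print(f"average: {average_qps:.2f} queries/sec ({average_qps * 60:.0f} per minute)")
print(f"assumed peak: {peak_qps:.1f} queries/sec")
print(f"headroom against {OBSERVED_QPS} queries/sec: {OBSERVED_QPS / peak_qps:.0f}x")
```

Even under these pessimistic assumptions the load stays far below the quoted figure, although that figure applies only to very simple queries and says nothing about the latency of more complex ones.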


I am envisioning Solr
being the search engine for our jobs, autos, classifieds, and as a "global"
search experience that includes them all. I really want to greatly limit the
use of database connections on our site. Do you think Solr can be a "global"
solution for search on our site.


By "global" do you mean Solr as the search solution for all those
collections, or do you mean having all those different types of
documents (jobs, autos, classifieds) in a single Solr index?

Unless there is a good reason to put multiple document types in the
same index, you will get better performance by putting them in their
own index.


Which java-based web server component do you recommend for a windows
platform? Tomcat? Another? I know nothing about these tools. I am using
Jetty for testing.


Tomcat is the most widely used I think... and therefore easier to find
docs and find help/support for it.  I started a little Tomcat
installation guide on the Wiki last night.

-Yonik


Re: stylesheet issue

2006-06-02 Thread Tim Archambault

> By "global" do you mean Solr as the search solution for all those
> collections, or do you mean having all those different types of
> documents (jobs, autos, classifieds) in a single Solr index?

Yes I did. I envisioned separating them by custom fields named "vertical"
and then within vertical "category".

> Unless there is a good reason to put multiple document types in the
> same index, you will get better performance by putting them in their
> own index.

So my educated guess would be that I would create additional "schema" xml
elements in my schema.xml separately for jobs, homes, cars, news, obits, etc.
(in the tutorial, I note the schema name "example") and my search query
strings would have to specify which schema to use in the query, but I don't
see a variable for "schema".

NumDocs: It looks like I am going to have an index of about 300,000
documents initially and should grow by about 150 per day.





Re: stylesheet issue

2006-06-02 Thread Yonik Seeley

On 6/2/06, Tim Archambault <[EMAIL PROTECTED]> wrote:

So my educated guess would be that I would create additional "schema" xml


Solr doesn't support multiple schemas.  The current way to do this is
to run multiple instances of Solr.  Another way is to run multiple
Solr webapps in the same servlet container... slightly harder for
config, but easier on memory.
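
With one instance or webapp per document type, the client has to pick the right one for each search. A minimal sketch of that routing follows; the webapp names, host, and port are hypothetical, chosen only to show the idea.

```python
from urllib.parse import urlencode

# Hypothetical layout: one Solr webapp per document type in the same container.
INSTANCE_URLS = {
    "jobs": "http://localhost:8080/solr-jobs",
    "autos": "http://localhost:8080/solr-autos",
    "classifieds": "http://localhost:8080/solr-classifieds",
}


def select_url(vertical, query):
    """Build the search URL for the instance that holds this vertical."""
    base = INSTANCE_URLS[vertical]  # raises KeyError for an unknown vertical
    params = urlencode({"q": query})
    return f"{base}/select?{params}"


print(select_url("jobs", "registered nurse"))
```

The trade-off is that a search spanning several verticals becomes one request per instance, with the results merged on the client side.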


NumDocs: It looks like I am going to have an index of about 300,000
documents initially and should grow by about 150 per day..


300,000 isn't too bad at all... you should be able to get away with
adding all the different document types to the same index.  If you
want to be able to search across multiple verticals in a single
request, this is the way to go.

You could always split it out later if performance becomes an issue.
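
As a sketch of the single-index approach, the query below restricts a search with the "vertical" and "category" fields Tim described, using field-qualified Lucene query syntax. The base URL and the example values are assumptions; the fields would have to be defined in schema.xml.

```python
from urllib.parse import urlencode

# Hypothetical base URL for a single Solr instance holding every document type.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_query(terms, vertical=None, category=None):
    """Combine free-text terms with optional vertical/category restrictions."""
    clauses = [f"({terms})"]
    if vertical:
        clauses.append(f"vertical:{vertical}")
    if category:
        clauses.append(f"category:{category}")
    return " AND ".join(clauses)


def search_url(terms, vertical=None, category=None):
    """Return the full select URL for the combined query."""
    query = build_query(terms, vertical, category)
    return SOLR_SELECT + "?" + urlencode({"q": query})


print(build_query("nurse", vertical="jobs", category="healthcare"))
print(search_url("subaru"))  # no restriction: searches every vertical at once
```

Leaving out the vertical gives the "global" search across everything, which is the main benefit of keeping all document types in one index.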

-Yonik


Re: stylesheet issue

2006-06-02 Thread Tim Archambault

Thanks again for all your help. You've been great. Someday I may want to
convert our xml archives into the search, but not yet. Sounds like Solr will
be more scalable in the future and that may be feasible. Have a great
weekend.




Re: SolPHP

2006-06-02 Thread Michael J. Giarlo

Looking forward to those, Brian.  Thanks!

Will you post an announcement to the list or should we ping the wiki 
every so often?


-Mike


Brian Lucas wrote:

Erik,

I'll get the PHP bindings out to see how they suit the needs of people and
use that feedback for the Rails bindings.  I'm looking forward to seeing how
they could be implemented as well.  
Brian




Re: SolPHP

2006-06-02 Thread Erik Hatcher


On Jun 2, 2006, at 12:43 PM, Michael J. Giarlo wrote:
Will you post an announcement to the list or should we ping the  
wiki every so often?


wiki changes get e-mailed to solr-commits@lucene.apache.org - simply  
subscribe ([EMAIL PROTECTED]) and you'll see  
changes as they happen.


Re: SolPHP

2006-06-02 Thread Yonik Seeley

On 6/2/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:

On Jun 2, 2006, at 12:43 PM, Michael J. Giarlo wrote:
> Will you post an announcement to the list or should we ping the
> wiki every so often?

wiki changes get e-mailed to solr-commits@lucene.apache.org - simply
subscribe ([EMAIL PROTECTED]) and you'll see
changes as they happen.


All CVS commits go to solr-commits too.
Subscribers to solr-dev that want to get more involved in development
should probably subscribe to the commits list too.

Current subscriber counts:
solr-user: 94
solr-dev: 45
solr-commits: 17

That's pretty good considering we've only been in the incubator a few
months, and we've only been "advertising" on the Lucene mailing list!

-Yonik


Re: stylesheet issue

2006-06-02 Thread Chris Hostetter
: There is no stylesheet :-)
:
: It's a hold-over from an old XML format that Solr used to support
: before it was open-sourced.  That old XML format was for compatibility
: with another internal product.  It turned out that it wasn't flexible
: enough to add extra info like multiple result sets, or faceted
: browsing info, so we came up with v2 of the XML (but no new stylesheet
: to go with it).
:
: The XML is fairly readable though, so it hasn't been much of a problem
: in practice.

Yeah ... the whole way the stylesheet param is handled has always kind of
bugged me ... in the back of my mind, I've been thinking that the right
thing to do would be to change it so if it's specified, the string is used
verbatim as the stylesheet URL instead of the current practice of
assuming it's in the admin directory -- that way people could either
specify fully qualified URLs on another host, or quasi-relative paths
rooted with / on another webapp of the current host/port, or it could even
be a reference to get-files.jsp so they could store the XSLTs in their
./solr directory.

another way to go if we add init() params to QueryResponseWriter would be
to make the XmlResponseWriter take in a NamedList of alias=>URL mappings
of all the stylesheets it wanted to support (which could still be served
via get-files.jsp)
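
The two ideas above can be sketched side by side. This is an illustration only, written in Python rather than against Solr's actual classes: the alias table, the URLs, and the function names are all hypothetical, and the only standard piece is the xml-stylesheet processing instruction itself.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical alias => URL mappings, standing in for the proposed init() params.
STYLESHEET_ALIASES = {
    "default": "/xsl/default.xsl",
    "brief": "http://static.example.com/xsl/brief.xsl",
}


def resolve_verbatim(stylesheet_param):
    """First idea: use the parameter value, unchanged, as the stylesheet URL."""
    return stylesheet_param


def resolve_alias(stylesheet_param, aliases=STYLESHEET_ALIASES):
    """Second idea: accept only configured aliases; return None if unknown."""
    return aliases.get(stylesheet_param)


def processing_instruction(url):
    """Render the xml-stylesheet processing instruction for a resolved URL."""
    return f'<?xml-stylesheet type="text/xsl" href={quoteattr(url)}?>'


print(processing_instruction(resolve_verbatim("/other-webapp/results.xsl")))
print(processing_instruction(resolve_alias("brief")))
```

The verbatim version is the more flexible of the two, while the alias version lets the server decide which stylesheets a request may reference.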


-Hoss



Re: SolPHP

2006-06-02 Thread Michael J. Giarlo

Yonik Seeley wrote:


That's pretty good considering we've only been in the incubator a few
months, and we've only been "advertising" on the Lucene mailing list!



I bet the NINES project and Bess Sadler's blog post about it increased 
your visibility in the library world tenfold.


That's how I heard about it, at least.  I'm looking for ways to replace 
our current library website (and EAD) search, preferably something 
Lucene-based.  Currently looking at nutch and solr, and trying to figure 
out which is more relevant to what we're doing.


-Mike


Solr with BDB

2006-06-02 Thread jason rutherglen
One of the things we're running into is doing concurrent updates on a document.
Robert Engels pointed at a solution, which is to save pending operations. I
think it's possible to go one step further.

One way to solve this would be to use BDB as storage for the actual data
(perhaps also the Lucene index, not sure yet).  This would allow for updates,
deletions, and versioning, without forcing Solr to store all fields and data.
Each BDB row would have a marker that Solr could check periodically (on a
commit) to decide whether the row needs to be reindexed.  I think this would
solve the concurrency issues.

Any thoughts?
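
A toy sketch of the marker idea, to make the proposal concrete. It uses a plain dictionary as a stand-in for BDB and another as a stand-in for the Lucene index, so it shows only the bookkeeping (versions, markers, reindex on commit); it does no locking and is not a real answer to the concurrency question. All names are hypothetical.

```python
class MarkedStore:
    """In-memory stand-in for the BDB store: each row carries a dirty marker."""

    def __init__(self):
        # doc_id -> {"fields": dict or None, "version": int, "dirty": bool}
        self.rows = {}

    def put(self, doc_id, fields):
        """Store new field values, bump the version, and mark the row."""
        previous = self.rows.get(doc_id)
        version = previous["version"] + 1 if previous else 1
        self.rows[doc_id] = {"fields": fields, "version": version, "dirty": True}

    def delete(self, doc_id):
        """Leave a marked tombstone so the next commit drops the document."""
        if doc_id in self.rows:
            self.put(doc_id, None)

    def commit(self, index):
        """Reindex every marked row, clear its marker, return the ids handled."""
        reindexed = []
        for doc_id, row in self.rows.items():
            if not row["dirty"]:
                continue
            if row["fields"] is None:
                index.pop(doc_id, None)
            else:
                index[doc_id] = dict(row["fields"])
            row["dirty"] = False
            reindexed.append(doc_id)
        return reindexed


store, index = MarkedStore(), {}
store.put("doc1", {"title": "first draft"})
store.put("doc1", {"title": "second draft"})  # two updates before one commit
store.put("doc2", {"title": "other"})
print(store.commit(index))  # only the latest version of each row is indexed
print(index)
```

Several updates between commits collapse into a single reindex of the latest version, which is the property the proposal relies on.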



Re: SolPHP

2006-06-02 Thread Erik Hatcher


On Jun 2, 2006, at 3:00 PM, Michael J. Giarlo wrote:

Yonik Seeley wrote:

That's pretty good considering we've only been in the incubator a few
months, and we've only been "advertising" on the Lucene mailing list!


I bet the NINES project and Bess Sadler's blog post about it  
increased your visibility in the library world tenfold.


That's how I heard about it, at least.  I'm looking for ways to  
replace our current library website (and EAD) search, preferably  
something Lucene-based.  Currently looking at nutch and solr, and  
trying to figure out which is more relevant to what we're doing.


Ah cool!   The blog post is here    
- I did a demo of our (currently not in production) Solr-based  
faceted browser to the UVa library folks last week and they were very  
impressed.  I'm working (as a one-man development "team") as fast as  
I can to get this thing up in the next couple of weeks as the  
del.icio.us(err, simpy!)+flickr+google of 19th century literature,  
and hopefully beyond.  I'll definitely be announcing it to the list  
when it's available for general consumption.


Oddly enough I work _in_ (but not _for_) a library that is a noted  
leader in the digital library space, but most systems are using xpat  
or other archaic search solutions or view search as a pluggable  
service rather than an integral aspect.  It's pretty sad how
inaccessible the wonderfully rich world of library archives currently
is.


Erik



RE: Two Solr Announcements: CNET Product Search and DisMax

2006-06-02 Thread Chris Hostetter

: Like I said, what would be more useful is if I (or maybe yonik) can
: find some time to look up some numbers from our Performance testing to
: tell you, for a box of type  with a collection of size N, what
: the graph of X non-stop concurrent users vs average response time of Y
: looks like while snaploading is happening every M minutes.


I've added a Wiki page for "Solr Performance Data" and added some info
from testing we did at CNET a while back.  I was hoping to be able to
provide some numbers using the DisMaxRequestHandler as is in Solr -- but
I didn't have any time to run specific tests on it -- the results I posted
to that wiki page are for a modified version that does more work
-- so the response times should be the "upper bound" of what you expect.


Everyone should feel free to add whatever performance data of their own
they feel comfortable sharing to this wiki page.



-Hoss