Re: system architecture question when using solr/lucene

2007-05-21 Thread Ajanta Phatak
Thanks to both of you for your responses - Otis and Chris. We did manage 
to run some benchmarks, but we think there are some surprising results 
here. It seems that caching is not affecting performance that much. Is 
that because of the small index size?


Do these seem OK, or is there any room for improvement in any way that you 
could think of?


Regards,
Ajanta.

Results from development servers
Solr HTTP Interface

Configurations

   * Index size is approx 500M (a little more)
   * Tomcat 6.0
   * Solr (nightly build dated 2007-04-19)
   * Nginx v0.5.20 is used as the load balancer (very lightweight in size,
     functionality and CPU consumption) with round-robin distribution
     of requests.
   * Grinder v3.0-beta33 was used for testing. It allows one to write
     custom scripts (in Jython) and has a nice GUI for presenting results.
   * Server config: Intel(R) Xeon(TM) 3040 1.87GHz 1066MHz, 4GB RAM
     (system boot usage 300MB), 8GB swap
   * The query list was custom-built from the web, with some queries having
     AND/OR between terms. The territory field was always US.

Benchmarks

Threads          Servers  Total/Unique queries  Caching          Performance (queries/sec)
25               2        2500/1950             D*               500
25               2        2500/2500             D                142
40               2        4000/4000             D                100
40               2        4000/3000             D                166
40               3        4000/4000             D                133
40 (backtoback)  3        4000/4000             D                333
40               3        4000/3300             D                142
10               3        2000/2000             D                434
40               3        4000/4000             Q.Caching: 1024  158
40 (backtoback)  3        4000/4000             Q.Caching: 1024  384


Without US territory

Threads  Servers  Total/Unique queries  Caching  Performance (queries/sec)
40       3        4000/4000             D        142
40       2        4000/4000             D        100


Moving territory:US from query to Filters

Threads  Servers  Total/Unique queries  Caching           Performance (queries/sec)
40       3        4000/4000             F.Caching: 16384  133
40       3        4000/3400             F.Caching: 16384  147

   * D implies caching was disabled
   * *backtoback* implies same code was run again
   * CPU usage when server was processing query was ~40-50%
   * Tomcat shows 3% memory usage.
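
For readers following the last pair of runs: moving territory:US "from query to Filters" means sending the restriction as a separate fq parameter instead of as a clause of q, so that the matching document set can be cached and reused across different q values. The sketch below only builds the two request URLs for comparison. The host, port, sample query terms, and the buildSelectUrl helper are assumptions for illustration and are not taken from the benchmark setup.

```javascript
// Hypothetical helper: builds a Solr /select URL from a base URL and params.
function buildSelectUrl(baseUrl, params) {
  const parts = [];
  for (const [name, value] of Object.entries(params)) {
    parts.push(encodeURIComponent(name) + "=" + encodeURIComponent(value));
  }
  return baseUrl + "/select?" + parts.join("&");
}

const base = "http://localhost:8080/solr"; // assumed host and port

// Territory as part of the main query: every distinct q is scored as a whole.
const inQuery = buildSelectUrl(base, { q: "(foo AND bar) AND territory:US" });

// Territory as a filter: fq is applied separately from the scored query.
const asFilter = buildSelectUrl(base, { q: "foo AND bar", fq: "territory:US" });

console.log(inQuery);
console.log(asFilter);
```

With only one territory value in the test set, the filter is the same for every request, which is the case where a filter cache would be expected to help most.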



Otis Gospodnetic wrote:

Hi Ajanta,

I think you answered your own questions.  Either use Filters or partition the 
index.  The advantage of partitioning is that you can update them separately 
without affecting filters, cache, searcher, etc. for the other indices (i.e. no 
need to warm up with data from the other indices).  If you are indeed working 
with the high QPS, partitioning also lets you scale indices separately (are all 
territories the same size document-wise?  do they all get the same QPS?).  The 
disadvantage is that you can't easily run queries that don't depend on a 
territory.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lucene Consulting -- http://lucene-consulting.com/


----- Original Message ----
From: Ajanta <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, May 15, 2007 11:35:13 AM
Subject: system architecture question when using solr/lucene



We are currently looking at large numbers of queries/sec and would like to
optimize that as much as possible. The special need is that we would like to
show specific results based on a specific field - territory field and
depending on where in the world you're coming from we'd like to show you
specific results. The  index is very large (currently 2 million rows) and
could grow even larger (2-3 times) in the future. How do we accomplish this
given that we have some domain knowledge (the territory) to use to our
advantage? Is there a way we can hint solr/lucene to use this information to
provide better results? We could use filters on territory or we could use
different indexes for different territories (individually or in a
combination.)  Are there any other ways to do this? How do we figure out the
best case in this situation?


  


Re: system architecture question when using solr/lucene

2007-05-21 Thread Yonik Seeley

What are some typical examples of your queries (all of the params that
are sent to Solr)?
Query and Document caches typically result in small increases in performance.
The filterCache can result in large increases, depending on the queries.

Another possibility is that you may be hitting some other bottleneck,
possibly caused by synchronization... 40 threads seems kind of high
(unless they pause between requests).

-Yonik


how to use function queries

2007-05-21 Thread mike topper
I'm trying to retrieve results from Solr such that newer documents' 
scores are boosted.  The Solr wiki states that I should use a 
function query to influence the score, but I'm a little confused about 
how to use a function query.


Searching through the archives I found a suggestion of using the _val_: 
hack in the standard query handler, but when I tried that with


recip(rord(date),1,1000,1000)^2

to just test it I got an error saying

org.apache.solr.core.SolrException: undefined field recip

Can someone explain function queries a little more clearly, and whether I 
would need to use a different query handler?

-Mike
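
For context on the expression above: the Solr function query documentation defines recip(x,m,a,b) as a/(m*x + b), and rord(field) as the reverse ordinal of the indexed field value, so the largest (newest) date gets the smallest number. The sketch below is not Solr code; it only evaluates that formula in JavaScript to show how the boost would fall off as rord grows. The sample ordinals are made up. The same documentation writes the _val_ form with the function in double quotes, as in _val_:"recip(rord(date),1,1000,1000)"^2, though whether missing quotes explain the error reported here is only a guess.

```javascript
// recip(x, m, a, b) = a / (m * x + b), as described for Solr function queries.
function recip(x, m, a, b) {
  return a / (m * x + b);
}

// Assumed reverse ordinals for illustration: 1 would be the newest date value.
for (const rord of [1, 1000, 9000]) {
  console.log("rord=" + rord + " boost=" + recip(rord, 1, 1000, 1000));
}
```

With m=1 and a=b=1000, the value stays close to 1 for the newest documents and halves once rord reaches 1000.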




How to handle hl.fl form variable (any variable with a dot in its name) from javascript?

2007-05-21 Thread Teruhiko Kurosaka
I have a form that sets the hidden form variable hl.fl.
I wanted to change the highlighted field depending on the
query string that is typed, using JavaScript.
This is normally done by JavaScript code like this:
document.myform.varname.value = "whatever"
But this doesn't work for hl.fl because the name of
the variable includes a dot that JavaScript interprets
as a layer separator:
document.myform.hl.fl.value
would be interpreted as document.myform.hl's
subelement named fl.  I tried this but it didn't work:
document.myform.hl\u002Efl.value

Is there any other way to change the form variable
named hl.fl?

For future versions, may I suggest removing the
dots from this and other variables?

-kuro


Re: system architecture question when using solr/lucene

2007-05-21 Thread James liu

First, you should know your goal.

Second, you should analyze your search interface to see what fits your customers.

Third, analyze your queries (optimize Solr for the most frequently used queries).

Does "40 threads" mean you use 40 Solr instances, or does it just represent a
higher volume of user queries?



TEI indexing

2007-05-21 Thread Gary Browne
Once again, thanks for your help getting Solr up and running.

I'm wondering if anyone has any hints on how to prepare TEI documents
for indexing - I was about to write some XSLT but didn't want to
reinvent the wheel (unless it's punctured)?

Regards

Gary

Gary Browne
Development Programmer
Library IT Services
University of Sydney
Australia
ph: 61-2-9351 5946

Re: How to handle hl.fl form variable (any variable with a dot in its name) from javascript?

2007-05-21 Thread Ryan McKinley

Teruhiko Kurosaka wrote:
> I have a form that sets the hidden form variable hl.fl.
> I wanted to change the highlighted field depending on the
> query string that is typed, using JavaScript.
> This is normally done by JavaScript code like this:
> document.myform.varname.value = "whatever"
> But this doesn't work for hl.fl because the name of
> the variable includes a dot that JavaScript interprets
> as a layer separator:
> document.myform.hl.fl.value
> would be interpreted as document.myform.hl's
> subelement named fl.  I tried this but it didn't work:
> document.myform.hl\u002Efl.value
>
> Is there any other way to change the form variable
> named hl.fl?

Can't you use something like:

document.forms["FormName"].elements["Input.Name"].value

or:

document.getElementById( "myhlfield" ).value

> For future versions, may I suggest removing the
> dots from this and other variables?

The dot syntax is used in many places, and can't easily be removed...

ryan
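
A minimal sketch of the bracket-notation approach suggested above, wrapped in a small function so it can be tried outside a browser. The form name "myform" and the field name "hl.fl" come from the question; the setFormField helper and the "title" value are hypothetical, and the code assumes the form and field already exist on the page.

```javascript
// Sets a form field whose name contains a dot, e.g. "hl.fl".
// `doc` is expected to be the page's document object (or a stand-in with
// the same forms/elements shape). setFormField is a hypothetical helper.
function setFormField(doc, formName, fieldName, value) {
  // Bracket notation treats "hl.fl" as a single property name rather than
  // as property "fl" of an object named "hl".
  doc.forms[formName].elements[fieldName].value = value;
}

// In a browser this would be called as:
//   setFormField(document, "myform", "hl.fl", "title");
```

The getElementById variant works the same way but requires giving the hidden input an id attribute, since it looks elements up by id rather than by name.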