Help with tuning solr

2007-02-13 Thread Ian Meyer

All,

I'm having some performance issues with Solr. I'll give some background
on our setup and implementation. I'm completely open to reworking
everything if the way we're currently doing things isn't optimal. I'll
try to be as thorough as I can in explaining all of this, but feel free
to ask more questions if something doesn't make sense.

Firstly, we have three message boards of varying traffic, totaling
about 225K hits per day. Search is used maybe 500 times a day. Each
board has its own two instances of Solr, with Tomcat as the container,
loaded via JNDI: one instance for topics, one for the posts themselves.
I feel as though this may not be optimal, but I can't think of a better
way to handle it. After reading the schema, maybe someone will have
some better ideas. We use PHP to interface with Solr, and we do some
sorting on relevance and on date; my thought was that this could be
causing Solr to run out of memory.
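
Roughly, a sorted search from PHP looks something like the sketch
below; the host, port, webapp name, and field names here are
illustrative placeholders rather than our exact setup.

<?php
// Illustrative sketch of a date-sorted search against a posts instance.
// The URL, core name, and field names (body, date) are placeholders.
$params = array(
    'q'     => 'body:solr',   // the user's search terms
    'sort'  => 'date desc',   // sorting on a field populates the Lucene FieldCache
    'start' => 0,
    'rows'  => 20,
);
$url = 'http://localhost:8080/solr_bco_posts/select?' . http_build_query($params);

$response = file_get_contents($url);   // default XML response
if ($response === false) {
    die("search request failed\n");
}
echo $response;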

The boards are bco, vlv and wbc. I'll list the number of docs for each
below, along with how many are added per day.

bco (topics): 180,530 (~200 added daily)
bco (posts): 3,961,053 (~5,000 added daily)
vlv (topics): 3,817 (~200 added daily)
vlv (posts): 84,005 (~7,000 added daily)
wbc (topics): 29,603 (~50 added daily)
wbc (posts): 739,660 (~1,000 added daily)

Total: ~5 million docs, with ~13.5K added per day.

We add docs at :00 for bco, :20 for wbc, and :40 for vlv. We feel an
hour is a short enough delay that results aren't lagged too much. The
add process is fast, as is the commit, and I'm more than impressed with
Solr's ability to handle the load it does.

The server hardware is 4GB of memory, one dual-core 2GHz Opteron, and
RAID 10 SATA; the machine also runs PostgreSQL, PHP and Apache. I feel
that this isn't optimal either, but the cost of buying another server
to separate out either the Solr or the Postgres component is too great
right now. Most of the errors I see are the JVM running out of heap
space. The JVM is set to the default max heap size (256m, I think?). I
can't increase it too much, because Postgres needs as much memory as it
can get so the databases will still fit in memory.

My first implementation of search for these sites was with PyLucene,
and while that was fast, there was some sort of bug where docs I added
to the index wouldn't show up until I optimized the index. The optimize
ate up too much CPU and hosed the server while it ran, eventually
taking upwards of 2 hours at 99% CPU, and that's just no good. :)

When I set up Solr, I had cache warming enabled, and that also caused
the server to choke way too soon. So I turned it off, and that seemed
to hold things off for a while.

I've attached the schemas and configs to this email so you can see how
we have things set up. Every site is configured the same; just the
names are different. It's relatively simple, and I feel like the JVM
shouldn't be choking so soon, but who knows. :)

One thought we had was to run just two instances of Solr, with a
board_id field and the id field as the unique key, but I wasn't sure
whether Solr supports compound unique keys; if not, that would make
that solution moot.
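
Since Solr's uniqueKey is a single field, one workaround would be to
build the compound key ourselves at index time, e.g. concatenating the
board name and post id into one string and restricting searches per
board with a board_id field. A rough sketch (the update URL and field
names are hypothetical, not our real config):

<?php
// Sketch of collapsing all boards into one index by building a compound
// uniqueKey string by hand. The endpoint and field names are hypothetical.
function add_post($board, $post_id, $body) {
    $doc  = '<add><doc>';
    $doc .= '<field name="id">' . htmlspecialchars($board . '-' . $post_id) . '</field>';
    $doc .= '<field name="board_id">' . htmlspecialchars($board) . '</field>';
    $doc .= '<field name="body">' . htmlspecialchars($body) . '</field>';
    $doc .= '</doc></add>';

    $ch = curl_init('http://localhost:8080/solr_posts/update');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $doc);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $out = curl_exec($ch);
    curl_close($ch);
    return $out;
}

add_post('bco', 12345, 'example post body');
// Searches could then be restricted to one board with fq=board_id:bco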

Hopefully this makes sense, but if not, ask me for clarification on
whatever is unclear.

Thanks in advance for your help and suggestions!
Ian

[Attachments: the schema.xml and solrconfig.xml mentioned above were
included here, but the archive stripped the XML markup and only
scattered text values survive: the data directory /opt/db/solr/bco_posts,
a uniqueKey of "id", a default search field of "body", index settings
that appear close to the stock defaults (mergeFactor 10, maxMergeDocs
2147483647), and a healthcheck query of
qt=dismax&q=solr&start=3&fq=id:[* TO *]&fq=cat:[* TO *].]



Re: Help with tuning solr

2007-02-13 Thread Ian Meyer

On 2/13/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> Yes, sorting by fields does take up memory (the fieldcache).
> 256M is pretty small for a 5M doc index.
> If you have any more memory slots, spring for some more memory (a
> little over $100 for 1GB).


Yeah, I'll see if I can give Solr a bit more.



> Lucene also likes to have free memory left over available for OS cache
> - otherwise searches start to be limited by disk bandwidth... not a
> good thing.



> To try and lessen the memory used by the Lucene FieldCache, you might
> try lowering the mergeFactor of the index (see solrconfig.xml).  This
> will cause more merges, slowing indexing, but it will squeeze out
> deleted documents faster.  Also, try to optimize as often as possible
> (nightly?) for the same reasons.


Ah, I don't know if I mentioned it, but we're already optimizing
nightly, when impressions are at their lowest. So I will lower the
mergeFactor and re-load all of the docs to see if that helps us out. I
believe I left it high when we were tuning for the initial load of ~4M
docs, before we realized that batching them into groups of 1,000 before
doing a commit (instead of add, commit, add, commit, etc.) was a more
efficient way of doing it. As it stands, loading ~600 docs takes about
2 seconds, so if it ends up taking 15 seconds, I won't complain. :)
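
For the record, the batching looks roughly like the sketch below; the
update URL, field names, and sample rows are simplified placeholders
rather than our actual indexing script.

<?php
// Simplified sketch of batched indexing: build one <add> block per chunk
// of up to 1,000 docs, POST each block, then issue a single commit at the
// end of the run (plus an <optimize/> once nightly) instead of committing
// after every document. The URL, field names, and data are placeholders.

function post_xml($url, $xml) {
    // POST an XML message (add/commit/optimize) to Solr's update handler.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

$update_url = 'http://localhost:8080/solr_bco_posts/update';

// In the real script these rows come from Postgres since the last run.
$new_posts = array(
    array('id' => 1, 'body' => 'first example post'),
    array('id' => 2, 'body' => 'second example post'),
);

foreach (array_chunk($new_posts, 1000) as $chunk) {
    $xml = '<add>';
    foreach ($chunk as $post) {
        $xml .= '<doc>'
              . '<field name="id">' . htmlspecialchars($post['id']) . '</field>'
              . '<field name="body">' . htmlspecialchars($post['body']) . '</field>'
              . '</doc>';
    }
    $xml .= '</add>';
    post_xml($update_url, $xml);
}

post_xml($update_url, '<commit/>');     // one commit per hourly run
// post_xml($update_url, '<optimize/>'); // run separately, once nightly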

Thanks for the tips.

- Ian



> -Yonik




Re: Solr logo poll

2007-04-06 Thread Ian Meyer

A.

On 4/6/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> Quick poll...  Solr 2.1 release planning is underway, and a new logo
> may be a part of that.
> What "form" of logo do you prefer, A or B?  There may be further
> tweaks to these pictures, but I'd like to get a sense of what the user
> community likes.
>
> A) http://issues.apache.org/jira/secure/attachment/12349897/logo-solr-d.jpg
>
> B) http://issues.apache.org/jira/secure/attachment/12353535/12353535_solr-nick.gif
>
> Just respond to this thread with your preference.
>
> -Yonik