Re: duplicate entries being returned, possible caching issue?

2008-02-03 Thread Yonik Seeley
I would guess you are seeing a view of the index after adding some
documents but before the duplicates have been removed.  Are you using
Solr's replication scripts?

-Yonik

On Feb 1, 2008 6:01 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> We have just started seeing an intermittent problem in our production
> Solr instances, where the same document is returned twice in one
> request.  Most of the content of the response consists of duplicates.
> It's not consistent: it happens maybe 1/3 of the time; the rest of
> the time, one result document is returned per actual Solr document.
>
> We recently made some changes to our caching strategy, basically to
> increase the values across the board.  This is the only change to our
> Solr instance for quite some time.
>
> Our production system consists of the following:
>
> * 'write', a Solr server used as the master index, optimized for
> writes.  All three application servers use it.
> * 'read1' & 'read2', Solr servers optimized for reads, which sync
> from the master every 20 minutes.  These two sit behind a Pound load
> balancer; two application servers use them for searching.
> * 'read3', a Solr server identical to read1 & read2, but which is not
> load balanced and is used by only one application server.
>
> Does anyone have any ideas on how to start debugging this?  What
> information should I be looking for that could shed some light on it?
>
> Thanks for any advice,
> Rachel
>


Field Compression

2008-02-03 Thread Stu Hood
I just finished watching this talk about a column-store RDBMS, which has a long 
section on column compression. Specifically, it talks about the gains from 
compressing similar data together, and how lazily decompressing data only when 
it must be processed is great for memory/CPU cache usage.

http://youtube.com/watch?v=yrLd-3lnZ58

While interesting, it's not relevant to Lucene's stored field storage. On the
other hand, it did get me thinking about stored field compression and lazy 
field loading.

Can anyone give me some pointers about compressThreshold values that would be 
worth experimenting with? Our stored fields are often between 20 and 300 
characters, and we're willing to spend more time indexing if it will make 
searching less I/O bound.

Thanks,

Stu Hood
Architecture Software Developer
Mailtrust, a Rackspace Company



Re: Field Compression

2008-02-03 Thread Mike Klaas


On 3-Feb-08, at 1:34 PM, Stu Hood wrote:

> I just finished watching this talk about a column-store RDBMS,
> which has a long section on column compression. Specifically, it
> talks about the gains from compressing similar data together, and
> how lazily decompressing data only when it must be processed is
> great for memory/CPU cache usage.
>
> http://youtube.com/watch?v=yrLd-3lnZ58
>
> While interesting, it's not relevant to Lucene's stored field
> storage. On the other hand, it did get me thinking about stored
> field compression and lazy field loading.
>
> Can anyone give me some pointers about compressThreshold values
> that would be worth experimenting with? Our stored fields are often
> between 20 and 300 characters, and we're willing to spend more time
> indexing if it will make searching less I/O bound.


Field compression can save space and converts the field into a binary
field, which is lazy-loaded more efficiently than a string field.  As for
the threshold, I use 200 on a multi-kilobyte field, but that doesn't mean
it isn't effective on smaller fields.  Experimentation on small indices,
followed by calculating the average stored bytes/doc, is usually fruitful.
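
In case it helps, this is roughly the shape of the config I mean.  The
names are made up and I'm going from memory on where each attribute
lives, so check it against the example schema.xml for your Solr version:

  <!-- only compress stored values longer than compressThreshold chars -->
  <fieldType name="cstring" class="solr.StrField" compressThreshold="200"/>

  <!-- store this field compressed; indexed terms are unaffected -->
  <field name="body" type="cstring" indexed="true" stored="true"
         compressed="true"/>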


Of course, the best way to improve performance in this regard is to
store the less-frequently-used fields in a parallel Solr index.  This
only works if the largest fields are the rarely-used ones, though
(e.g. document contents that are only retrieved to build a summary).
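
Either way, lazy loading only pays off if it is switched on in
solrconfig.xml (assuming your version has the flag) and your queries
ask for a narrow field list via fl.  Again from memory, with example
field names:

  <!-- solrconfig.xml, inside the <query> section -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>

and then request something like fl=id,title,score rather than fl=*.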


-Mike