We have maybe 1.2 KB per record, and a large index works fine.

You want more RAM and, more to the point, fast disk I/O for reading:
striped/mirrored, the more spindles the better.

Giant indexes fall down on sorting. Any sort builds an array with one
integer for each record in the index for that field; at 4 bytes apiece,
that's 400 MB for 100M records. If the field is not an integer, Lucene also
builds a separate array holding that field's values. We created a duplicate
integer field for each value we wanted to sort on. You'd think a date would
be handled as an integer, since it fits in 32 bits, but no, it gets its own
separate array of 32-bit date values.
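As a sketch of that duplicate-field trick (the dt_sort field, its contents,
and the use of the plain "integer" type from the example schema.xml are my
assumptions, not our actual definitions):

    <!-- the original date field, kept for display and range queries -->
    <field name="dt" type="date" indexed="true" stored="true"/>
    <!-- hypothetical companion field: the same instant, written by the
         indexing code as seconds-since-epoch, used only for sorting -->
    <field name="dt_sort" type="integer" indexed="true" stored="false"/>

Sorting on dt_sort then costs one int per document in the sort cache,
instead of a separate array of date values.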

Also, liberal use of filters really helps. A filter is a semi-permanent,
cached set of results that you can AND against each search.
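In Solr this means the fq parameter: each filter query is cached as a bitset
in the filterCache and intersected with the main query. A hypothetical
request (instance and domain come from Al's schema below; the host and the
values are invented):

    http://localhost:8983/solr/select?q=textbody:error&fq=instance:prod&fq=domain:example.com

The q part changes with every search, while the two fq clauses hit the
cache on repeated use.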

Lance

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, December 28, 2007 5:31 PM
To: solr-user@lucene.apache.org
Subject: Re: sizing/sanity check for huge(?) dataset

Hi Al,

165GB of disk space and 100M docs is big, but not impossibly huge.

The only thing that looks a bit worrisome is the index size : RAM ratio.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Alok Dhir <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, December 28, 2007 3:40:22 PM
Subject: Re: sizing/sanity check for huge(?) dataset

Missed one bit of data: this dataset will be searched fewer than 500 times
per day. The goal is to get results in a reasonable amount of time (<3s),
but incoming queries will likely max out around 10 per minute.

On Dec 28, 2007, at 3:36 PM, Alok Dhir wrote:

> Hey all -- first, thanks to the solr & lucene teams for fantastic
> products.  So far we're very pleased with the results we're seeing
> from them.  We're looking at it as the primary search solution for a
> rather large dataset.  Hoping for comments/a sanity check from people
> who "know".
>
> Looking at deploying solr to search around 100M docs, totalling
> around 165GB of space.  Would this be considered "huge"?  It seems so
> given the posts I've read on the list.  In any case...
>
> Schema currently looks as follows (using type definitions from 
> "example" schema.xml):
>
> <field name="instance" type="string" indexed="true" stored="true"  
> required="true" />
> <field name="instance_id" type="string" indexed="true" stored="true"
  
> required="true" />
> <field name="id" type="string" indexed="true" stored="true"  
> required="true" />
> <field name="label" type="text" indexed="true" stored="true"  
> required="true" />
> <field name="textbody" type="text" indexed="true" stored="true" /> 
> <field name="domain" type="string" indexed="true" stored="true"
> multiValued="true" omitNorms="true"/>
> <field name="subdomain" type="string" indexed="true" stored="true"  
> multiValued="true" omitNorms="true"/>
> <field name="category" type="string" indexed="true" stored="true"  
> multiValued="true" omitNorms="true"/>
> <field name="dt" type="date" indexed="true" stored="true"  
> multiValued="false"/>
> <field name="timestamp" type="date" indexed="true" stored="true"  
> default="NOW" multiValued="false"/>
> <field name="class" type="string" multiValued="true" indexed="true"
 />
> <field name="class_id" type="string" multiValued="true"  
> indexed="true" />
> <field name="tags" type="string" indexed="true" stored="true"  
> multiValued="true"/>
> <field name="level" type="sint" indexed="true" stored="true"  
> default="0"/>
> <field name="user" type="string" indexed="true" stored="true"/>
>
> Note, this is a completely unoptimized schema -- knocked out quickly
> for a proof of concept.  Many fields here will be used for faceting.
>
> The only field in this schema which will be more than a line or so
> (call it 80 bytes) is the 'textbody' field, which could be up to a few
> KB -- call it an average of 1K or less.  All "id/class" fields will be
> 32B or less.
>
> The 100M docs @ 165GB is a projection from having indexed 1/500th of
> the intended dataset.  It will not vary by more than 25% and will not
> grow over time (we will be removing entries older than X days as part
> of ongoing maintenance).
>
> The servers we're spec'ing are 8 core, 8 gig machines, with a SAN for 
> storage.  The servers will be load balanced for performance and 
> availability (i.e. if one box is dead, searches don't stop -- they 
> just slow down a bit).  Indexing will occur incrementally, as 
> transactions occur in a related set of applications.  There will 
> rarely be a need for a focused "indexing" process after the initial 
> app rollout.
>
> Thanks for any comments or suggestions.
>
> Al
>