Hi Al,

165GB of disk space and 100M docs is big, but not impossibly huge.
The only thing that looks a bit worrisome is the index size : RAM ratio.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Alok Dhir <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, December 28, 2007 3:40:22 PM
Subject: Re: sizing/sanity check for huge(?) dataset

Missed one bit of data:

This dataset will be searched fewer than 500 times per day. The goal is to get results in a reasonable amount of time (<3s), but the queries coming in per minute will likely max out around 10.

On Dec 28, 2007, at 3:36 PM, Alok Dhir wrote:

> Hey all -- first, thanks to the solr & lucene teams for fantastic
> products. So far we're very pleased with the results we're seeing
> from them. We're looking at it as the primary search solution for a
> rather large dataset. Hoping for comments and a sanity check from
> people who "know".
>
> Looking at deploying solr to search around 100M docs, totalling
> around 165G of space. Would this be considered "huge"? It seems so
> given the posts I've read on the list. In any case...
>
> Schema currently looks as follows (using type definitions from
> "example" schema.xml):
>
> <field name="instance" type="string" indexed="true" stored="true" required="true" />
> <field name="instance_id" type="string" indexed="true" stored="true" required="true" />
> <field name="id" type="string" indexed="true" stored="true" required="true" />
> <field name="label" type="text" indexed="true" stored="true" required="true" />
> <field name="textbody" type="text" indexed="true" stored="true" />
> <field name="domain" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
> <field name="subdomain" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
> <field name="category" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
> <field name="dt" type="date" indexed="true" stored="true" multiValued="false"/>
> <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
> <field name="class" type="string" multiValued="true" indexed="true" />
> <field name="class_id" type="string" multiValued="true" indexed="true" />
> <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
> <field name="level" type="sint" indexed="true" stored="true" default="0"/>
> <field name="user" type="string" indexed="true" stored="true"/>
>
> Note, this is a completely unoptimized schema -- knocked out quick
> for a proof of concept. Many fields here will be used for faceting.
>
> The only field in this schema which will be more than a line or so
> (call it 80 bytes) is the 'textbody' field which could be up to a
> few KB - call it an average of 1K or less. All "id/class" fields
> will be 32B or less.
>
> The 100M docs @ 165GB is a projection from having indexed 1/500th of
> the intended dataset.
> It will not vary by more than 25% and will
> not grow over time (we will be removing entries older than X days as
> part of ongoing maintenance).
>
> The servers we're spec'ing are 8 core, 8 gig machines, with a SAN
> for storage. The servers will be load balanced for performance and
> availability (i.e. if one box is dead, searches don't stop -- they
> just slow down a bit). Indexing will occur incrementally, as
> transactions occur in a related set of applications. There will
> rarely be a need for a focused "indexing" process after the initial
> app rollout.
>
> Thanks for any comments or suggestions.
>
> Al
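
For anyone who wants to redo the arithmetic behind the numbers in this thread, here is a rough back-of-envelope sketch in Python. It only restates the figures quoted above (100M docs, 165GB, 8GB of RAM per box, a 1/500th sample, a 25% margin, 500 queries/day). The decimal GB unit and the `sizing()` helper name are my own assumptions, not anything from the original posts, and the index:RAM figure ignores how much of the 8GB would actually go to the JVM heap versus the OS cache.

```python
# Back-of-envelope sizing from the figures quoted in the thread.
# Assumption: decimal units (1 GB = 10**9 bytes); the thread does not specify.
GB = 10**9

def sizing(total_docs=100_000_000, index_gb=165, ram_gb=8,
           sample_denominator=500, margin_pct=25, queries_per_day=500):
    """Return derived sizing figures (hypothetical helper, for illustration)."""
    index_bytes = index_gb * GB
    return {
        # Average on-disk index footprint per document
        "avg_bytes_per_doc": index_bytes / total_docs,
        # How many times larger the index is than one server's RAM
        "index_to_ram_ratio": index_bytes / (ram_gb * GB),
        # Size of the 1/500th sample the projection was based on
        "sample_docs": total_docs // sample_denominator,
        "sample_index_gb": index_gb / sample_denominator,
        # Projected index size range given the stated +/- 25% margin
        "index_gb_low": index_gb * (100 - margin_pct) / 100,
        "index_gb_high": index_gb * (100 + margin_pct) / 100,
        # Average query rate if 500 searches are spread evenly over a day
        "avg_queries_per_min": queries_per_day / (24 * 60),
    }

if __name__ == "__main__":
    for name, value in sizing().items():
        print(f"{name}: {value:.3f}")
```

The index:RAM ratio is the number Otis flags: with these inputs only a small fraction of the index can sit in memory on any one box, so query latency would likely depend heavily on SAN read performance.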