I wondered about that myself. What is the rule of thumb there --
would 16GB be comfortable?
Thanks for the response.
Al
On Dec 28, 2007, at 8:30 PM, Otis Gospodnetic wrote:
Hi Al,
165GB of disk space and 100M docs is big, but not impossibly huge.
The only thing that looks a bit worrisome is the index size : RAM
ratio.
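As a rough back-of-the-envelope check, using the 165GB index and
8GB-per-box figures from the message below:

  index size : RAM  =  165 GB : 8 GB, or roughly 20 : 1

so only a small fraction of the index can be cached in RAM on any one
machine at a time.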
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Alok Dhir <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, December 28, 2007 3:40:22 PM
Subject: Re: sizing/sanity check for huge(?) dataset
Missed one bit of data: This dataset will be searched less than 500
times per day. The goal is to get results in a reasonable amount of
time (<3s), but the queries coming in per minute will likely max out
around 10.
On Dec 28, 2007, at 3:36 PM, Alok Dhir wrote:
Hey all -- first, thanks to the Solr & Lucene teams for fantastic
products. So far we're very pleased with the results we're seeing
from them. We're looking at Solr as the primary search solution for a
rather large dataset, and hoping for comments and a sanity check from
people who "know".
Looking at deploying Solr to search around 100M docs, totalling
around 165GB of space. Would this be considered "huge"? It seems so
given the posts I've read on the list. In any case...
Schema currently looks as follows (using type definitions from
"example" schema.xml):
<field name="instance" type="string" indexed="true" stored="true"
required="true" />
<field name="instance_id" type="string" indexed="true" stored="true"
required="true" />
<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="label" type="text" indexed="true" stored="true"
required="true" />
<field name="textbody" type="text" indexed="true" stored="true" />
<field name="domain" type="string" indexed="true" stored="true"
multiValued="true" omitNorms="true"/>
<field name="subdomain" type="string" indexed="true" stored="true"
multiValued="true" omitNorms="true"/>
<field name="category" type="string" indexed="true" stored="true"
multiValued="true" omitNorms="true"/>
<field name="dt" type="date" indexed="true" stored="true"
multiValued="false"/>
<field name="timestamp" type="date" indexed="true" stored="true"
default="NOW" multiValued="false"/>
<field name="class" type="string" multiValued="true" indexed="true"
/>
<field name="class_id" type="string" multiValued="true"
indexed="true" />
<field name="tags" type="string" indexed="true" stored="true"
multiValued="true"/>
<field name="level" type="sint" indexed="true" stored="true"
default="0"/>
<field name="user" type="string" indexed="true" stored="true"/>
Note, this is a completely unoptimized schema -- knocked out quickly
for a proof of concept. Many fields here will be used for faceting.
The only field in this schema which will be more than a line or so
(call it 80 bytes) is the 'textbody' field, which could be up to a
few KB -- call it an average of 1K or less. All "id/class" fields
will be 32B or less.
The 100M docs @ 165GB is a projection from having indexed 1/500th of
the intended dataset. It will not vary by more than 25% and will
not grow over time (we will be removing entries older than X days as
part of ongoing maintenance).
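The cleanup itself would likely be a periodic delete-by-query on the
'dt' field posted to the update handler, roughly like this (30 days
below just stands in for the real X, and this assumes Solr's date
math syntax on date fields):

  <delete><query>dt:[* TO NOW-30DAYS]</query></delete>
  <commit/>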
The servers we're spec'ing are 8-core, 8GB machines, with a SAN
for storage. The servers will be load balanced for performance and
availability (i.e. if one box is dead, searches don't stop -- they
just slow down a bit). Indexing will occur incrementally, as
transactions occur in a related set of applications. There will
rarely be a need for a focused "indexing" process after the initial
app rollout.
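For what it's worth, those incremental updates would just be ordinary
XML posts to the update handler as the source applications commit,
roughly like the following (field values here are placeholders, and
only a few of the schema fields are shown):

  <add>
    <doc>
      <field name="id">doc-00001</field>
      <field name="instance">prod</field>
      <field name="label">example label</field>
      <field name="textbody">... body text ...</field>
      <field name="dt">2007-12-28T15:36:00Z</field>
    </doc>
  </add>
  <commit/>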
Thanks for any comments or suggestions.
Al