Re: sizing/sanity check for huge(?) dataset

Alok Dhir Fri, 28 Dec 2007 12:40:52 -0800

Missed one bit of data: This dataset will be searched less than 500 times per day. The goal is to get results in a reasonable amount of time (<3s), but the queries coming in per minute will likely max out around 10.
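As a rough sanity check on those load figures, here is a minimal Python sketch. It treats 500 searches/day and 3 s as upper bounds taken from the message, and it applies Little's law (in-flight requests = arrival rate x time in system), which is an added assumption and not part of the original post. The function names are hypothetical.

```python
# Load figures from the message; both are stated as upper bounds.
SEARCHES_PER_DAY = 500          # "less than 500 times per day"
PEAK_QUERIES_PER_MINUTE = 10    # "will likely max out around 10"
MAX_LATENCY_SECONDS = 3.0       # target: results in under 3 s


def average_queries_per_minute(searches_per_day: float) -> float:
    """Average arrival rate if searches were spread evenly over 24 hours."""
    return searches_per_day / (24 * 60)


def concurrent_queries(queries_per_minute: float, latency_seconds: float) -> float:
    """Little's law: mean in-flight queries = arrival rate * time per query."""
    return (queries_per_minute / 60.0) * latency_seconds


if __name__ == "__main__":
    avg = average_queries_per_minute(SEARCHES_PER_DAY)
    peak = concurrent_queries(PEAK_QUERIES_PER_MINUTE, MAX_LATENCY_SECONDS)
    print(f"average rate: {avg:.2f} queries/minute")
    print(f"mean in-flight queries at peak: {peak:.2f}")
```

Under these assumptions, even at the peak rate fewer than one query is in flight on average, so per-query latency matters far more here than throughput.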


On Dec 28, 2007, at 3:36 PM, Alok Dhir wrote:

Hey all -- first, thanks to the solr & lucene teams for fantastic products. So far we're very pleased with the results we're seeing from them. We're looking at it as the primary search solution for a rather large dataset. Hoping for comments and a sanity check from people who "know".
We're looking at deploying solr to search around 100M docs, totalling around 165G of space. Would this be considered "huge"? It seems so, given the posts I've read on the list. In any case...
Schema currently looks as follows (using type definitions from the "example" schema.xml):
<field name="instance" type="string" indexed="true" stored="true" required="true" />
<field name="instance_id" type="string" indexed="true" stored="true" required="true" />
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="label" type="text" indexed="true" stored="true" required="true" />
<field name="textbody" type="text" indexed="true" stored="true" />
<field name="domain" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="subdomain" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="category" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="dt" type="date" indexed="true" stored="true" multiValued="false"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="class" type="string" multiValued="true" indexed="true" />
<field name="class_id" type="string" multiValued="true" indexed="true" />
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="level" type="sint" indexed="true" stored="true" default="0"/>
<field name="user" type="string" indexed="true" stored="true"/>
Note, this is a completely unoptimized schema -- knocked out quick for a proof of concept. Many fields here will be used for faceting.
The only field in this schema which will be more than a line or so (call it 80 bytes) is the 'textbody' field, which could be up to a few KB - call it an average of 1K or less. All "id/class" fields will be 32B or less.
The 100M docs @ 165GB is a projection from having indexed 1/500th of the intended dataset. It will not vary by more than 25% and will not grow over time (we will be removing entries older than X days as part of ongoing maintenance).
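The arithmetic behind that projection can be sketched in Python as follows. This is an illustration only: it assumes "165GB" means decimal gigabytes (10^9 bytes) and that index size scales linearly with document count, which is what extrapolating from a 1/500th sample implies. The function names are hypothetical.

```python
# Figures from the message.
TOTAL_DOCS = 100_000_000
PROJECTED_INDEX_GB = 165
SAMPLE_DENOMINATOR = 500      # the sample was 1/500th of the dataset
MAX_VARIATION = 0.25          # "will not vary by more than 25%"

BYTES_PER_GB = 10**9          # assumption: decimal gigabytes


def bytes_per_doc(index_gb: float, docs: int) -> float:
    """Average on-disk index footprint per document, in bytes."""
    return index_gb * BYTES_PER_GB / docs


def sample_size(docs: int, index_gb: float, denominator: int) -> tuple[int, float]:
    """Document count and index size (GB) of the 1/denominator sample."""
    return docs // denominator, index_gb / denominator


def size_range(index_gb: float, variation: float) -> tuple[float, float]:
    """Lower and upper bounds on index size (GB) for the stated variation."""
    return index_gb * (1 - variation), index_gb * (1 + variation)


if __name__ == "__main__":
    print(bytes_per_doc(PROJECTED_INDEX_GB, TOTAL_DOCS))
    print(sample_size(TOTAL_DOCS, PROJECTED_INDEX_GB, SAMPLE_DENOMINATOR))
    print(size_range(PROJECTED_INDEX_GB, MAX_VARIATION))
```

Under these assumptions the sample was about 200,000 docs in roughly 0.33 GB, the average footprint is about 1,650 bytes per doc, and the final index would land between roughly 124 GB and 206 GB.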
The servers we're spec'ing are 8 core, 8 gig machines, with a SAN for storage. The servers will be load balanced for performance and availability (i.e. if one box is dead, searches don't stop -- they just slow down a bit). Indexing will occur incrementally, as transactions occur in a related set of applications. There will rarely be a need for a focused "indexing" process after the initial app rollout.
Thanks for any comments or suggestions.

Al
