Missed one bit of data: This dataset will be searched less than 500 times per day. The goal is to get results in a reasonable amount of time (<3s), but the queries coming in per minute will likely max out around 10.
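For a rough sense of what those numbers mean as load, here is a back-of-the-envelope sketch. It assumes searches are spread over a 24-hour day and that every query takes the full 3s target; the function name is purely illustrative.

```python
def load_estimate(searches_per_day, peak_per_minute, target_latency_s):
    """Turn the stated search volume into average and peak load figures."""
    # Average rate if searches were spread evenly over a 24-hour day (assumption).
    avg_per_minute = searches_per_day / (24 * 60)
    # Peak arrival rate in queries per second.
    peak_per_second = peak_per_minute / 60
    # Little's law: queries in flight = arrival rate * time each query takes.
    concurrent_at_peak = peak_per_second * target_latency_s
    return avg_per_minute, peak_per_second, concurrent_at_peak


avg_pm, peak_ps, in_flight = load_estimate(500, 10, 3.0)
print(f"average: {avg_pm:.2f} queries/min")
print(f"peak:    {peak_ps:.2f} queries/sec")
print(f"in flight at peak (worst case): {in_flight:.2f}")
```

Even at peak, with every query taking the full 3 seconds, that works out to less than one query in flight on average.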

On Dec 28, 2007, at 3:36 PM, Alok Dhir wrote:

Hey all -- first, thanks to the Solr & Lucene teams for fantastic products. So far we're very pleased with the results we're seeing from them. We're looking at Solr as the primary search solution for a rather large dataset. Hoping for comments and a sanity check from people who "know".

We're looking at deploying Solr to search around 100M docs, totalling around 165G of space. Would this be considered "huge"? It seems so, given the posts I've read on the list. In any case...

Schema currently looks as follows (using type definitions from "example" schema.xml):

<field name="instance" type="string" indexed="true" stored="true" required="true" />
<field name="instance_id" type="string" indexed="true" stored="true" required="true" />
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="label" type="text" indexed="true" stored="true" required="true" />
<field name="textbody" type="text" indexed="true" stored="true" />
<field name="domain" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="subdomain" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="category" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="dt" type="date" indexed="true" stored="true" multiValued="false"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="class" type="string" multiValued="true" indexed="true" />
<field name="class_id" type="string" multiValued="true" indexed="true" />
<field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="level" type="sint" indexed="true" stored="true" default="0"/>
<field name="user" type="string" indexed="true" stored="true"/>

Note, this is a completely unoptimized schema -- knocked out quick for a proof of concept. Many fields here will be used for faceting.
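To make the faceting use concrete, here is a minimal sketch of the kind of request we have in mind. It only builds the query string using Solr's standard `facet` and `facet.field` parameters; the choice of facet fields, the search term, and the helper name are illustrative assumptions, not our final design.

```python
from urllib.parse import urlencode


def build_facet_query(q, facet_fields, rows=10):
    """Build a Solr query string that asks for facet counts on several fields."""
    params = [("q", q), ("rows", rows), ("facet", "true")]
    # facet.field is repeated once per field to facet on.
    params += [("facet.field", name) for name in facet_fields]
    return urlencode(params)


# Hypothetical example: full-text search on 'textbody', faceting on three
# of the multiValued string fields from the schema above.
query_string = build_facet_query("textbody:timeout", ["domain", "category", "tags"])
print(query_string)
```

The resulting string would be appended to the select handler URL of whichever Solr instance serves the index.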

The only field in this schema which will be more than a line or so (call it 80 bytes) is the 'textbody' field which could be up to a few KB - call it an average of 1K or less. All "id/class" fields will be 32B or less.

The 100M docs @ 165GB is a projection from having indexed 1/500th of the intended dataset. It will not vary by more than 25% and will not grow over time (we will be removing entries older than X days as part of ongoing maintenance).
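The arithmetic behind that projection can be sketched as follows. It assumes decimal gigabytes (10^9 bytes) and treats the 25% as a symmetric margin around the projected size; the implied sample figures are derived from the projection, not measured values.

```python
PROJECTED_DOCS = 100_000_000   # projected document count
PROJECTED_GB = 165             # projected index size, decimal GB (assumption)
SAMPLE_FRACTION = 1 / 500      # share of the dataset indexed so far
MARGIN = 0.25                  # stated maximum variation


def projection_summary(docs, size_gb, fraction, margin):
    """Derive per-document size, implied sample size, and the size range."""
    bytes_per_doc = size_gb * 1e9 / docs
    sample_docs = docs * fraction
    sample_gb = size_gb * fraction
    low_gb = size_gb * (1 - margin)
    high_gb = size_gb * (1 + margin)
    return bytes_per_doc, sample_docs, sample_gb, low_gb, high_gb


bpd, s_docs, s_gb, low, high = projection_summary(
    PROJECTED_DOCS, PROJECTED_GB, SAMPLE_FRACTION, MARGIN
)
print(f"index bytes per doc: {bpd:.0f}")
print(f"implied sample: {s_docs:.0f} docs, {s_gb:.2f} GB")
print(f"projected range: {low:.2f} to {high:.2f} GB")
```

That is roughly 1.65 KB of index per document, with a worst case of a little over 200 GB.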

The servers we're spec'ing are 8 core, 8 gig machines, with a SAN for storage. The servers will be load balanced for performance and availability (i.e. if one box is dead, searches don't stop -- they just slow down a bit). Indexing will occur incrementally, as transactions occur in a related set of applications. There will rarely be a need for a focused "indexing" process after the initial app rollout.

Thanks for any comments or suggestions.

Al

