Hi,

To add to what Erik wrote - keep in mind that you can compress data before 
indexing/storing it in Solr. So, assuming those PDFs are not already compressed 
under the hood, even if you store your fields for highlighting or other 
purposes, the resulting index may still be smaller than the raw PDFs if you 
compress the stored fields.
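
For example, here is a rough (untested) SolrJ-style sketch of that idea: gzip 
the extracted text on the client side, store the compressed bytes in a separate 
stored-only field, and index the plain text for searching.  The URL, the field 
names, and the binary field in schema.xml are just assumptions for illustration:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompressedFieldIndexer {

    // Gzip the extracted text so the stored copy is smaller than the raw content.
    static byte[] gzip(String text) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Core URL and field names are made up for this example.
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();

        String extractedText = "...text extracted from the PDF/Word/Excel file...";

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        // Indexed for searching but not stored, so it adds relatively little to index size.
        doc.addField("content", extractedText);
        // Stored but not indexed; assumes a binary field (solr.BinaryField) in schema.xml.
        doc.addField("content_gz", gzip(extractedText));

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}

The trade-off is that your application has to gunzip content_gz itself whenever 
it needs the original text back.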

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


>________________________________
>From: Erik Hatcher <erik.hatc...@gmail.com>
>To: solr-user@lucene.apache.org
>Sent: Tuesday, October 11, 2011 9:49 AM
>Subject: Re: capacity planning
>
>Travis -
>
>Whether the index is bigger than the original content depends on what you need 
>to do with it in Solr.  One of the primary deciding factors is whether you need 
>highlighting, which currently requires that the fields to be highlighted be 
>stored.  Stored fields will take up about the same space as the original text 
>(likely a bit smaller than, say, the actual Word document itself, since only 
>the extracted text is stored).  If you don't need highlighting or the contents 
>stored for other purposes, then you'll have a dramatically smaller index than 
>the original content (roughly 35% of its size, generally).
>
>    Erik
>
>
>On Oct 11, 2011, at 08:36 , Travis Low wrote:
>
>> Greetings.  I have a paltry 23,000 database records that point to a
>> voluminous 300GB worth of PDF, Word, Excel, and other documents.  We are
>> planning on indexing the records and the documents they point to.  I have no
>> clue on how we can calculate what kind of server we need for this.  I
>> imagine the index isn't going to be bigger than the documents (is it?) so I
>> suppose 1TB is a starting point for disk space.  But what kind of processing
>> power and memory might we need?  Can anyone please point me in the right
>> direction?
>
>
>
>
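
To make Erik's highlighting point above concrete, here is the flip side of the 
sketch earlier in this mail: a minimal (untested) SolrJ query with highlighting 
turned on.  For this to do anything, the highlighted field has to be 
stored="true" in schema.xml, and that stored copy is where most of the extra 
disk space goes (core and field names are again made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightQueryExample {

    public static void main(String[] args) throws Exception {
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();

        SolrQuery q = new SolrQuery("capacity planning");
        q.setHighlight(true);
        // Highlighting only works if "content" is stored="true" in schema.xml -
        // that stored copy is what pushes the index size toward the raw text size.
        q.addHighlightField("content");

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getHighlighting());
        solr.close();
    }
}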
