On Tue, 2011-10-11 at 14:36 +0200, Travis Low wrote:
> Greetings.  I have a paltry 23,000 database records that point to a
> voluminous 300GB worth of PDF, Word, Excel, and other documents.  We are
> planning on indexing the records and the documents they point to.  I have no
> clue on how we can calculate what kind of server we need for this.  I
> imagine the index isn't going to be bigger than the documents (is it?)

Sanity check: Let's say your average document is 200 pages, each holding
1000 words of 5 characters. That gives you 200 * 1000 * 5 * 23,000 ~=
21GB of raw text, which is a far cry from the 300GB.
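
For reference, the back-of-the-envelope calculation as a tiny script
(the page, word and character counts are of course just my guesses):

  # Rough estimate of raw text size, using assumed averages
  docs = 23_000           # database records / documents
  pages_per_doc = 200     # assumed average
  words_per_page = 1_000  # assumed average
  chars_per_word = 5      # assumed average, ignoring separators

  raw_bytes = docs * pages_per_doc * words_per_page * chars_per_word
  print(raw_bytes / 2**30)  # ~21.4 GiB of raw text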

Either your documents are extremely text-heavy or they contain
illustrations and other elements that will not be indexed. Is it
possible for you to estimate the number of characters in your corpus?
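
One cheap way to do that is to sample a handful of documents, extract
their plain text and extrapolate. A rough sketch, assuming pdftotext
(from poppler-utils) is available and the sample is representative;
substitute whatever extractor you use for the non-PDF formats:

  import random, subprocess, sys
  from pathlib import Path

  # Sample some PDFs, extract their text with pdftotext and
  # extrapolate the character count to the whole corpus.
  pdfs = list(Path(sys.argv[1]).rglob("*.pdf"))
  sample = random.sample(pdfs, min(50, len(pdfs)))

  chars = 0
  for pdf in sample:
      text = subprocess.run(["pdftotext", str(pdf), "-"],
                            capture_output=True, text=True).stdout
      chars += len(text)

  estimate = chars / len(sample) * len(pdfs)
  print(f"Estimated characters in corpus: {estimate:,.0f}")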

>  But what kind of processing power and memory might we need?

I am not well-versed in Tika and the other PDF/Word/etc. analysis
frameworks, so I'll just focus on the search part here. Guessing wildly,
you're aiming for a low number of running updates, or even just a nightly
batch update, with response times below 200 ms and at most 2 to 4
concurrent searches.

Bold claim: Assuming that your corpus is closer to 20GB of raw text than
to 300GB, you'll get by just fine with an i7 machine with 8GB of RAM, a
1TB 7200 RPM drive for storage and a 256GB consumer SSD for the search
index. That is more or less what we use for our 10M documents / 60GB+
index, with a load as described above.

I've always been wary of having to dictate hardware up front for such
projects. It is a lot easier and cheaper to just build the software,
then measure and buy hardware after that.

Regards,
Toke Eskildsen
