On Tue, 2011-10-11 at 14:36 +0200, Travis Low wrote:
> Greetings. I have a paltry 23,000 database records that point to a
> voluminous 300GB worth of PDF, Word, Excel, and other documents. We are
> planning on indexing the records and the documents they point to. I have
> no clue on how we can calculate what kind of server we need for this. I
> imagine the index isn't going to be bigger than the documents (is it?)
Sanity check: Let's say your average document is 200 pages with 1000 words
of 5 characters each. That gives you 200 * 1000 * 5 * 23,000 ~= 21GB of raw
text, which is a far cry from the 300GB. Either your documents are extremely
text heavy or they contain illustrations and other elements that are not to
be indexed. Is it possible for you to estimate the number of characters in
your corpus?

> But what kind of processing power and memory might we need?

I am not well-versed in Tika and the other PDF/Word/etc analyzing
frameworks, so I'll just focus on the search part here. Guessing wildly,
you're aiming for a low number of running updates, or even just a nightly
batch update. Response times should be below 200 ms and the number of
concurrent searches is 2 to 4 at most.

Bold claim: Assuming that your corpus is closer to 20GB of raw text than
300GB, you'll get by just fine with an i7 machine with 8GB of RAM, a 1TB
7200 RPM drive for storage and a 256GB consumer SSD for search. That is
more or less what we use for our 10M documents / 60GB+ index, with a load
as described above.

I've always been wary of having to dictate hardware up front for such
projects. It is a lot easier and cheaper to just build the software, then
measure and buy hardware after that.

Regards,
Toke Eskildsen
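P.S. For what it's worth, here is a quick sketch in Python of the
back-of-the-envelope estimate above. The per-document figures are the same
guesses as in the text, not measurements, so swap in real numbers once you
can count actual characters in the corpus:

    # Back-of-the-envelope corpus size estimate (all figures are guesses).
    DOC_COUNT = 23000        # database records, each pointing to a document
    PAGES_PER_DOC = 200      # assumed average
    WORDS_PER_PAGE = 1000    # assumed average
    CHARS_PER_WORD = 5       # assumed average

    raw_text_bytes = DOC_COUNT * PAGES_PER_DOC * WORDS_PER_PAGE * CHARS_PER_WORD
    print("Estimated raw text: %.1f GiB" % (raw_text_bytes / 2.0**30))
    # -> roughly 21 GiB, far below the 300GB of files on disk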