On 19-Aug-08, at 12:58 PM, Phillip Farber wrote:

So your experience differs from Mike's. Obviously it's an important decision as to whether to buy more machines. Can you (or Mike) weigh in on what factors led to your different take on local shards vs. shards distributed across machines?

I do both; the only reason I have two shards on each machine is to squeeze maximum performance out of an equipment budget. Err on the side of multiple machines.

At least for building the index, the number of shards really does
help. Indexing Medline (1.6e7 docs, about 60GB of XML text) on a
single machine starts at about 100 docs/s but slows to 10 docs/s as
the index grows. The limit seems to be reached once you run out of
RAM; after that, indexing gets slower roughly linearly as the index
gets larger.
My sweet spot was 5 machines with 8GB RAM for indexing about 60GB of
data.
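
To put those rates in perspective, here is a rough sketch of the arithmetic, assuming the whole run stayed at one rate or the other. The real run sits somewhere in between, since the rate drops as the index grows. The function name is made up for illustration; the numbers are the ones quoted above.

```python
# Bounds on single-machine indexing time for the Medline figures above.
DOCS = 1.6e7        # document count quoted in the message
FAST_RATE = 100.0   # docs/s at the start of indexing
SLOW_RATE = 10.0    # docs/s once the index has grown

def indexing_hours(docs, docs_per_second):
    """Hours needed to index `docs` documents at a constant rate."""
    return docs / docs_per_second / 3600.0

best_case_hours = indexing_hours(DOCS, FAST_RATE)   # whole run at the fast rate
worst_case_hours = indexing_hours(DOCS, SLOW_RATE)  # whole run at the slow rate

print(f"best case:  {best_case_hours:.1f} hours")
print(f"worst case: {worst_case_hours / 24:.1f} days")
```

That works out to somewhere between roughly two days and two and a half weeks on one machine, which is why splitting the index across shards pays off.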

Can you say what the specs were for these machines? Given that I have more like 1TB of data over 1M docs, how do you think my machine requirements might be affected as compared to yours?

You are in a much better position to determine this than we are. See how big an index you can put on a single machine while maintaining acceptable performance using a typical query load. It's relatively safe to extrapolate linearly from that.
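
A minimal sketch of that linear extrapolation, for what it's worth. The function name is hypothetical, and the per-machine figure is only a placeholder taken from the 60GB over 5 machines quoted earlier in the thread; it should be replaced with whatever you measure on your own hardware and query load.

```python
import math

def machines_needed(total_data_gb, max_gb_per_machine):
    """Linear extrapolation: total data divided by what one machine handles."""
    if max_gb_per_machine <= 0:
        raise ValueError("max_gb_per_machine must be positive")
    return math.ceil(total_data_gb / max_gb_per_machine)

# Placeholder only: 60 GB over 5 machines, as quoted earlier in the thread.
# Replace with the size you measure at acceptable performance.
measured_gb_per_machine = 60 / 5

# "More like 1TB" of data, taken here as 1000 GB (an assumption).
total_gb = 1000

estimate = machines_needed(total_gb, measured_gb_per_machine)
print(f"estimated machines: {estimate}")
```

The estimate is only as good as the single-machine measurement, and document size and query mix will shift it, so treat the result as a starting point for budgeting rather than a spec.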

-Mike
