On 19-Aug-08, at 12:58 PM, Phillip Farber wrote:
So your experience differs from Mike's. Obviously it's an important
decision as to whether to buy more machines. Can you (or Mike)
weigh in on what factors led to your different take on local shards
vs. shards distributed across machines?
I do both; the only reason I have two shards on each machine is to
squeeze maximum performance out of an equipment budget. Err on the
side of multiple machines.
At least for building the index, the number of shards really does
help. Indexing Medline (1.6e7 docs, about 60GB of XML text) on a
single machine starts at about 100 docs/s but slows to 10 docs/s as
the index grows. The limit seems to be reached once you run out of
RAM; after that, indexing gets slower in a roughly linear fashion
the larger the index gets.
My sweet spot was 5 machines with 8GB RAM for indexing about 60GB of
data.
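
As a rough illustration of what the Medline figures above mean for a
single machine, the sketch below bounds the total indexing time using
the two quoted rates. The function name is made up for this example,
and treating each rate as constant over the whole run is a
simplifying assumption; the real run falls somewhere between the two.

```python
# Rough bounds on single-machine indexing time, using the figures
# quoted in the message above. Constant rates are an assumption.
DOCS = 1.6e7        # documents in Medline (from the message)
FAST_RATE = 100.0   # docs/s at the start of indexing
SLOW_RATE = 10.0    # docs/s once the index has outgrown RAM


def indexing_hours(docs, docs_per_second):
    """Hours needed to index `docs` documents at a constant rate."""
    return docs / docs_per_second / 3600.0


best_case = indexing_hours(DOCS, FAST_RATE)
worst_case = indexing_hours(DOCS, SLOW_RATE)

print(f"best case:  {best_case:.1f} hours")
print(f"worst case: {worst_case:.1f} hours ({worst_case / 24:.1f} days)")
```

That works out to roughly 44 hours if the initial rate held, versus
about 18.5 days at the degraded rate, which is why keeping each
shard's index within RAM matters so much here.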
Can you say what the specs were for these machines? Given that I
have more like 1TB of data over 1M docs how do you think my machine
requirements might be affected as compared to yours?
You are in a much better position to determine this than we are. See
how big an index you can put on a single machine while maintaining
acceptable performance using a typical query load. It's relatively
safe to extrapolate linearly from that.
-Mike
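
The linear extrapolation suggested above can be sketched as a quick
back-of-the-envelope calculation. Everything here is hypothetical:
the function name is made up, the 100GB per-machine figure is a
placeholder for whatever you measure under your own query load, and
1000GB simply stands in for "about 1TB" (index size will generally
differ from raw data size).

```python
import math


def machines_needed(total_index_gb, max_gb_per_machine):
    """Estimate machine count by linear extrapolation.

    max_gb_per_machine is the largest index a single machine served
    with acceptable performance in your own tests (a measured value,
    not something this sketch can supply).
    """
    if total_index_gb <= 0 or max_gb_per_machine <= 0:
        raise ValueError("sizes must be positive")
    # Round up: a partially filled machine is still a machine.
    return math.ceil(total_index_gb / max_gb_per_machine)


# Placeholder numbers only; substitute your own measurements.
estimate = machines_needed(total_index_gb=1000, max_gb_per_machine=100)
print(f"estimated machines: {estimate}")
```

Treat the result as a starting point and add headroom for index
growth and peak query load.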