Travis Low [t...@4centurion.com] wrote:
> Toke, thanks. Comments embedded (hope that's okay):
Inline or top-posting? Long discussion, but for mailing lists I clearly
prefer the former.

[Toke: Estimate characters]

> Yes. We estimate each of the 23K DB records has 600 pages of text for
> the combined documents, 300 words per page, 5 characters per word.
> Which coincidentally works out to about 21GB, so good guessing there. :)

Heh. Lucky guess indeed, although the factors were off. For the record:
23,000 records * 600 pages * 300 words * 5 characters is about 20.7
billion characters, or roughly 21GB of raw text. Anyway, 21GB does not
sound scary at all.

> The way it works is we have researchers modifying the DB records during
> the day, and they may upload documents at that time. We estimate 50-60
> uploads throughout the day. If possible, we'd like to index them as
> they are uploaded, but if that would negatively affect the search, then
> we can rebuild the index nightly.
>
> Which is better?

The analyzing part is CPU-only and you're running multi-core, so as long
as you analyze in a single thread you're safe there. That leaves I/O:
even on spinning drives, a daily load of just 60 updates of ~1MB of
extracted text each shouldn't have any real effect - with the usual
caveat that large merges should be avoided, either by optimizing at
night or by tweaking the merge policy to avoid large segments (see the
sketches in the postscripts below). With such a relatively small index,
(re)opening and warm-up should be painless too.

Summary: 300GB is a fair amount of data and takes some power to crunch.
However, at the Solr/Lucene end, your index size and your update rate
are nothing to worry about. The usual caveat about advanced use and all
that applies.

[Toke: i7, 8GB, 1TB spinning, 256GB SSD]

> We have a very beefy VM server that we will use for benchmarking, but
> your specs provide a starting point. Thanks very much for that.

I have little experience with VM servers for search. Although we use a
lot of virtual machines, we use dedicated machines for our searchers,
primarily to ensure low latency for I/O. VMs might be fine for search
too, but we haven't tried it yet.

Glad to be of help,
Toke
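
PS: If you do index on upload, letting Solr fold each add into a delayed
commit keeps the 50-60 daily uploads from causing commit churn. A minimal
SolrJ sketch, assuming a reasonably recent SolrJ client - the URL, core
name, field names and the 60-second window are all placeholders, not
something from your setup:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UploadIndexer {
  public static void main(String[] args) throws Exception {
    // Hypothetical core URL and field names - adjust to your schema.
    SolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/documents").build();

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "record-42");
    doc.addField("text", "...extracted text from the uploaded document...");

    // commitWithin: let Solr fold this add into a commit within 60s,
    // instead of forcing an immediate commit for every single upload.
    solr.add(doc, 60_000);
    solr.close();
  }
}

With commitWithin, several uploads arriving close together share one
commit, which is gentler on I/O than an explicit commit per document.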
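PS 2: By "tweaking the merge policy" I mean something along these lines
at the Lucene level - again only a sketch, and the 1GB segment cap is a
number I picked out of the air, not a recommendation:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicyExample {
  static IndexWriterConfig buildConfig() {
    // Cap merged segment size so routine daytime updates never trigger
    // a monster merge; do the big forceMerge at night if you need it.
    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setMaxMergedSegmentMB(1024);  // assumption: 1GB ceiling, tune to taste

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(mp);
    return iwc;
  }
}

Solr exposes the same TieredMergePolicy knobs in solrconfig.xml, so you
can get the same effect without touching Java code.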