Re. "I have little experience with VM servers for search." We had huge performance penalty on VMs, CPU was bottleneck. We couldn't freely run measurements to figure out what the problem really was (hosting was contracted by customer...), but it was something pretty scary, kind of 8-10 times slower than advertised dedicated equivalent. Whatever its worth, if you can afford it, keep lucene away from it. Lucene is highly optimized machine, and someone twiddling with context switches is not welcome there.
Of course, if you are I/O bound, it makes no big difference anyhow. This is just my singular experience; it might be that the hosting team did not configure it right, or that something has changed in the meantime (the experience is about 4 years old), but we burnt our fingers so badly that I still remember it.

On Tue, Oct 11, 2011 at 7:49 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> Travis Low [t...@4centurion.com] wrote:
> > Toke, thanks. Comments embedded (hope that's okay):
>
> Inline or top-posting? Long discussion, but for mailing lists I clearly
> prefer the former.
>
> [Toke: Estimate characters]
>
> > Yes. We estimate each of the 23K DB records has 600 pages of text for
> > the combined documents, 300 words per page, 5 characters per word.
> > Which coincidentally works out to about 21GB, so good guessing there. :)
>
> Heh. Lucky guess indeed, although the factors were off. Anyway, 21GB does
> not sound scary at all.
>
> > The way it works is we have researchers modifying the DB records during
> > the day, and they may upload documents at that time. We estimate 50-60
> > uploads throughout the day. If possible, we'd like to index them as they
> > are uploaded, but if that would negatively affect the search, then we
> > can rebuild the index nightly.
> >
> > Which is better?
>
> The analyzing part is only CPU and you're running multi-core, so as long
> as you only analyze using one thread you're safe there. That leaves us
> with I/O: even for spinning drives, a daily load of just 60 updates of 1MB
> of extracted text each shouldn't have any real effect - with the usual
> caveat that large merges should be avoided by either optimizing at night
> or tweaking the merge policy to avoid large segments. With such a
> relatively small index, (re)opening and warm up should be painless too.
>
> Summary: 300GB is a fair amount of data and takes some power to crunch.
> However, on the Solr/Lucene end your index size and your update rates are
> nothing to worry about. The usual caveat for advanced use and all that
> applies.
>
> [Toke: i7, 8GB, 1TB spinning, 256GB SSD]
>
> > We have a very beefy VM server that we will use for benchmarking, but
> > your specs provide a starting point. Thanks very much for that.
>
> I have little experience with VM servers for search. Although we use a lot
> of virtual machines, we use dedicated machines for our searchers,
> primarily to ensure low latency for I/O. They might be fine for that too,
> but we haven't tried it yet.
>
> Glad to be of help,
> Toke
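P.S. On the merge-policy tweak Toke mentions: if you are working at the Lucene level, it can be set directly on the IndexWriter. This is only a minimal sketch, assuming a Lucene 3.x-era API; the index path and the segment size cap are illustrative placeholders, not values from this thread:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergePolicySketch {
    public static void main(String[] args) throws Exception {
        // Cap the size of merged segments so the handful of daily updates
        // never triggers a huge, I/O-heavy merge during working hours.
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        mergePolicy.setMaxMergedSegmentMB(512.0); // illustrative cap, tune for your hardware

        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
        config.setMergePolicy(mergePolicy);

        // "/path/to/index" is a placeholder path.
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")), config);

        // ... addDocument() calls for the ~60 daily uploads would go here ...

        writer.close();
    }
}

The nightly cleanup Toke suggests as the alternative would then just be an optimize run in a maintenance window, outside working hours.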