Travis Low [t...@4centurion.com] wrote:
> Toke, thanks.  Comments embedded (hope that's okay):

Inline or top-posting? Long discussion, but for mailing lists I clearly prefer 
the former.

[Toke: Estimate characters]

> Yes.  We estimate each of the 23K DB records has 600 pages of text for the
> combined documents, 300 words per page, 5 characters per word.  Which
> coincidentally works out to about 21GB, so good guessing there. :)

Heh. A lucky guess indeed, although the factors were off. Anyway, 21GB does
not sound scary at all.
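
For reference, the estimate is easy to reproduce. This is only a
back-of-envelope sketch in Python: the function name is made up for
illustration, and it assumes one byte per character and ignores whitespace
and index overhead.

```python
# Rough size of the extracted text, using the figures quoted above.
# Assumption: 1 byte per character; whitespace and index overhead ignored.

def estimated_text_bytes(records=23_000, pages_per_record=600,
                         words_per_page=300, chars_per_word=5):
    """Return the estimated number of bytes of extracted text."""
    return records * pages_per_record * words_per_page * chars_per_word

total = estimated_text_bytes()
print(f"{total / 1e9:.1f} GB")  # prints "20.7 GB", i.e. about 21GB
```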

> The way it works is we have researchers modifying the DB records during the
> day, and they may upload documents at that time.  We estimate 50-60 uploads
> throughout the day.  If possible, we'd like to index them as they are
> uploaded, but if that would negatively affect the search, then we can
> rebuild the index nightly.
>
> Which is better?

The analysis part is CPU-only and you're running multi-core, so as long as
you analyze using only one thread you're safe there. That leaves I/O: even
for spinning drives, a daily load of just 60 updates of about 1MB of
extracted text each shouldn't have any real effect, with the usual caveat
that large merges should be avoided, either by optimizing at night or by
tweaking the merge policy to avoid large segments. With such a relatively
small index, (re)opening and warm-up should be painless too.
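
To put the update load in perspective, here is a rough sketch in Python.
The helper name is hypothetical, and it reuses your per-record figures
(600 pages, 300 words per page, 5 characters per word) with one byte per
character assumed.

```python
# Rough daily volume of newly extracted text from uploads.
# Assumption: each upload is one full record's worth of text
# (600 pages * 300 words * 5 chars, 1 byte per character).

def daily_update_bytes(uploads_per_day=60, pages=600, words=300, chars=5):
    """Return the estimated bytes of extracted text added per day."""
    return uploads_per_day * pages * words * chars

daily = daily_update_bytes()
index_text = 23_000 * 600 * 300 * 5  # total extracted text, same assumptions
print(f"{daily / 1e6:.0f} MB per day")
print(f"{100 * daily / index_text:.2f}% of the total text")
```

Even at the high end of 60 uploads, that is a small fraction of the total
text per day.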

Summary: 300GB is a fair amount of data and takes some power to crunch.
However, on the Solr/Lucene end, your index size and your update rate are
nothing to worry about. The usual caveat for advanced use applies.

[Toke: i7, 8GB, 1TB spinning, 256GB SSD]

> We have a very beefy VM server that we will use for benchmarking, but your
> specs provide a starting point.  Thanks very much for that.

I have little experience with VM servers for search. Although we use a lot of
virtual machines, we use dedicated machines for our searchers, primarily to
ensure low latency for I/O. Virtual machines might be fine for search too,
but we haven't tried it yet.

Glad to be of help,
Toke
