Hello Gil, I'm wondering if you've been in touch with the HathiTrust people, because I imagine your use cases are somewhat similar.
They've done some blogging around getting digitized texts indexed at scale, which is what I assume you're doing: http://www.hathitrust.org/blogs/Large-scale-Search

Michael Della Bitta
Applications Developer
o: +1 646 532 3062 | c: +1 917 477 7906

appinions inc.
“The Science of Influence Marketing”
18 East 41st Street
New York, NY 10017
t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>
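As a rough illustration of the heap vs. OS page-cache split Gil describes below (one shard per Solr-dedicated core, a fixed Tomcat heap per shard, and whatever RAM is left going to the OS for MMapDirectory), here is a minimal sketch in Java. The figures are assumptions lifted from the thread, not a validated sizing:

/**
 * Minimal sketch of the per-server memory split described in Gil's
 * message below: a fixed JVM heap per Solr shard (running in Tomcat),
 * with the remainder of RAM left to the OS for MMapDirectory page
 * caching. All figures are assumptions taken loosely from the thread.
 */
public class SolrMemoryBudget {

    /** RAM left for the OS page cache once every shard heap is allocated. */
    static long pageCacheHeadroomGb(long totalRamGb, int shards, long heapPerShardGb) {
        return totalRamGb - shards * heapPerShardGb;
    }

    public static void main(String[] args) {
        long totalRamGb = 256;     // hoped-for RAM per physical server
        int shardsPerServer = 28;  // one shard per Solr-dedicated core
        long heapPerShardGb = 5;   // Tomcat/JVM heap per shard

        long heapTotalGb = shardsPerServer * heapPerShardGb;
        long headroomGb = pageCacheHeadroomGb(totalRamGb, shardsPerServer, heapPerShardGb);

        // The thread itself quotes ~126GB free; the exact split depends on
        // how many cores (and how much heap overhead) the OS is given.
        System.out.printf("Shard heaps: %d GB, left for OS/MMap page cache: %d GB%n",
                heapTotalGb, headroomGb);
    }
}

Either way, the page cache on each box covers only a small slice of a multi-terabyte index, which is where the SSD suggestion further down the thread comes in.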
On Thu, Dec 12, 2013 at 5:10 AM, Hoggarth, Gil <gil.hogga...@bl.uk> wrote:

> Thanks for this - I haven't any previous experience with utilising SSDs
> in the way you suggest, so I guess I need to start learning! And thanks
> for the Danish-webscale URL, looks like very informed reading. (Yes, I
> think we're working in similar industries with similar constraints and
> expectations.)
>
> Compiling my answers into one email: "Curious how many documents per
> shard you were planning? The number of documents per shard and field
> type will drive the amount of RAM needed to sort and facet."
> - Number of documents per shard, I think about 200 million. That's a
> bit of a rough estimate based on other Solrs we run, though, which I
> think means we hold a lot of data for each document - though I keep
> arguing to keep this to the truly required minimum. We also have many
> facets, some of which are pretty large. (I'm stretching my
> understanding here, but I think most documents have many 'entries' in
> many facets, so these really hit us performance-wise.)
>
> I try to keep a 1-to-1 ratio of Solr nodes to CPUs, with a few spare
> for the operating system, and I utilise MMapDirectory to manage memory
> via the OS. So at this moment I'm guessing that we'll have 56
> Solr-dedicated CPUs across 2 physical 32-CPU servers and _hopefully_
> 256GB RAM on each. This would give 28 shards, each with 5GB Java
> memory (in Tomcat), leaving 126GB on each server for the OS and MMap.
> (I believe the Solr theory for this doesn't work out exactly, but we
> can accept the edge cases where it will fail.)
>
> I can also see that our hardware requirements will depend on usage as
> well as the volume of data, and I've been pondering how best we can
> structure our index/es to facilitate a long-term service (which means
> that, given it's a lot of data, I need to structure the data so that
> new usage doesn't require re-indexing). But at this early stage, as
> people say, we need to prototype, test, profile etc., and to do that I
> need the hardware to run the trials (policy dictates that I buy the
> production hardware now, before profiling - I get to control much of
> the design and construction, so I don't argue with this!)
>
> Thanks for all the comments everyone, all very much appreciated :)
> Gil
>
>
> -----Original Message-----
> From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
> Sent: 11 December 2013 12:02
> To: solr-user@lucene.apache.org
> Subject: Re: Solr hardware memory question
>
> On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> > We're probably going to be building a Solr service to handle a
> > dataset of ~60TB, which for our data and schema typically gives a
> > Solr index size of 1/10th - i.e., 6TB. Given the general rule that
> > the amount of hardware memory should exceed the size of the Solr
> > index (exceed, to also allow for the operating system etc.), how
> > have people handled this situation?
>
> By acknowledging that it is cheaper to buy SSDs than to try to
> compensate for slow spinning drives with excessive amounts of RAM.
>
> Our plan for an estimated 20TB of indexes out of 372TB of raw web data
> is to use SSDs controlled by a single machine with 512GB of RAM (or
> was it 256GB? I'll have to ask the hardware guys):
> https://sbdevel.wordpress.com/2013/12/06/danish-webscale/
>
> As always YMMV, and the numbers you quote elsewhere indicate that your
> queries are quite complex. You might want to do a bit of profiling to
> see if they are heavy enough to make the CPU the bottleneck.
>
> Regards,
> Toke Eskildsen, State and University Library, Denmark
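To make Toke's point concrete, here is a back-of-envelope sketch (again treating the thread's figures as assumptions) of how much of a ~6TB index the leftover RAM could actually cache - and hence why SSDs, rather than ever more RAM, are the cheaper lever:

/**
 * Back-of-envelope contrast between the "RAM should exceed index size"
 * rule of thumb and the SSD-plus-partial-page-cache approach suggested
 * above. All figures are assumptions taken from the thread.
 */
public class IndexCacheCoverage {

    public static void main(String[] args) {
        double rawDataTb = 60.0;        // raw dataset size
        double indexToRawRatio = 0.10;  // "index is ~1/10th of the data"
        double indexTb = rawDataTb * indexToRawRatio;

        int servers = 2;
        double ramPerServerGb = 256.0;
        double heapPerServerGb = 140.0; // 28 shards x 5GB, as in the earlier sketch
        double cachePerServerGb = ramPerServerGb - heapPerServerGb;

        double totalCacheTb = servers * cachePerServerGb / 1024.0;
        double cachedFraction = totalCacheTb / indexTb;

        System.out.printf("Estimated index size: %.1f TB%n", indexTb);
        System.out.printf("OS page cache available: %.2f TB (~%.0f%% of the index)%n",
                totalCacheTb, cachedFraction * 100);
    }
}

At a few percent of cache coverage, cold queries will hit the drives constantly, so the latency of the underlying storage, rather than the absolute amount of RAM, is likely to dominate.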