Hello, Gil,

I'm wondering if you've been in touch with the Hathi Trust people, because
I imagine your use cases are somewhat similar.

They've done some blogging around getting digitized texts indexed at scale,
which is what I assume you're doing:

http://www.hathitrust.org/blogs/Large-scale-Search

Michael Della Bitta
Applications Developer

o: +1 646 532 3062 | c: +1 917 477 7906

appinions inc.
“The Science of Influence Marketing”
18 East 41st Street
New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, Dec 12, 2013 at 5:10 AM, Hoggarth, Gil <gil.hogga...@bl.uk> wrote:

> Thanks for this - I haven't any previous experience with utilising SSDs in
> the way you suggest, so I guess I need to start learning! And thanks for
> the Danish-webscale URL, looks like very informed reading. (Yes, I think
> we're working in similar industries with similar constraints and
> expectations).
>
> Compiling my answers into one email: "Curious how many documents per
> shard you were planning? The number of documents per shard and field type
> will drive the amount of RAM needed to sort and facet."
> - Number of documents per shard: I think about 200 million. That's a bit
> of a rough estimate based on other Solrs we run, though. Which I think means
> we hold a lot of data for each document, though I keep arguing to keep this
> to the truly required minimum. We also have many facets, some of which are
> pretty large. (I'm stretching my understanding here, but I think most
> documents have many 'entries' in many facets, so these really hit us
> performance-wise.)
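>
> As a rough back-of-envelope sketch of why the faceting hurts (a sketch
> only - the field count and entry sizes below are illustrative assumptions,
> not measurements from our schema):
>
>     # Rough per-shard facet memory estimate (assumed numbers throughout).
>     docs_per_shard = 200_000_000      # the estimate above
>     facet_fields = 10                 # assumed number of faceted fields
>     entries_per_doc = 5               # assumed multi-valued entries per field
>     bytes_per_entry = 4               # assumed ~4-byte ordinal per entry
>
>     ram = docs_per_shard * facet_fields * entries_per_doc * bytes_per_entry
>     print("%.0f GB" % (ram / 2**30))  # ~37 GB just for facet ordinals
>
>     # Which is why keeping facet fields as docValues (held on disk and
>     # cached by the OS, not on the Java heap) matters at this scale.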
>
> I try to keep a 1-to-1 ratio of Solr nodes to CPUs, with a few spare for
> the operating system, and I utilise MMapDirectory to manage memory via the
> OS. So at this moment I'm guessing that we'll have 56 Solr-dedicated CPUs
> across 2 physical 32-CPU servers and _hopefully_ 256GB RAM on each. This
> would give 28 shards, each with 5GB of Java memory (in Tomcat), leaving
> 116GB on each server for the OS and MMap. (I believe the Solr theory for
> this doesn't work out exactly, but we can accept the edge cases where
> this will fail.)
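>
> A quick sketch of that layout arithmetic (assuming 4 of the 32 CPUs per
> server are the spares for the OS, and that the 28 shards sit one per
> Solr CPU on each server):
>
>     # Back-of-envelope memory layout per physical server.
>     cpus_per_server = 32
>     os_spare = 4                           # assumed CPUs reserved for the OS
>     shards = cpus_per_server - os_spare    # 28 Solr nodes, one per CPU
>     heap_gb_per_shard = 5
>     ram_gb = 256
>
>     heap_total = shards * heap_gb_per_shard   # 140 GB of Java heap
>     page_cache = ram_gb - heap_total          # 116 GB left for OS + MMap
>     print(shards, heap_total, page_cache)     # 28 140 116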
>
> I can also see that our hardware requirements will depend on usage as well
> as the volume of data, and I've been pondering how best to structure our
> index/es to support a long-term service (which means that, given it's a
> lot of data, I need to structure the data so that new usage doesn't
> require re-indexing). But at this early stage, as people say, we need to
> prototype, test, profile etc., and to do that I need the hardware to run
> the trials (policy dictates that I buy the production hardware now, before
> profiling - I get to control much of the design and construction, so I
> don't argue with this!)
>
> Thanks for all the comments everyone, all very much appreciated :)
> Gil
>
>
> -----Original Message-----
> From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
> Sent: 11 December 2013 12:02
> To: solr-user@lucene.apache.org
> Subject: Re: Solr hardware memory question
>
> On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> > We're probably going to be building a Solr service to handle a dataset
> > of ~60TB, which for our data and schema typically gives a Solr index
> > size of 1/10th - i.e., 6TB. Given the general rule that the amount of
> > hardware memory should exceed the size of the Solr index (exceed, to
> > also allow for the operating system etc.), how have people handled
> > this situation?
>
> By acknowledging that it is cheaper to buy SSDs than to compensate for
> slow spinning drives with excessive amounts of RAM.
>
> Our plan for an estimated 20TB of indexes out of 372TB of raw web data is
> to use SSDs controlled by a single machine with 512GB of RAM (or was it
> 256GB? I'll have to ask the hardware guys):
> https://sbdevel.wordpress.com/2013/12/06/danish-webscale/
>
> As always YMMV, and the numbers you quote elsewhere indicate that your
> queries are quite complex. You might want to do a bit of profiling to see
> if they are heavy enough to make the CPU the bottleneck.
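>
> A minimal way to check (a sketch - the collection name and query below are
> hypothetical, pointed at whatever test shard you have handy):
>
>     # Time a representative faceting query repeatedly while watching CPU
>     # and disk (top/iostat) in another terminal.
>     import time, urllib.request
>
>     url = ("http://localhost:8983/solr/collection1/select"
>            "?q=body:example&facet=true&facet.field=domain&rows=10&wt=json")
>
>     for i in range(20):
>         start = time.time()
>         urllib.request.urlopen(url).read()
>         print("query %d: %.3f s" % (i, time.time() - start))
>
>     # High latency with idle disks and pegged cores means the queries are
>     # CPU-bound, and faster storage alone will not help.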
>
> Regards,
> Toke Eskildsen, State and University Library, Denmark
