Hi Toke,

That is spectacular, really great to hear that you have already indexed 2.7TB+ of data and that query response times are still in the range of milliseconds to a few seconds for such a huge dataset.

Could you share which indexing mechanism you are using? I started with EmbeddedSolrServer, but it became quite slow after indexing a few tens of GB (~30+); I started indexing a week ago and it is still only at 37GB. I assume an HTTP-POST-based approach would also be sluggish because of network latency and waiting on each response.
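In case it is useful, this is roughly the batched approach I have in mind to reduce that per-request overhead, using ConcurrentUpdateSolrServer so adds are buffered and sent on background threads instead of one blocking HTTP POST per document. Just a sketch: the URL, batch size, queue size and thread count below are placeholders, not values from a real setup.

import java.util.ArrayList;
import java.util.Collection;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL, queue size (1000) and thread count (4): adjust to the actual setup.
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/mycollection", 1000, 4);

        Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("ID", Integer.toString(i));
            doc.addField("A0_s", "282628854");
            batch.add(doc);

            // Hand over documents in batches instead of one request per document.
            if (batch.size() == 1000) {
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.blockUntilFinished(); // wait for the background threads to drain the queue
        server.commit();             // one commit at the end, not per document
        server.shutdown();
    }
}

With this setup the client does not wait for a response after every document, and there is a single commit at the end rather than one per add.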
Furthermore, I also tried CloudSolrServer, but I am facing a weird exception saying "ClassCastException Cannot cast to Exception" while adding the SolrInputDocument to the server:

CloudSolrServer server1 = new CloudSolrServer("zkHost:port1,zkHost:port2,zkHost:port3", false);
server1.setDefaultCollection("mycollection");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("ID", "123");
doc.addField("A0_s", "282628854");
server1.add(doc); // Error at this line
server1.commit();

Thanks again, Toke, for sharing those stats.

On Fri, Jun 6, 2014 at 5:04 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> On Fri, 2014-06-06 at 12:32 +0200, Vineet Mishra wrote:
> > Does that mean for querying smoothly we need to have memory at least equal
> > or greater to the size of the index?
>
> If you absolutely, positively have to reduce latency as much as
> possible, then yes. With an estimated index size of 2TB, I would guess
> that 10-20 machines with powerful CPUs (1 per shard per expected
> concurrent request) would also be advisable. While you're at it, do make
> sure that you're using high-speed memory.
>
> That was not a serious suggestion, should you be in doubt. Very few
> people need the best latency possible. Most just need the individual
> searches to be "fast enough" and want to scale throughput instead.
>
> > As in my case the index size will be very heavy (~2TB) and practically
> > speaking that amount of memory is not possible. Even if it goes to
> > multiple shards, say around 10 shards, then also 200GB of RAM will not
> > be a feasible option.
>
> We're building a projected 24TB index collection and are currently at
> 2.7TB+, growing with about 1TB/10 days. Our current plan is to use a
> single machine with 256GB of RAM, but we will of course adjust along the
> way if it proves to be too small.
>
> Requirements differ with the corpus and the needs, but for us, SSDs as
> storage seem to provide quite enough of a punch. I did a little testing
> yesterday: https://plus.google.com/u/0/+TokeEskildsen/posts/4yPvzrQo8A7
>
> tl;dr: for small result sets (< 1M hits) on unwarmed searches with
> simple queries, response time is below 100ms. If we enable faceting with
> plain Solr, this jumps to about 1 second.
>
> I did a top on the machine and it says that 50GB is currently used for
> caching, so an 80GB (and probably less) machine would work fine for our
> 2.7TB index.
>
> - Toke Eskildsen, State and University Library, Denmark