Sam,

These are big numbers you are throwing around, especially the query volume. 
How big are these records that you have 4 billion of -- or put another way,
how much space would it take up in a pure form like in CSV?  And should I
assume the searches you are doing are more than geospatial?  In any case, a
Solr solution here is going to involve many machines.  The biggest number
you propose is 10k queries per second which is hard to imagine.

I've seen some say Solr 4 might have 100M records per shard, although there
is a good deal of variability -- as usual, YMMV.  But let's go with that for
this paper-napkin calculation.  You would need 40 shards of 100M documents
each to get to 4000M (4B) documents.  That is a lot of shards, but people
have done it, I believe.  This scales out to your document collection but
not up to your query volume, which is extremely high.  I have some old
benchmarks suggesting ~10ms spatial queries with SOLR-2155,
which was rolled into the spatial code in Lucene 4 (Solr adapters are on the
way).  But to account for full query overhead and for a safer estimate, let's
say 50ms.  So at 50ms per query you might get perhaps 20 queries per second (which
seems high but we'll go with it).  But you require 10k/sec(!) so this means
you need 500 times the 20qps which means 500 *times* the base hardware to
support the 40 shards I mentioned before.  In other words, the 4B documents
need to be replicated 500 times to support 10k/second queries.  So
theoretically, we're talking 500 clusters, each cluster having 40 shards --
at ~4 shards/machine this is 10 machines per cluster: 5,000 machines in
total.  Wow.  Doesn't seem realistic.  If you have a reference to some
system, or to someone's experience with any system, Solr or not, that can
handle this, then please share.
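
For anyone who wants to plug in their own numbers, here is the napkin math
above as a small Python sketch.  The function and parameter names are mine,
purely for illustration; the inputs are the guesses from this message (100M
docs/shard, ~4 shards/machine, 50ms/query), not measurements.

```python
import math

def napkin_estimate(total_docs, docs_per_shard, shards_per_machine,
                    query_latency_ms, target_qps):
    """Rough machine count, assuming one cluster answers one query at a time."""
    # Shards needed to hold one full copy of the data.
    shards = math.ceil(total_docs / docs_per_shard)
    # Machines needed to host one full copy (one "cluster").
    machines_per_cluster = math.ceil(shards / shards_per_machine)
    # Each cluster is assumed to sustain 1000 / latency_ms queries per second,
    # so the number of full copies needed is target_qps * latency_ms / 1000.
    clusters = math.ceil(target_qps * query_latency_ms / 1000)
    return shards, machines_per_cluster, clusters, clusters * machines_per_cluster

shards, per_cluster, clusters, total = napkin_estimate(
    total_docs=4_000_000_000, docs_per_shard=100_000_000,
    shards_per_machine=4, query_latency_ms=50, target_qps=10_000)
print(shards, per_cluster, clusters, total)  # 40 10 500 5000
```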

If you or anyone were to attempt to see if Solr scales for their needs, a
good approach is to consider just one shard non-replicated, or even better a
handful that would all exist on one machine.  Optimize it as much as you
can.  Then see how much data you can put on this machine and with what
query-volume.  From this point, it's basic math to see how many more such
machines are required to scale out to your data size and up to your query
volume.
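
That "basic math" might look like the sketch below.  It assumes (my
simplification) that every query fans out to all machines holding a copy of
the data, so one full copy sustains roughly the query rate you measured on
the single test machine.  The names and the example figures are hypothetical
placeholders for your own benchmark results.

```python
import math

def extrapolate(measured_docs_per_machine, measured_qps_per_machine,
                total_docs, target_qps):
    """Scale a single-machine benchmark out to the full data set and up to the target QPS."""
    # Scale out: machines needed to hold one full copy of the data.
    machines_per_copy = math.ceil(total_docs / measured_docs_per_machine)
    # Scale up: full copies (replicas) needed to reach the target query rate.
    copies = math.ceil(target_qps / measured_qps_per_machine)
    return machines_per_copy, copies, machines_per_copy * copies

# Placeholder inputs: 400M docs and 20 qps measured on one machine.
print(extrapolate(400_000_000, 20, 4_000_000_000, 10_000))
```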

Care to explain why so much data needs to be searched at such a volume? 
Maybe you work for Google ;-)

To your question on scalability vs PostGIS, I think Solr shines in its
ability to scale out if you have the resources to do it.

~ David Smiley

-----
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Spatial-Search-for-Specif-Areas-on-Map-tp3995051p3995197.html
Sent from the Solr - User mailing list archive at Nabble.com.
