Have you considered presupposing SolrCloud and using the SPLITSHARD Collections API command? Even after that's done, at least one of the resulting sub-shards would (probably) need to be physically moved to another machine, but that too could be scripted.
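Something like the following would kick off the split. It's just a rough, untested sketch against the Collections API; the host, collection, and shard names are placeholders:

# Rough sketch: trigger a shard split via the SolrCloud Collections API.
# The host, collection and shard names below are placeholders, not real values.
import json
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr"  # assumed SolrCloud node

def split_shard(collection, shard):
    """Ask SolrCloud to split one logical shard into two sub-shards."""
    params = urllib.parse.urlencode({
        "action": "SPLITSHARD",
        "collection": collection,
        "shard": shard,
        "wt": "json",
    })
    url = "%s/admin/collections?%s" % (SOLR_URL, params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(split_shard("mycollection", "shard1"))

Bear in mind the split can take a while on a big shard (the Collections API has an async parameter for exactly that), and if memory serves the two sub-shards initially land on the same node as the parent, which is why the physical move is a separate step.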
May not be desirable, but I thought I'd mention it.

Best,
Erick

On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <peter.stu...@gmail.com> wrote:
> Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even
> performs reasonably well on commodity hardware with lots of faceting and
> concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but
> it works.
>
> ++1 for the automagic shard creator. We've been looking into doing this
> sort of thing internally - i.e. when a shard reaches a certain size/num
> docs, it creates 'sub-shards' to which new commits are sent and queries to
> the 'parent' shard are included. The concept works, as long as you don't
> try any non-dist stuff - it's one reason why all our fields are always
> single valued. There are also other implications like cleanup, deletes and
> security to take into account, to name a few.
> A cool side-effect of sub-sharding (for lack of a snappy term) is that the
> parent shard then stops suffering from auto-warming latency due to commits
> (we do a fair amount of committing). In theory, you could carry on
> sub-sharding until your hardware starts gasping for air.
>
>
> On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam <bram.van...@intix.eu> wrote:
>
>> On 01/04/2015 02:22 AM, Jack Krupansky wrote:
>>
>>> The reality doesn't seem to
>>> be there today. 50 to 100 million documents, yes, but beyond that takes
>>> some kind of "heroic" effort, whether a much beefier box, very careful and
>>> limited data modeling or limiting of query capabilities or tolerance of
>>> higher latency, expert tuning, etc.
>>>
>>
>> I disagree. On the scale, at least. Up until 500M Solr performs "well"
>> (read: well enough considering the scale) in a single shard on a single box
>> of commodity hardware. Without any tuning or heroic efforts. Sure, some
>> queries aren't as snappy as you'd like, and sure, indexing and querying at
>> the same time will be somewhat unpleasant, but it will work, and it will
>> work well enough.
>>
>> Will it work for thousands of concurrent users? Of course not. Anyone who
>> is after that sort of thing won't find themselves in this scenario -- they
>> will throw hardware at the problem.
>>
>> There is something to be said for making sharding less painful. It would
>> be nice if, for instance, Solr would automagically create a new shard once
>> some magic number was reached (2B at the latest, I guess). But then that'll
>> break some query features ... :-(
>>
>> The reason we're using single large instances (sometimes on beefy
>> hardware) is that SolrCloud is a pain. Not just from an administrative
>> point of view (though that seems to be getting better, kudos for that!),
>> but mostly because some queries cannot be executed with distributed=true.
>> Our users, at least, prefer a slow query over an impossible query.
>>
>> Actually, this 2B limit is a good thing. It'll help me convince
>> $management to donate some of our time to Solr :-)
>>
>> - Bram
>>
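As an aside, the "automagic" sub-sharding idea above could be approximated today with a small external watchdog rather than waiting for Solr to grow the feature: poll each shard's doc count and fire a SPLITSHARD when it crosses a threshold. A rough, untested sketch, assuming SolrCloud and the same placeholder host/collection/shard names as before, with the threshold kept well under Lucene's ~2.1 billion docs-per-index ceiling:

# Rough sketch of an external "automagic" shard splitter: poll the doc count
# of each watched shard and ask SolrCloud to split it once a threshold is
# crossed. All names and values are placeholders; error handling and the
# physical move of the resulting sub-shards are left out.
import json
import time
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr"   # assumed SolrCloud node
COLLECTION = "mycollection"               # hypothetical collection
SHARDS = ["shard1", "shard2"]             # logical shards to watch
MAX_DOCS_PER_SHARD = 500_000_000          # well under Lucene's ~2.1B cap

def solr_get(path, **params):
    """Issue a GET against Solr and return the parsed JSON response."""
    params["wt"] = "json"
    url = "%s/%s?%s" % (SOLR_URL, path, urllib.parse.urlencode(params))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def shard_doc_count(shard):
    """Doc count of one logical shard, via the 'shards' request parameter."""
    # note: numFound ignores deleted docs; the hard Lucene cap is on maxDoc
    rsp = solr_get("%s/select" % COLLECTION, q="*:*", rows=0, shards=shard)
    return rsp["response"]["numFound"]

def watch():
    while True:
        for shard in SHARDS:
            count = shard_doc_count(shard)
            if count > MAX_DOCS_PER_SHARD:
                print("splitting %s (%d docs)" % (shard, count))
                solr_get("admin/collections", action="SPLITSHARD",
                         collection=COLLECTION, shard=shard)
        time.sleep(3600)   # check hourly

if __name__ == "__main__":
    watch()

A real version would refresh the shard list (e.g. from CLUSTERSTATUS) after each split, since the parent goes inactive and the sub-shards get names like shard1_0 and shard1_1, and it would still have to move sub-shards off the original node and deal with the cleanup, delete, and security wrinkles Peter mentions.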