On Fri, Dec 16, 2016 at 5:55 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> You said that the data expires, but you haven't said
> how many docs you need to host at a time.

The data will expire in ~30 minutes on average. Many of them are updates
on the same document (this makes it worse, because updates are
delete+insert).

> At 10M/second inserts you'll.... need a boatload of shards. All of the
> conversation about one beefy machine .vs. lots of not-so-beefy
> machines should wait until you answer that question.

Note, there will be multiple beefy machines (just fewer compared to small
machines). I was asking what a big enough one would be, so that 1
instance will be able to use the full server.

> For instance, indexing some _very_ simple documents on my
> laptop can hit 10,000 docs/second/shard. So you're talking
> 1,000 shards here. Indexing more complex docs I might
> get 1,000 docs/second/shard, so then you'd need 10,000
> shards. Don't take these as hard numbers, I'm
> just trying to emphasize that you'll need to do scaling
> exercises to see if what you want to do is reasonable given
> your constraints.

Of course. I think I've done ~80K/s/server on a previous project (it
wasn't the bottleneck, so I didn't bother too much), but there are too
many knobs that will change that number.

> If those 10M docs/second are bursty and you can stand some
> latency, then that's one set of considerations. If it's steady-state
> it's another. In either case you need some _serious_ design work
> before you go forward.

I expect 10M to be the burst, but it needs to handle the burst. And I
don't think I will do 10M separate requests, but small batches.

> And then you want to facet (fq clauses aren't nearly as expensive)
> and want 2 second commit intervals.

It is what it is.

> You _really_ need to stand up some test systems and see what
> performance you can get before launching off on trying to do the
> whole thing. Fortunately, you can stand up, say, a 4 shard system
> and tune it and drive it as fast as you possibly can and extrapolate
> from there.

My ~thinking~ would be to have 1 instance/server + 1 shard/core + 1
thread for each shard. Assuming I remove all "blocking" disk operations
(I don't know if the network is async?), that should be the ~best~
scenario. I'll have to see how it functions in more detail though.

> But to reiterate. This is a very high indexing rate that very few
> organizations have attempted. You _really_ need to do a
> proof-of-concept _then_ plan.

It's why I posted: to ask what people have used as a big machine, and I
would test on that.

> Here's the long form of this argument:
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Best,
> Erick

Thanks!

> On Fri, Dec 16, 2016 at 5:19 AM, GW <thegeofo...@gmail.com> wrote:
> > Layer 2 bridge SAN is just for my Apache/apps on Conga so they can be
> > spun up on any host with a static IP. This has nothing to do with
> > Solr, which is running on plain old hardware.
> >
> > SolrCloud is on a real cluster, not on a SAN.
> >
> > The bit about dead with no error: I got this from a post I made asking
> > about the best way to deploy apps. I was shown some code on making your
> > app ZooKeeper aware. I am just getting to this, so I'm talking from my
> > ass. A ZK-aware program will have a list of nodes ready for business,
> > versus a plain old round robin. If data on a machine is corrupted you
> > can get 0 docs found, while a ZK-aware app will know that node is
> > shite.
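That last point is roughly what I'd do as well. As a sketch (assuming SolrJ 6.x, with a made-up ZK host string and collection name), the ZK-aware client exposes the cluster state it reads from ZooKeeper, so an app can skip replicas that aren't ACTIVE or whose node isn't live, instead of round-robining blindly. It won't catch a node that answers but returns garbage, but it covers the "node is down/recovering" case:

import java.util.Set;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class ZkAwareCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZK ensemble and collection name, replace with real ones.
        String zkHosts = "zk1:2181,zk2:2181,zk3:2181";
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost(zkHosts)
                .build()) {
            client.connect();

            // Cluster state comes straight from ZooKeeper, not from any single Solr node.
            ClusterState state = client.getZkStateReader().getClusterState();
            Set<String> liveNodes = state.getLiveNodes();
            System.out.println("Live nodes: " + liveNodes);

            // Replica-level view: a replica that is not ACTIVE, or whose node is
            // not live, should not receive requests. This is what a plain round
            // robin cannot know.
            DocCollection coll = state.getCollection("mycollection");
            for (Slice shard : coll.getSlices()) {
                for (Replica replica : shard.getReplicas()) {
                    boolean healthy = replica.getState() == Replica.State.ACTIVE
                            && liveNodes.contains(replica.getNodeName());
                    System.out.println(shard.getName() + " " + replica.getName()
                            + " healthy=" + healthy);
                }
            }
        }
    }
}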
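And for the small-batches idea above, this is roughly the indexing shape I have in mind, again only a sketch with SolrJ 6.x and placeholder collection/field names. CloudSolrClient hashes the ids and sends each sub-batch straight to the shard leaders it learned about from ZK; the batch size and commitWithin values would have to come out of the proof-of-concept runs:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.UUID;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZK ensemble, collection and field names.
        String zkHosts = "zk1:2181,zk2:2181,zk3:2181";
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost(zkHosts)
                .build()) {
            client.setDefaultCollection("mycollection");

            // Index in batches instead of one request per document; the client
            // splits each batch and routes the pieces to the right shard leaders.
            int batchSize = 1000; // needs tuning in the proof-of-concept
            List<SolrInputDocument> batch = new ArrayList<>(batchSize);
            for (int i = 0; i < 100_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("category_s", "cat" + (i % 50));
                doc.addField("created_dt", new Date());
                batch.add(doc);
                if (batch.size() == batchSize) {
                    // commitWithin ~2s matches the soft-commit target; alternatively
                    // leave it out and set autoSoftCommit maxTime in solrconfig.xml.
                    client.add(batch, 2000);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch, 2000);
            }
        }
    }
}

Whether one sending thread per shard actually keeps a leader busy is exactly the kind of thing the test runs will have to show.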
> >
> > On 16 December 2016 at 07:20, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> >
> >> On Fri, Dec 16, 2016 at 12:39 PM, GW <thegeofo...@gmail.com> wrote:
> >>
> >> > Dorian,
> >> >
> >> > From my reading, my belief is that you just need some beefy machines
> >> > for your zookeeper ensemble so they can think fast.
> >>
> >> ZooKeeper needs to think fast enough for cluster state/changes. So I
> >> think it scales with the number of machines/collections/shards and not
> >> documents.
> >>
> >> > After that your issues are complicated by drive I/O, which I believe
> >> > is solved by using shards. If you have a collection running on top of
> >> > a single drive array, it should not compare to writing to a dozen
> >> > drive arrays. So a whole bunch of light-duty machines that have a
> >> > decent amount of memory and are barely able to process faster than
> >> > their drive I/O will serve you better.
> >>
> >> My dataset will be smaller than total memory, so I expect no query to
> >> hit disk.
> >>
> >> > I think the Apache big data mandate was to be horizontally scalable
> >> > to infinity with cheap consumer hardware. In my mind's eye you are
> >> > not going to get crazy input rates without a big horizontal drive
> >> > system.
> >>
> >> There is overhead with small machines, and with very big machines
> >> (pricey). So something in the middle. So a small cluster of big
> >> machines or a big cluster of small machines.
> >>
> >> > I'm in the same boat. All the scaling and roll-out documentation
> >> > seems to reference the Witch Doctor's secret handbook.
> >> >
> >> > I just started making my applications ZK aware and am really just
> >> > starting to understand the architecture. After a whole year I still
> >> > feel weak, while at the same time I have traveled far. I still feel
> >> > like an amateur.
> >> >
> >> > My plans are to use bridge tools in Linux so all my machines are
> >> > sitting on the switch with layer 2. Then use Conga to monitor which
> >> > apps need to be running. If a server dies, its apps are spun up on
> >> > one of the other servers using the original IP and MAC address
> >> > through a bridge firewall gateway, so there is no hold-up with MAC
> >> > phreaking like layer 3. Layer 3 does not like to see a route change
> >> > with a MAC address. My apps will be on a SAN ~ data on as many
> >> > shards/machines as financially possible.
> >>
> >> By Conga, do you mean https://sourceware.org/cluster/conga/spec/ ?
> >> Also, a SAN may/will suck, like someone answered in your thread.
> >>
> >> > I was going to put a bunch of Apache web servers in round robin to
> >> > talk to Solr but discovered that a Solr node can be dead and not
> >> > report errors.
> >>
> >> Please explain more about "dead but no error".
> >>
> >> > It's all rough at the moment, but it makes total sense to send Solr
> >> > requests based on what ZK says is available versus a round robin.
> >>
> >> Yes, like I & another commenter wrote on your thread.
> >>
> >> > Will keep you posted on my roll-out if you like.
> >> >
> >> > Best,
> >> >
> >> > GW
> >> >
> >> > On 16 December 2016 at 03:31, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> >> >
> >> > > Hello searchers,
> >> > >
> >> > > I'm researching Solr for a project that would require a max insert
> >> > > rate (10M/s) and some heavy facet+fq on top of that, though at low
> >> > > qps.
> >> > >
> >> > > And I'm trying to find blogs/slides where people have used some big
> >> > > machines instead of hundreds of small ones.
> >> > >
> >> > > 1. The largest I've found is this
> >> > > <https://sbdevel.wordpress.com/2016/11/30/70tb-16b-docs-4-machines-1-solrcloud/>
> >> > > with 16 cores + 384 GB RAM, but they were using 25! Solr 4
> >> > > instances per server, which seems wasteful to me?
> >> > >
> >> > > I know that 1 Solr instance can have a max heap of ~29-30 GB,
> >> > > because GC is wasteful/sucks after that, and you should leave the
> >> > > rest to the OS for file cache.
> >> > > 2. But do you think 1 instance will be able to fully use a
> >> > > 256 GB / 20-core machine?
> >> > >
> >> > > 3. Would you like to share your findings/links on big-machine
> >> > > clusters?
> >> > >
> >> > > Thank You
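Following Erick's suggestion above, the proof-of-concept I plan to start from is roughly this: create a small 4-shard test collection, drive indexing as hard as possible, run the facet+fq query shape from the original question against it, and extrapolate. Again only a sketch, assuming SolrJ 6.x; the config set name ("myconf"), collection name, and field names are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PocSetup {
    public static void main(String[] args) throws Exception {
        String zkHosts = "zk1:2181,zk2:2181,zk3:2181"; // placeholder ensemble
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost(zkHosts)
                .build()) {

            // 4 shards, 1 replica each: the small system to tune and extrapolate from.
            // "myconf" is a config set assumed to be already uploaded to ZooKeeper.
            CollectionAdminRequest
                    .createCollection("poc", "myconf", 4, 1)
                    .process(client);

            // The query shape from the original question: cheap fq filters plus a
            // facet, with rows=0 since only the facet counts matter here.
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("category_s:cat7"); // placeholder field
            q.setFacet(true);
            q.addFacetField("category_s");
            q.setFacetLimit(20);
            q.setRows(0);

            QueryResponse rsp = client.query("poc", q);
            System.out.println(rsp.getFacetField("category_s").getValues());
        }
    }
}

Whatever docs/second a tuned 4-shard setup sustains with this query load running, the shard count for the 10M/s burst is then just multiplication, as Erick says.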