Sorry, sent early. To explain further: in the proposed design the scheduler is entirely decentralized, and no node ever holds all of the information you're describing in heap at once (in fact, no single node would ever hold that information). Each node is responsible only for the tokens it is the "primary" replica of. Each primary range is then handled table by table, and each table's range is individually split into subranges, at most a few hundred range splits at a time (typically one or two; you don't want too many, otherwise you end up with too many small sstables). This is all at most megabytes of data, and I really do believe it would not cause significant, if any, heap pressure. The repairs *themselves* certainly create heap pressure, but that happens regardless of the scheduler.
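To make the shape of that per-node state concrete, here is a minimal sketch of the loop I'm describing, not the proposal's actual code; the class and method names (DecentralizedRepairScheduler, TokenRange, splitEvenly, repairSubrange) are placeholders I made up. Only locally-primary ranges are considered, one table at a time, each split into a bounded number of subranges.

import java.util.List;

public final class DecentralizedRepairScheduler
{
    // Typically one or two splits per range; at most a few hundred per table,
    // to avoid generating too many small sstables.
    private static final int SPLITS_PER_RANGE = 2;

    public void runOnce(List<TokenRange> primaryRanges, List<String> tables)
    {
        for (String table : tables)                  // one table at a time
        {
            for (TokenRange range : primaryRanges)   // only ranges this node is primary for
            {
                // The only scheduling state held in heap at any moment is this
                // small list of subranges for the current table: a few hundred
                // entries at most, not the whole cluster's plan.
                List<TokenRange> splits = range.splitEvenly(SPLITS_PER_RANGE);
                for (TokenRange subrange : splits)
                    repairSubrange(table, subrange); // blocking; state is dropped after each repair
            }
        }
    }

    private void repairSubrange(String table, TokenRange subrange)
    {
        // Hand the subrange to the normal repair machinery; the heap cost of the
        // repair itself is paid here regardless of how scheduling is done.
    }

    // Hypothetical token-range abstraction, for illustration only.
    public interface TokenRange
    {
        List<TokenRange> splitEvenly(int parts);
    }
}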
-Joey

On Thu, Apr 5, 2018 at 7:25 PM, Joseph Lynch <joe.e.ly...@gmail.com> wrote:

>> I wouldn't trivialize it, scheduling can end up dealing with more than a
>> single repair. If there's 1000 keyspaces/tables, with 400 nodes and 256
>> vnodes on each, that's a lot of repairs to plan out and keep track of, and
>> can easily cause heap allocation spikes if opted in.
>>
>> Chris
>
> The current proposal never keeps track of more than a few hundred range
> splits for a single table at a time, and nothing ever keeps state for the
> entire 400 node cluster. Compared to the load generated by actually
> repairing the data, I actually do think it is trivial heap pressure.
>
> Somewhat beside the point, I wasn't aware there were any 100+ node
> clusters running with vnodes; if my math is correct they would be
> excessively vulnerable to outages with that many vnodes and that many
> nodes. Most of the large clusters I've heard of (100 nodes plus) are
> running with a single token or at most 4 tokens per node.
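P.S. As a rough back-of-envelope check on the numbers in the quoted thread (the per-entry byte size here is my assumption, not a figure from the proposal):

public final class RepairStateEnvelope
{
    public static void main(String[] args)
    {
        long nodes = 400, vnodesPerNode = 256, tables = 1000;
        long bytesPerTrackedRange = 512; // assumed: token bounds plus a little status bookkeeping

        // If one node tried to plan the whole cluster for every table at once:
        long globalEntries = nodes * vnodesPerNode * tables;             // 102,400,000
        System.out.printf("global plan: %,d entries, ~%,d MB%n",
                          globalEntries, globalEntries * bytesPerTrackedRange / (1024 * 1024));

        // Per node under the proposal: primary ranges only, one table at a
        // time, one or two splits each, so a few hundred entries.
        long perNodeEntries = vnodesPerNode * 2;                         // 512
        System.out.printf("per-node working set: %,d entries, ~%,d KB%n",
                          perNodeEntries, perNodeEntries * bytesPerTrackedRange / 1024);
    }
}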