Hi,

Changes from the RFC:
 - Moved some fields from the per-qdisc data structure to the per
   schedule entry one, mainly "expires" (now called "close_time", when
   an entry ends) and "budget" (how many bytes can be sent during an
   entry);
 - Removed support for the schedule file, in favour of using iproute2
   batch mode (only affects the iproute2 patches) (Jiri Pirko, Stephen
   Hemminger);
 - Removed support for manually setting a cycle-time (it will be added
   in a later series);


Original cover letter
=====================

(lightly edited, updated references and usage)

This series provides a set of interfaces that can be used by
applications that require (time-based) Scheduled Transmission of
packets. It comprises 3 new kernel components:
  - etf: the per-queue TxTime-Based scheduling qdisc;
  - taprio: the per-port Time-Aware scheduler qdisc;
  - SO_TXTIME: a socket option + cmsg APIs.

ETF and SO_TXTIME are already applied[1] into the net-next tree. This
is the remaining piece.


Overview
========

The CBS qdisc proposal RFC [2] included some rough ideas about the
design and API of a "taprio" (Time Aware Priority) qdisc. The idea of
presenting the taprio ideas at that point (almost one year ago!) was
to show our vision of how things would fit together going forward.
From that concept stage to this (almost) realised stage the main
differences are:

  - As of now, taprio is a software only implementation of a schedule
    executor;
  - Instead of taprio centralising all the time based decisions,
    taprio can work together with ETF (the Earliest TxTime First), a
    qdisc meant to use the LaunchTime (or similar) feature of various
    network controllers;

In a nutshell, taprio is a root qdisc that can execute a pre-defined
schedule, etf is a qdisc that provides time based admission control
and "earliest deadline first" dequeue mode, and SO_TXTIME is a socket
option that is used for enabling a socket to be used for time-based Tx
and configuring its parameters.


taprio
======

This scheduler allows the network administrator to configure schedules
for classes of traffic. The configuration interface is similar to what
IEEE 802.1Q-2018 defines.

Example configuration:

$ tc qdisc add dev enp2s0 parent root handle 100 taprio \
          num_tc 3 \
          map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
          queues 1@0 1@1 2@2 \
          sched-entry S 01 300000 \
          sched-entry S 02 300000 \
          sched-entry S 04 300000 \
          base-time 1528743495910289987 \
          clockid CLOCK_TAI

This qdisc borrows a few concepts from mqprio, so most of its
parameters are similar to mqprio's. The main difference is the
sequence of 'sched-entry' parameters, which constitute one schedule:

          sched-entry S 01 300000
          sched-entry S 02 300000
          sched-entry S 04 300000

The format of each entry is:

          sched-entry <CMD> <GATE MASK> <INTERVAL>

The only supported <CMD> is "S", which means "SetGateStates",
following the IEEE 802.1Q-2018 definition (Table 8-7). <GATE MASK> is
a bit-mask where each bit is associated with a traffic class, so bit 0
(the least significant bit) being "on" means that traffic class 0 is
"active" for that schedule entry. <INTERVAL> is a time duration in
nanoseconds that specifies for how long the state defined by <CMD>
and <GATE MASK> should be held before moving to the next entry.

This schedule is circular, that is, after the last entry is executed
it starts from the first one, indefinitely.

The other parameters can be defined as follows:

 - base-time: specifies the instant when the schedule starts, which
   allows multiple systems to have synchronised schedules;
 - clockid: specifies the reference clock to be used;

A more complete example can be found here, with instructions on how to
test it:

https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [3]

The basic design of the scheduler is simple: after we calculate the
first expiration of the hrtimer, we set each subsequent expiration to
be the previous one plus the current entry's interval.
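
As an illustration only (this is not code from the patch), the
gate-mask and timing rules above can be sketched in Python. The names
SchedEntry, open_classes and walk are hypothetical, and the sketch
assumes that the first entry begins exactly at the given start time:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SchedEntry:
    gate_mask: int    # bit N set => traffic class N is "active"
    interval_ns: int  # how long this gate state is held, in nanoseconds


def open_classes(gate_mask: int, num_tc: int) -> List[int]:
    """Return the traffic classes whose bit is set in gate_mask."""
    return [tc for tc in range(num_tc) if gate_mask & (1 << tc)]


def walk(schedule: List[SchedEntry], start_ns: int,
         steps: int) -> List[Tuple[int, int]]:
    """Return (entry index, expiration) pairs for the first `steps` entries.

    Each expiration is the previous one plus the current entry's
    interval; the index wraps around because the schedule is circular.
    Assumption: entry 0 starts exactly at start_ns.
    """
    result = []
    expires = start_ns
    for step in range(steps):
        index = step % len(schedule)
        expires += schedule[index].interval_ns
        result.append((index, expires))
    return result


# The three-entry schedule from the example configuration.
schedule = [
    SchedEntry(0x01, 300000),
    SchedEntry(0x02, 300000),
    SchedEntry(0x04, 300000),
]

for index, expires in walk(schedule, start_ns=0, steps=4):
    classes = open_classes(schedule[index].gate_mask, num_tc=3)
    print(f"entry {index}: classes {classes} open until t={expires} ns")
```

With the example schedule, the fourth step wraps back to entry 0, so
traffic class 0 is open again after the third entry closes.
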
Each time the hrtimer callback runs, we set the current_entry, which
has a gate_mask (that controls which traffic classes are allowed to
"go out" during each interval), and we reuse this callback to "kick"
the qdisc (this is the reason that the usual qdisc watchdog isn't
used).


Future work
===========

 - Add support for multiple schedules, so something like the Admin and
   Oper schedules from IEEE 802.1Q-2018 can be implemented;
   "cycle-time" will probably be re-implemented at that point;
 - Add support for HW offloading;
 - Add support for Frame Preemption related commands (formerly
   802.1Qbu, now part of 802.1Q);


Known Issues
============

 - As taprio is a software only implementation, and there's another
   layer of queuing in the network controller, packets can still leave
   the controller outside their "correct" windows. This happens mostly
   for low-priority classes, and only if they are 'starved' by the
   higher priority ones;


This series is also hosted on github and can be found at [4]. The
companion iproute2 patches can be found at [5].

Cheers,
--
Vinicius

[1] https://patchwork.ozlabs.org/cover/938991/

[2] https://patchwork.ozlabs.org/cover/808504/

[3] github doesn't make it clear, but the gist can be cloned like this:
$ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f taprio-test

[4] https://github.com/vcgomes/linux/tree/taprio-v1

[5] https://github.com/vcgomes/iproute2/tree/taprio-v1


Vinicius Costa Gomes (1):
  tc: Add support for configuring the taprio scheduler

 include/uapi/linux/pkt_sched.h |  46 ++
 net/sched/Kconfig              |  11 +
 net/sched/Makefile             |   1 +
 net/sched/sch_taprio.c         | 962 +++++++++++++++++++++++++++++++++
 4 files changed, 1020 insertions(+)
 create mode 100644 net/sched/sch_taprio.c

--
2.19.0