http://linux-vserver.org/Capabilities_and_Flags
Capabilities and Flags
In computer science, a capability is a token used by a process to prove
that it is allowed to perform an operation on an object. The Linux
capability system is based on "POSIX Capabilities", a somewhat
different concept designed to split the all-powerful root privilege
into a set of distinct privileges.
The Capability/Flag System
POSIX Capabilities
A process has three capability sets, each implemented as a bitmap:
inheritable (I), permitted (P), and effective (E). Each capability is
a bit in each of these bitmaps that is either set or unset.
When a process tries to do a privileged operation, the
operating system will check the appropriate bit in the effective set of
the process (instead of checking whether the effective uid of the
process is 0 as is normally done).
For example, when a process tries to set the clock, the Linux
kernel will check that the process has the CAP_SYS_TIME bit (which is
currently bit 25) set in its effective set.
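The check described above is a single bit test on the effective bitmap. A minimal sketch in Python, using the bit number quoted in the text; the function name `has_capability` is made up for illustration and is not a kernel or library API:

```python
CAP_SYS_TIME = 25  # bit number quoted in the text above


def has_capability(effective: int, cap_bit: int) -> bool:
    """Return True if bit `cap_bit` is set in the effective bitmap."""
    return bool((effective >> cap_bit) & 1)


# A process whose effective set contains only CAP_SYS_TIME.
effective = 1 << CAP_SYS_TIME
print(hex(effective))
print(has_capability(effective, CAP_SYS_TIME))
```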
The permitted set of the process indicates the capabilities
the process can use. The process can have capabilities set in the
permitted set that are not in the effective set.
This indicates that the process has temporarily disabled this
capability. A process is allowed to set a bit in its effective set only
if it is available in the permitted set. The distinction between
effective and permitted exists so that processes can "bracket"
operations that need privilege.
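The "bracketing" idea can be modelled with two integers. The following is only a toy model of the rules stated above (a bit may be raised in the effective set only if it is in the permitted set); the class and method names are hypothetical and do not correspond to any real API:

```python
from contextlib import contextmanager


class CapabilitySets:
    """Toy model of the permitted and effective bitmaps of one process."""

    def __init__(self, permitted: int, effective: int = 0):
        self.permitted = permitted
        # The effective set may never exceed the permitted set.
        self.effective = effective & permitted

    def raise_cap(self, bit: int) -> None:
        mask = 1 << bit
        if not self.permitted & mask:
            raise PermissionError(f"capability bit {bit} is not permitted")
        self.effective |= mask

    def drop_cap(self, bit: int) -> None:
        self.effective &= ~(1 << bit)

    @contextmanager
    def bracket(self, bit: int):
        """Enable a capability only for the duration of a with-block."""
        already_set = bool(self.effective & (1 << bit))
        self.raise_cap(bit)
        try:
            yield self
        finally:
            if not already_set:
                self.drop_cap(bit)


caps = CapabilitySets(permitted=1 << 25)  # CAP_SYS_TIME permitted, not effective
with caps.bracket(25):
    print(hex(caps.effective))  # capability is enabled inside the bracket
print(hex(caps.effective))      # and disabled again afterwards
```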
The inheritable capabilities are the capabilities of the
current process that should be inherited by a program executed by the
current process. The permitted set of a process is masked against the
inheritable set during exec(). Nothing special happens during fork() or
clone(). Child processes and threads are given an exact copy of the
capabilities of the parent process.
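The behaviour across fork() and exec() as described in this section can be sketched as follows. This models only the simplified rule given in the text (permitted is masked against inheritable on exec); clamping the effective set to the new permitted set is an assumption made here to keep the model consistent, and all names are hypothetical:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Caps:
    inheritable: int
    permitted: int
    effective: int


def fork(parent: Caps) -> Caps:
    """fork()/clone(): the child gets an exact copy of the parent's sets."""
    return replace(parent)


def exec_(proc: Caps) -> Caps:
    """exec(): permitted is masked against inheritable (simplified model)."""
    permitted = proc.permitted & proc.inheritable
    # Assumption: effective is clamped so it never exceeds permitted.
    return Caps(proc.inheritable, permitted, proc.effective & permitted)


parent = Caps(inheritable=0b0011, permitted=0b0111, effective=0b0101)
print(fork(parent) == parent)
print(bin(exec_(parent).permitted))
```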
The implementation in Linux stopped at this point, whereas
POSIX Capabilities also require capability sets on files,
to replace the SUID flag (at least for executables).
Upper Bound for Capabilities
The current Linux capability system does not implement the
filesystem-related portions of POSIX Capabilities, which would make
setuid and setgid executables secure. Because of this, and because it is
much safer to have a secure upper bound for all processes within a
context, an additional per-context capability mask has been added to
limit all processes belonging to that context to this mask. The
individual caps (bits) of the capability bound mask have exactly the same
meaning as in the permitted capability set.
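In bitmap terms, an upper bound is a bitwise AND. The sketch below only illustrates that idea with bit numbers from the system capabilities table further down; the function name is hypothetical and this is not the kernel implementation:

```python
CAP_CHOWN = 1 << 0       # bit 0 in the system capabilities table
CAP_SYS_ADMIN = 1 << 21  # bit 21 in the system capabilities table


def apply_context_bound(cap_set: int, bound_mask: int) -> int:
    """Limit a process capability set to the per-context bound mask."""
    return cap_set & bound_mask


# A process that would otherwise hold CHOWN and SYS_ADMIN, inside a
# context whose bound mask only allows CHOWN.
process_caps = CAP_CHOWN | CAP_SYS_ADMIN
bound = CAP_CHOWN
print(hex(apply_context_bound(process_caps, bound)))
```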
Context Capabilities
As the Linux capabilities have almost reached the maximum number
that is possible without heavy modifications to the kernel, it was a
natural step to add a context-specific capability system.
The Linux-VServer context capability set acts as a mechanism to
fine tune existing Linux capabilities. It is not visible to the
processes within a context, as they would not know how to modify or
verify it.
In general there are two ways to use context capabilities:
- Require one or more context capabilities to be set in
addition to a given Linux capability, each one controlling a distinct
part of the functionality. For example, CAP_NET_RAW could be split
into RAW and PACKET sockets, so you could take away each of them
separately by not providing the required context capability.
- Consider the context capability sufficient for a specified
functionality, even if the Linux capability says something different.
For example, mount() requires CAP_SYS_ADMIN, which adds a dozen other
things we do not want, so we define VXC_SECURE_MOUNT to allow mounts
for certain contexts.
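The two styles of check can be sketched as boolean logic over the Linux capability bitmap and the context capability bitmap. This is an illustration only: `VXC_RAW_SOCKET` and `VXC_PACKET_SOCKET` are hypothetical context capabilities invented for the first style, while the mount example uses the names and masks given in this document:

```python
CAP_NET_RAW = 1 << 13        # RAW and PACKET sockets, per the system capabilities table
CAP_SYS_ADMIN = 1 << 21
VXC_SECURE_MOUNT = 0x00010000
VXC_RAW_SOCKET = 1 << 40     # hypothetical, for illustration only
VXC_PACKET_SOCKET = 1 << 41  # hypothetical, for illustration only


def needs_both(bcaps: int, ccaps: int, linux_cap: int, context_cap: int) -> bool:
    """Style 1: the context capability further restricts a Linux capability."""
    return bool(bcaps & linux_cap) and bool(ccaps & context_cap)


def either_suffices(bcaps: int, ccaps: int, linux_cap: int, context_cap: int) -> bool:
    """Style 2: the context capability alone is enough for one operation."""
    return bool(bcaps & linux_cap) or bool(ccaps & context_cap)


# Style 1: guest has CAP_NET_RAW but only the (hypothetical) RAW part.
print(needs_both(CAP_NET_RAW, VXC_RAW_SOCKET, CAP_NET_RAW, VXC_RAW_SOCKET))
print(needs_both(CAP_NET_RAW, VXC_RAW_SOCKET, CAP_NET_RAW, VXC_PACKET_SOCKET))

# Style 2: guest lacks CAP_SYS_ADMIN but may still mount.
print(either_suffices(0, VXC_SECURE_MOUNT, CAP_SYS_ADMIN, VXC_SECURE_MOUNT))
```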
The difference between the context flags and the context
capabilities is more an abstract logical separation than a functional
one, because they are handled in a very similar way.
List of capabilities/flags
Below is a list of capabilities and flags used for contexts and the
processes within them. The tables contain the following information:
- Bit: the bit number that enables the capability/flag
- Mask: the corresponding bit mask in hexadecimal notation
- Name: human-readable identifier used in userspace utilities
- Tag: special capability/flag code denoting special behaviour, legacy
usage and others (see below)
- Description: description of the capability/flag effects
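The Bit and Mask columns are related by a left shift: the mask is 1 shifted left by the bit number. The tables below also write masks above bit 31 in a shorthand such as `0x0001<<32`. A small sketch of both conversions (the helper names are hypothetical):

```python
def format_mask(bit: int) -> str:
    """Render the mask for a bit number as 8-digit hexadecimal."""
    return f"0x{1 << bit:08x}"


def parse_mask(text: str) -> int:
    """Parse a mask written either as plain hex or as 'hex<<shift'."""
    base, _, shift = text.partition("<<")
    return int(base.strip(), 16) << (int(shift) if shift else 0)


print(format_mask(17))
print(parse_mask("0x0010<<48") == 1 << 52)
```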
Special capability/flags codes
The tag column may contain one or more of the following tags:
| Tag | Description |
| I | Internal use only |
| L | Only supported with legacy enabled |
| O | One time capability/flag (once it's cleared, it can't be re-enabled again) |
| U | Unsupported |
| X | Slightly different meaning in legacy |
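The semantics of the "O" tag can be modelled as a flag word that remembers which one-time bits have been cleared. This is only a sketch of the behaviour described in the table, not the kernel's code; the class name and its fields are hypothetical:

```python
STATE_ADMIN = 0x0004 << 32  # tagged O in the context flags table
SCHED_HARD = 0x00000100     # ordinary flag, no tag


class FlagWord:
    """Flag bitmap in which some bits can never be set again once cleared."""

    def __init__(self, value: int, one_time_mask: int):
        self.value = value
        self.one_time_mask = one_time_mask
        self.burned = 0  # one-time bits that have been cleared at least once

    def update(self, new_value: int) -> int:
        # One-time bits that were cleared earlier stay cleared.
        new_value &= ~self.burned
        # One-time bits going from set to unset are now used up.
        self.burned |= self.value & ~new_value & self.one_time_mask
        self.value = new_value
        return self.value


flags = FlagWord(STATE_ADMIN | SCHED_HARD, one_time_mask=STATE_ADMIN)
flags.update(0)                                    # clear everything
print(hex(flags.update(STATE_ADMIN | SCHED_HARD)))  # only SCHED_HARD returns
```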
Context capabilities (ccaps)
The set of available context capabilities is specific to
Linux-VServer and applied to all processes contained within a context.
Below is a list of capabilities currently available in 2.1.1 and above.
| Bit | Mask | Name | Tag | Description |
| 0 | 0x00000001 | SET_UTSNAME | | Allow setdomainname(2) and sethostname(2) |
| 1 | 0x00000002 | SET_RLIMIT | | Allow setrlimit(2) |
| 8 | 0x00000100 | RAW_ICMP | | Allow usage of raw ICMP sockets |
| 12 | 0x00001000 | SYSLOG | | Allow syslog(2) |
| 16 | 0x00010000 | SECURE_MOUNT | | Allow secure mount(2) |
| 17 | 0x00020000 | SECURE_REMOUNT | | Allow secure remount |
| 18 | 0x00040000 | BINARY_MOUNT | | Allow binary/network mounts |
| 20 | 0x00100000 | QUOTA_CTL | | Allow quota ioctls |
| 21 | 0x00200000 | ADMIN_MAPPER | | Allow access to device mapper |
| 22 | 0x00400000 | ADMIN_CLOOP | | Allow access to loop devices |
| 24 | 0x01000000 | KTHREAD | | Allow creating kernel threads |
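To combine several context capabilities into one value, OR their masks together. A minimal sketch using the names and masks from the table above; the dictionary and the helper `ccaps_mask` are illustrative and not part of any Linux-VServer tool:

```python
CCAPS = {
    "SET_UTSNAME": 0x00000001,
    "SET_RLIMIT": 0x00000002,
    "RAW_ICMP": 0x00000100,
    "SYSLOG": 0x00001000,
    "SECURE_MOUNT": 0x00010000,
    "SECURE_REMOUNT": 0x00020000,
    "BINARY_MOUNT": 0x00040000,
    "QUOTA_CTL": 0x00100000,
    "ADMIN_MAPPER": 0x00200000,
    "ADMIN_CLOOP": 0x00400000,
    "KTHREAD": 0x01000000,
}


def ccaps_mask(names) -> int:
    """OR together the masks of the named context capabilities."""
    mask = 0
    for name in names:
        mask |= CCAPS[name.upper()]  # KeyError for unknown names
    return mask


print(f"0x{ccaps_mask(['SET_RLIMIT', 'RAW_ICMP']):08x}")
```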
Context flags (cflags)
The set of available context flags is specific to Linux-VServer and
applied to all processes contained within a context. Below is a list of
flags available in 2.1.1 and above.
| Bit | Mask | Name | Tag | Description |
| 0 | 0x00000001 | INFO_LOCK | L | Prohibit further context migration |
| 1 | 0x00000002 | INFO_SCHED | L | Account all processes as one |
| 2 | 0x00000004 | INFO_NPROC | L | Apply process limits to context |
| 3 | 0x00000008 | INFO_PRIVATE | L | Context cannot be entered |
| 4 | 0x00000010 | INFO_INIT | X | Show a fake init process |
| 5 | 0x00000020 | INFO_HIDE | X | Hide context information in task status |
| 6 | 0x00000040 | INFO_ULIMIT | L | Apply ulimits to context |
| 7 | 0x00000080 | INFO_NSPACE | L | Use private namespace |
| 8 | 0x00000100 | SCHED_HARD | | Enable hard scheduler |
| 9 | 0x00000200 | SCHED_PRIO | | Enable priority scheduler |
| 10 | 0x00000400 | SCHED_PAUSE | | Pause context (unschedule) |
| 16 | 0x00010000 | VIRT_MEM | | Virtualize memory information |
| 17 | 0x00020000 | VIRT_UPTIME | | Virtualize uptime information |
| 18 | 0x00040000 | VIRT_CPU | | Virtualize cpu usage information |
| 19 | 0x00080000 | VIRT_LOAD | | Virtualize load average information |
| 20 | 0x00100000 | VIRT_TIME | | Allow per guest time offsets |
| 24 | 0x01000000 | HIDE_MOUNT | | Hide entries in /proc/$pid/mounts |
| 25 | 0x02000000 | HIDE_NETIF | | Hide foreign network interfaces |
| 26 | 0x04000000 | HIDE_VINFO | | Hide context information in task status |
| 32 | 0x0001<<32 | STATE_SETUP | IO | Enable setup state |
| 33 | 0x0002<<32 | STATE_INIT | IO | Enable init state |
| 34 | 0x0004<<32 | STATE_ADMIN | O | Enable admin state |
| 36 | 0x0010<<32 | SC_HELPER | I | Enable state change helper |
| 37 | 0x0020<<32 | REBOOT_KILL | | Kill all processes on reboot(2) |
| 38 | 0x0040<<32 | PERSISTENT | | Make context persistent |
| 48 | 0x0001<<48 | FORK_RSS | | Block fork when RSS limit is exceeded |
| 49 | 0x0002<<48 | PROLIFIC | | Allow context to create new contexts |
| 52 | 0x0010<<48 | IGNEG_NICE | | Ignore priority raise |
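Because the context flags extend past bit 31, a flag word has to be treated as a 64-bit value. The sketch below decodes such a value back into names, using a subset of the masks from the table above; the dictionary and the function `decode_cflags` are illustrative only:

```python
# Subset of the table above, keyed by mask.
CFLAGS = {
    0x00000100: "SCHED_HARD",
    0x00000200: "SCHED_PRIO",
    0x00010000: "VIRT_MEM",
    0x00020000: "VIRT_UPTIME",
    0x02000000: "HIDE_NETIF",
    0x0004 << 32: "STATE_ADMIN",
    0x0040 << 32: "PERSISTENT",
    0x0001 << 48: "FORK_RSS",
}


def decode_cflags(value: int):
    """Return (names of known flags set in value, leftover unknown bits)."""
    names = []
    for mask in sorted(CFLAGS):
        if value & mask:
            names.append(CFLAGS[mask])
            value &= ~mask
    return names, value


value = (0x0040 << 32) | 0x00010000 | 0x00000100
names, unknown = decode_cflags(value)
print(names, hex(unknown))
```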
Network context flags (nflags)
The set of available network context flags is specific to
Linux-VServer and applied to all processes contained within a network
context. Below is a list of flags available in 2.1.1 and above.
| Bit | Mask | Name | Tag | Description |
| 0 | 0x00000001 | INFO_LOCK | | Prohibit further context migration |
| 8 | 0x00000100 | SINGLE_IP | | Enable special handling of network contexts with a single IP only |
| 9 | 0x00000200 | LBACK_REMAP | | Use loopback virtualisation (will only work in 2.3.0.xx or greater) |
| 10 | 0x00000400 | LBACK_ALLOW | | If set, allows guests without LBACK_REMAP to connect to 127.0.0.0/8 |
| 25 | 0x02000000 | HIDE_NETIF | | Hide foreign network interfaces |
| 26 | 0x04000000 | HIDE_LBACK | | Hide the real loopback address from the guest (rewrites to 127.0.0.1) (will only work in 2.3.0.xx or greater) |
| 32 | 0x0001<<32 | STATE_SETUP | IO | Enable setup state |
| 34 | 0x0004<<32 | STATE_ADMIN | O | Enable admin state |
| 36 | 0x0010<<32 | SC_HELPER | I | Enable state change helper |
| 38 | 0x0040<<32 | PERSISTENT | | Make network context persistent |
System capabilities (bcaps)
The set of available system capabilities is inherited from the Linux
kernel and applied to all processes contained within a context. Below
is a list of capabilities currently available in the vanilla kernel.
BIG FAT WARNING: Adding any system capability to your virtual server
WILL reduce security. Do not change the default values unless you
absolutely know what you are doing!
| Bit | Mask | Name | Description |
| 0 | 0x00000001 | CHOWN | In a system with the [_POSIX_CHOWN_RESTRICTED] option defined, this overrides the restriction of changing file ownership and group ownership. |
| 1 | 0x00000002 | DAC_OVERRIDE | Override all DAC access, including ACL execute access if [_POSIX_ACL] is defined. Excluding DAC access covered by CAP_LINUX_IMMUTABLE. |
| 2 | 0x00000004 | DAC_READ_SEARCH | Overrides all DAC restrictions regarding read and search on files and directories, including ACL restrictions if [_POSIX_ACL] is defined. Excluding DAC access covered by CAP_LINUX_IMMUTABLE. |
| 3 | 0x00000008 | FOWNER | Overrides all restrictions about allowed operations on files, where file owner ID must be equal to the user ID, except where CAP_FSETID is applicable. It doesn't override MAC and DAC restrictions. |
| 4 | 0x00000010 | FSETID | Overrides the following restrictions: that the effective user ID shall match the file owner ID when setting the S_ISUID and S_ISGID bits on that file; that the effective group ID (or one of the supplementary group IDs) shall match the file owner ID when setting the S_ISGID bit on that file; that the S_ISUID and S_ISGID bits are cleared on successful return from chown(2) (not implemented). |
| 5 | 0x00000020 | KILL | Overrides the restriction that the real or effective user ID of a process sending a signal must match the real or effective user ID of the process receiving the signal. |
| 6 | 0x00000040 | SETGID | Allows setgid(2) manipulation; allows setgroups(2); allows forged gids on socket credentials passing. |
| 7 | 0x00000080 | SETUID | Allows set*uid(2) manipulation (including fsuid); allows forged pids on socket credentials passing. |
| 8 | 0x00000100 | SETPCAP | Transfer any capability in your permitted set to any pid, remove any capability in your permitted set from any pid |
| 9 | 0x00000200 | LINUX_IMMUTABLE | Allow modification of S_IMMUTABLE and S_APPEND file attributes |
| 10 | 0x00000400 | NET_BIND_SERVICE | Allows binding to TCP/UDP sockets below 1024; allows binding to ATM VCIs below 32 |
| 11 | 0x00000800 | NET_BROADCAST | Allow broadcasting, listen to multicast |
| 12 | 0x00001000 | NET_ADMIN | Allow interface configuration; allow administration of IP firewall, masquerading and accounting; allow setting debug option on sockets; allow modification of routing tables; allow setting arbitrary process / process group ownership on sockets; allow binding to any address for transparent proxying; allow setting TOS (type of service); allow setting promiscuous mode; allow clearing driver statistics; allow multicasting; allow read/write of device-specific registers; allow activation of ATM control sockets |
| 13 | 0x00002000 | NET_RAW | Allow use of RAW sockets; allow use of PACKET sockets |
| 14 | 0x00004000 | IPC_LOCK | Allow locking of shared memory segments; allow mlock and mlockall (which doesn't really have anything to do with IPC) |
| 15 | 0x00008000 | IPC_OWNER | Override IPC ownership checks |
| 16 | 0x00010000 | SYS_MODULE | Insert and remove kernel modules - modify kernel without limit; modify cap_bset |
| 17 | 0x00020000 | SYS_RAWIO | Allow ioperm/iopl access; allow sending USB messages to any device via /proc/bus/usb |
| 18 | 0x00040000 | SYS_CHROOT | Allow use of chroot() |
| 19 | 0x00080000 | SYS_PTRACE | Allow ptrace() of any process |
| 20 | 0x00100000 | SYS_PACCT | Allow configuration of process accounting |
| 21 | 0x00200000 | SYS_ADMIN | Allow configuration of the secure attention key; allow administration of the random device; allow examination and configuration of disk quotas; allow configuring the kernel's syslog (printk behaviour); allow setting the domainname; allow setting the hostname; allow calling bdflush(); allow mount() and umount(), setting up new smb connection; allow some autofs root ioctls; allow nfsservctl; allow VM86_REQUEST_IRQ; allow to read/write pci config on alpha; allow irix_prctl on mips (setstacksize); allow flushing all cache on m68k (sys_cacheflush); allow removing semaphores (used instead of CAP_CHOWN to "chown" IPC message queues, semaphores and shared memory); allow locking/unlocking of shared memory segment; allow turning swap on/off; allow forged pids on socket credentials passing; allow setting readahead and flushing buffers on block devices; allow setting geometry in floppy driver; allow turning DMA on/off in xd driver; allow administration of md devices (mostly the above, but some extra ioctls); allow tuning the ide driver; allow access to the nvram device; allow administration of apm_bios, serial and bttv (TV) device; allow manufacturer commands in isdn CAPI support driver; allow reading non-standardized portions of pci configuration space; allow DDI debug ioctl on sbpcd driver; allow setting up serial ports; allow sending raw qic-117 commands; allow enabling/disabling tagged queuing on SCSI controllers and sending arbitrary SCSI commands; allow setting encryption key on loopback filesystem; allow setting zone reclaim policy |
| 22 | 0x00400000 | SYS_BOOT | Allow use of reboot() |
| 23 | 0x00800000 | SYS_NICE | Allow raising priority and setting priority on other (different UID) processes; allow use of FIFO and round-robin (realtime) scheduling on own processes and setting the scheduling algorithm used by another process; allow setting cpu affinity on other processes |
| 24 | 0x01000000 | SYS_RESOURCE | Override resource limits, set resource limits; override quota limits; override reserved space on ext2 filesystem; modify data journaling mode on ext3 filesystem (uses journaling resources); NOTE: ext2 honors fsuid when checking for resource overrides, so you can override using fsuid too; override size restrictions on IPC message queues; allow more than 64hz interrupts from the real-time clock; override max number of consoles on console allocation; override max number of keymaps |
| 25 | 0x02000000 | SYS_TIME | Allow manipulation of system clock; allow irix_stime on mips; allow setting the real-time clock |
| 26 | 0x04000000 | SYS_TTY_CONFIG | Allow configuration of tty devices; allow vhangup() of tty |
| 27 | 0x08000000 | MKNOD | Allow the privileged aspects of mknod() |
| 28 | 0x10000000 | LEASE | Allow taking of leases on files |
| 29 | 0x20000000 | AUDIT_WRITE | Allow writing the audit log via unicast netlink socket |
| 30 | 0x40000000 | AUDIT_CONTROL | Allow configuration of audit via unicast netlink socket |
Setting flags and capabilities
If you are using util-vserver, see util-vserver:Capabilities and Flags
for how to set the flags and capabilities.