NSCREATE(1) User Commands NSCREATE(1)
NAME
       nscreate - run program in new namespaces
SYNOPSIS
       nscreate [options] [program [arguments]]
DESCRIPTION
       The nscreate command creates new namespaces (as specified by the command-line options described below) and then executes the specified program with arguments.

       nscreate provides two modes of operation. The default mode uses the clone(2) system call to create a child process that is placed in the new namespaces and which executes program. The other mode, employed when the --unshare option is specified, uses unshare(2) to create the new namespaces and then directly executes program.

       By default, a new namespace remains in existence only as long as it has at least one member process. A namespace can be made persistent (that is, pinned into existence even when it has no member processes) by bind mounting the corresponding /proc/PID/ns/ns-type file. nscreate provides command-line options to simplify the creation of such bind mounts. A persistent namespace can later be entered using nsenter(1), even after program has terminated. A persistent namespace can be unpinned by unmounting the bind mount.

       If program is not supplied, then the program identified by the SHELL environment variable is run; if SHELL is not defined, then /bin/sh is executed.

       The following types of namespaces can be created using nscreate:

       cgroup namespace
              Cgroup namespaces virtualize the view of cgroups seen in /proc/[pid]/cgroup and /proc/[pid]/mountinfo. For further details, see cgroup_namespaces(7) and cgroups(7).

       IPC namespace
              Processes within an IPC namespace have private instances of certain interprocess communication resources, namely System V IPC objects (message queues, semaphores, shared memory) and POSIX message queues. For further details, see ipc_namespaces(7).

       mount namespace
              Processes within a mount namespace share a set of mount points. Processes in different mount namespaces thus see distinct single directory hierarchies. Mounting and unmounting filesystems in one mount namespace will not affect processes in other mount namespaces, except where a mount point has shared propagation.
              For further details, see mount_namespaces(7), mount(2), mount(8), and the kernel source file Documentation/filesystems/sharedsubtree.txt.

       network namespace
              Processes in a network namespace share private instances of various networking resources, such as networking devices, IPv4 and IPv6 protocol stacks, routing tables, firewall rules, and socket port numbers. Thus, for example, each network namespace can have its own (virtual) network device with its own IP address, and each network namespace can have a web server running on port 80. For further details, see network_namespaces(7).

       PID namespace
              PID namespaces isolate the PID number space, meaning that the PIDs of processes within a PID namespace are private to that namespace. For further details, see pid_namespaces(7).

       time namespace
              Time namespaces virtualize the values of certain system clocks, namely the boot-time and monotonic clocks. Thus, the processes within a time namespace share the same values for these clocks, but the values of the clocks may be different in other time namespaces. For further details, see time_namespaces(7).

       user namespace
              User namespaces virtualize certain security-related identifiers and attributes, such as user IDs, group IDs, and capabilities. Practically speaking, this means that a process may have certain credentials (for example, UID and GID 0, and all capabilities, i.e., superuser powers) inside a user namespace, while at the same time having nonzero credentials and no capabilities outside that user namespace. For further details, see user_namespaces(7), capabilities(7), and credentials(7).

       UTS namespace
              Processes within a UTS namespace share a private instance of two system identifiers: the hostname and the NIS domain name. For further details, see uts_namespaces(7).
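The /proc/PID/ns/ns-type files mentioned above are symbolic links, so the namespaces a process belongs to can be inspected with ordinary tools. The following is a minimal sketch, not part of nscreate: the function name list_ns_links and its optional directory argument are assumptions made for illustration (the argument exists only so that the sketch can be tried on any directory of symbolic links).

```shell
#!/bin/sh
# list_ns_links [DIR]: print "name -> target" for each symbolic link in DIR.
# DIR defaults to /proc/self/ns (the namespace links of the calling process).
# Hypothetical helper for illustration; it is not provided by nscreate.
list_ns_links() {
    dir=${1:-/proc/self/ns}
    for link in "$dir"/*; do
        # Skip anything that is not a symbolic link (including an
        # unmatched glob pattern when DIR is empty or missing).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
    done
}
```

Running list_ns_links with no argument on a Linux system prints one line per namespace type; two processes are in the same namespace of a given type when the corresponding link targets are identical.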
OPTIONS
   Options for creating namespaces
       The following options can be used to create new namespaces. The short-form options take no argument. The long-form options take an optional argument, which is the pathname of an existing file that will be used as the target when creating a bind mount in order to make the namespace persistent.

       -c, --cgroup[=pathname]
              Create a new cgroup namespace. If pathname is specified, then the namespace is made persistent by bind mounting the corresponding /proc/PID/ns/cgroup magic link on the regular file specified by pathname.

       -i, --ipc[=pathname]
              Create a new IPC namespace. If pathname is specified, then the namespace is made persistent by bind mounting the corresponding /proc/PID/ns/ipc magic link on the regular file specified by pathname.

       -m, --mount[=pathname]
              Create a new mount namespace. If pathname is specified, then the namespace is made persistent by bind mounting the corresponding /proc/PID/ns/mnt magic link on the regular file specified by pathname.

              Note that creating this bind mount will fail if the propagation type of the parent mount of pathname is shared. (The kernel disallows creation of the bind mount in this scenario because propagation of the mount point might lead to a circular dependency that would mean that the mount namespace could never be freed.) See EXAMPLES for an example of how to ensure that the parent mount does not have shared propagation.

       -p, --pid[=pathname]
              Create a new PID namespace. If pathname is specified, then the namespace is made persistent by bind mounting the corresponding /proc/PID/ns/pid_for_children magic link on the regular file specified by pathname. Note, however, that even if a PID namespace is made persistent, it will no longer be usable (e.g., it can't be entered with nsenter(1)) if its init process has terminated.

              If the --unshare option is also employed, then the --fork option must additionally be employed in order to create the bind mount.

       -n, --net[=pathname]
              Create a new network namespace. If pathname is specified, then the namespace is made persistent by bind mounting the corresponding /proc/PID/ns/net magic link on the regular file specified by pathname.

       -t, --time[=pathname]
              Create a new time namespace. If pathname is specified, then the namespace is made persistent by bind mounting the corresponding /proc/PID/ns/time_for_children magic link on the regular file specified by pathname.

              In order to create a new time namespace, the --unshare option must also be specified (or an error results). Typically, you will also want to specify the --fork option, so that program is run in a process in the new namespace; without the --fork option, only the child processes created by program will reside in the new namespace.

              See also --boottime and --monotonic.

       -u, --uts[=pathname]
              Create a new UTS namespace. If pathname is specified, then the namespace is made persistent by bind mounting the corresponding /proc/PID/ns/uts magic link on the regular file specified by pathname.

       -U, --user[=pathname]
              Create a new user namespace. If pathname is specified, then the namespace is made persistent by bind mounting the corresponding /proc/PID/ns/user magic link on the regular file specified by pathname.

   Other options
       -r, --map-root-user
              When creating a new user namespace, create the so-called root credential mappings: the user's UID and GID (i.e., the effective UID and GID under which nscreate is being run) are mapped to 0 (root) inside the new user namespace, before program is executed. This means that the process that executes program will maintain root privileges (i.e., all capabilities) in the user namespace. (Without this option, the process's capabilities will be cleared during execve(2), as described in capabilities(7).)

              This option can be employed only when creating a user namespace (--user).

       --uid-map=map
              When creating a new user namespace, this option can be used (subject to permissions rules described in user_namespaces(7)) to define an arbitrary UID map for the new namespace. The map string consists of a series of numeric three-tuples of the form:

                  <ID-inside-NS> <ID-outside-NS> <length>

              The tuples must be separated either by newline characters or by commas (which are replaced by newline characters before the strings are written to the map files). For a description of the meaning of the three numbers in each tuple, see user_namespaces(7); see also EXAMPLES, below.

              This option can be employed only when creating a user namespace (--user).

       --gid-map=map
              When creating a new user namespace, this option can be used (subject to permissions rules) to define an arbitrary GID map for the new namespace. The syntax of map is as for --uid-map.

              This option can be employed only when creating a user namespace (--user).

       --boottime
              When creating a new time namespace, this option can be used to specify the offset of the boot-time (CLOCK_BOOTTIME) clock, in seconds.

              This option can be specified only when creating a time namespace (--time).

       --monotonic
              When creating a new time namespace, this option can be used to specify the offset of the monotonic (CLOCK_MONOTONIC) clock, in seconds.

              This option can be specified only when creating a time namespace (--time).

       --no-deny-setgroups
              By default, when creating a user namespace, execution of the setgroups(2) system call is disabled by writing the string "deny" to the /proc/PID/setgroups file of a process inside the namespace. (For details of the reasons why, see user_namespaces(7).) This option can be used to disable the step of modifying /proc/PID/setgroups in this way. If you are not superuser (more precisely, you do not have the CAP_SETGID capability), then updating the GID map for the user namespace is likely to fail (and thus nscreate itself will fail).

              This option can be employed only when creating a user namespace (--user).

       --unshare
              By default, nscreate performs its task by using clone(2) to create the requested namespaces and create a child that executes program in those namespaces. If the --unshare option is specified, then the new namespaces are instead created using unshare(2), and program is executed directly (so that it replaces the nscreate program).

              Uses of the --unshare option include the following:

              · This option must be used when creating time namespaces, since current kernels don't support the creation of time namespaces using clone().

              · This option can be useful in commands of the following form, where the shell itself is ultimately replaced by program:

                    $ exec nscreate -Ur --unshare program

              Note that when using the --unshare option, the only mappings that can be defined using --uid-map and --gid-map are mappings that map just the user's UID and GID (i.e., ID-outside-NS must be the user's own UID/GID and length must be 1).

       -f, --fork
              Execute program in a child process created using fork(2). This option can be employed only in conjunction with --unshare or --pid. Using this option when creating a PID namespace (--pid) ensures that program is executed as PID 1 in the new PID namespace. This option is likewise useful when creating a time namespace (--time) in order to ensure that program is run in the new time namespace.

       --propagation=type
              This option is provided as a convenience for setting the propagation type of all the mount points in a new mount namespace. type is one of the following:

              private
                     Give all mount points "private" propagation. This is the default, so that mount and unmount operations do not have unintended side effects in the previous mount namespace.

              shared Give all mount points "shared" propagation.

              slave  Give all mount points "slave" propagation.

              unchanged
                     Do not change the propagation type of mounts in the new mount namespace; mount points preserve the propagation that they had in the namespace from which they were inherited.

              Thus, for example, --propagation=private has the same effect as issuing the following shell command in a new mount namespace:

                  # mount --make-rprivate /

              For further information on mount propagation, see mount_namespaces(7), mount(2), mount(8), and the kernel source file Documentation/filesystems/sharedsubtree.txt.

              The --propagation option can be employed only when creating a new mount namespace (--mount). If both --propagation and --mount-proc are specified, --propagation is actioned first.

       --mount-proc
              Mount a proc(5) filesystem at /proc. This option is provided as a convenience when creating a new PID namespace (--pid): the provision of a new /proc mount ensures that tools such as ps(1) and top(1) work correctly inside the new PID namespace.

              The new mount is created by first setting the propagation type of the existing /proc mount to private and then stacking a new proc(5) mount at /proc. Since the parent of the stacked mount is the previous /proc mount (which now has private propagation), the stacked mount will not propagate to other mount namespaces.

              This option can be employed only in conjunction with the use of the --mount/-m option to create a new mount namespace.

       --child-exit-sig[=sig]
              If nscreate terminates, then send the specified signal to the child process that is executing program. The signal may be specified by name (e.g., quit) or number (e.g., 3). If sig is omitted, the default is kill (SIGKILL).

              This option is mainly useful when creating a PID namespace, in order to ensure that all processes in the namespace are terminated if nscreate itself terminates.
              This happens because the child process running program has PID 1 (i.e., it is the init process in the PID namespace) and if PID 1 in a PID namespace terminates, then the kernel terminates all other processes in the namespace. Note that aside from SIGKILL and SIGSTOP, the only signals that can be sent to the init process of a PID namespace are those signals for which the process has established a handler; see pid_namespaces(7).

              This option can be employed only in the default "clone" mode or in combination with both --unshare and --fork.

       --no-new-privs
              Set the no_new_privs process attribute (see prctl(2)). This prevents the process from changing credentials or gaining capabilities if program is a set-user-ID or set-group-ID program, or has file capabilities attached.

       -h, --help
              Display help text and exit.

   Repeatable options
       The following options are performed, in the order given on the command line, after all other options are actioned, just before program is executed. Some of these options can (meaningfully) be repeated.

       --make-caps-inheritable
              Copy the process's permitted capabilities to the inheritable set. This allows a process with nonzero UIDs to preserve those capabilities when executing a new program, so long as the program being executed also has inheritable capabilities.

              This option is useful primarily when creating a user namespace.

       --make-caps-ambient
              Copy the process's permitted capabilities to the inheritable and ambient sets. (Copying to the inheritable set is necessary because a capability can't be made ambient without first being made inheritable.) This allows a process with nonzero UIDs to preserve capabilities when executing a new program.

              This option is useful primarily when creating a user namespace.

       --setuid={uid|ruid,euid,suid}
              Set the process user IDs. If just one user ID is specified, set the real, effective, and saved set user IDs to the specified (numeric) value.
              If three (numeric) UIDs are specified, then use those values to set, respectively, the process's real, effective, and saved set user IDs; in this case, -1 can be specified as a value, meaning leave the corresponding UID unchanged. The UID(s) are interpreted relative to the UID map of the user namespace.

              This option is useful primarily when creating a new user namespace.

       --setgid={gid|rgid,egid,sgid}
              Set the process group IDs. If just one group ID is specified, set the real, effective, and saved set group IDs to the specified (numeric) value. If three (numeric) GIDs are specified, then use those values to set, respectively, the process's real, effective, and saved set group IDs. The GID(s) are interpreted relative to the GID map of the user namespace.

              This option is useful primarily when creating a new user namespace.

       --clear-groups
              Clear the supplementary group list. If the --no-deny-setgroups option is not also specified, an error results. An error likewise results if nscreate is not run as superuser (or, more precisely, with the CAP_SETGID capability).

              This option is useful when creating a user namespace in order to ensure that the process running program does not inherit a set of supplementary groups that do not exist in the GID map.

       --secbits=[+-]flag[,...]
              Set the process securebits flags. The option value is either 0, meaning clear all securebits flags (if possible), or a comma-separated list of flag names, optionally preceded by a plus ('+') or minus ('-') sign. If a plus sign is specified, then the specified flags are enabled, while the remaining flags are left unchanged. If a minus sign is specified, then the specified flags are disabled, while the remaining flags are left unchanged. If neither a plus nor a minus sign is specified, then the specified flags are set and the remaining flags are cleared.

              The flags, which can be specified in either long or abbreviated form, are as follows:

              keep_caps / kc
                     Set the SECBIT_KEEP_CAPS flag, so that a process with one or more zero user IDs does not lose permitted capabilities when it makes all of its user IDs nonzero. This flag is automatically cleared when a new program is executed. Thus, even if it is set, this flag will be cleared when program is executed. This flag provides a subset of the functionality of the no_setuid_fixup flag, and is ignored by the kernel if that flag is also set.

              keep_caps_locked / kcl
                     Set the SECBIT_KEEP_CAPS_LOCKED flag. Note: setting this flag does not prevent the SECBIT_KEEP_CAPS flag from being cleared when a new program is executed.

              no_setuid_fixup / nsf
                     Set the SECBIT_NO_SETUID_FIXUP flag, so that switching the process's user IDs between zero and nonzero values does not cause any changes to the process's capabilities.

              no_setuid_fixup_locked / nsfl
                     Set the SECBIT_NO_SETUID_FIXUP_LOCKED flag.

              noroot / nr
                     Set the SECBIT_NOROOT flag, so that the kernel does not grant any capabilities to the process if it executes a set-user-ID-root program or if the process executes a program while having a real or effective user ID of zero.

              noroot_locked / nrl
                     Set the SECBIT_NOROOT_LOCKED flag.

              no_cap_ambient_raise / ncar
                     Set the SECBIT_NO_CAP_AMBIENT_RAISE flag, so that the kernel no longer permits the process to raise capabilities in its ambient set.

              no_cap_ambient_raise_locked / ncarl
                     Set the SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED flag.

              As can be seen from the above list, the flags are organized in pairs: a "base" flag, and a corresponding "locked" flag. Setting a "locked" flag makes the corresponding "base" flag immutable, and once set, a "locked" flag can't be cleared.

              In order to modify the securebits settings, nscreate must be run as superuser (or, more precisely, with the CAP_SETPCAP capability).

              For further details on the securebits feature, see capabilities(7).

       --set-caps=cap-spec
              Set process permitted, effective, and inheritable capabilities. cap-spec is as per the argument of cap_from_text(3).

       --adj-caps=spec
              Adjust the process capability sets. The argument has one of the following forms:

                  <flags><op>all
                  <flags><op>[~]<cap>,...

              flags specifies the capability sets to be modified and is a list of one or more of the following letters: p (permitted); e (effective); i (inheritable); a (ambient); and b (bounding).

              op is either '+' or '-', indicating that the specified capabilities are to be added to or removed from the sets specified by flags.

              Following op there may be either the word all, meaning all capabilities, or a list of capabilities optionally preceded by a tilde (~). In the latter form, capabilities can be specified either symbolically (e.g., cap_kill) or as numbers. The optional tilde causes the set to be inverted; thus, ~cap_sys_admin means all capabilities except CAP_SYS_ADMIN.

              The capability sets are modified as individual operations, in the order given in flags. Note that there are many rules governing the changes that can be made to capability sets; see capabilities(7) for details.

       --dump[=opts]
              Dump the current state of security-related information for the process. This may be useful when debugging your understanding of the effect of various nscreate command-line options. The optional argument, opts, is a comma-separated list of one or more of the following:

              eids   Dump process effective user ID and effective group ID. This option is ignored if creds is also specified.

              creds  Dump all (i.e., real, effective, saved set) user and group IDs.

              groups Dump the process supplementary group list.

              caps   Dump process permitted, effective, and inheritable capabilities in the format produced by cap_to_text(3).

              secbits
                     Dump the process securebits flags.

              If opts is omitted, the default is eids,caps.

       --wait=num-secs
              Pause execution for num-secs seconds.
              This allows you to perform other actions (e.g., inspecting the state of the process from another terminal) during the pause. As with --dump, this option may be useful when debugging your understanding of the effect of various nscreate command-line options.
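The map string format described under --uid-map (three-number tuples separated by commas or newlines, with commas replaced by newlines) can be sketched in shell as follows. This is only an illustration of the documented format, not nscreate's own code: the function name normalize_id_map is hypothetical, and the choice to reject any tuple that is not exactly three non-negative integers is an assumption.

```shell
#!/bin/sh
# normalize_id_map MAP: print MAP with one "inside outside length" tuple per
# line, as it would appear after commas are replaced by newlines.
# Returns 1 (printing nothing) if any tuple is not three non-negative integers.
# Hypothetical helper for illustration; validation rules are an assumption.
normalize_id_map() {
    printf '%s\n' "$1" | tr ',' '\n' | awk '
        NF != 3 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            bad = 1; exit               # jump to END with the error flag set
        }
        { out = out $1 " " $2 " " $3 "\n" }   # buffer each valid tuple
        END { if (bad) exit 1; printf "%s", out }
    '
}
```

For example, normalize_id_map "0 1000 10, 10 2000 10" prints the two tuples "0 1000 10" and "10 2000 10" on separate lines, which is the form used for the map files described in user_namespaces(7).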
EXIT STATUS
       If program is successfully executed, then the exit status of nscreate is the exit status of program. If program could not be executed (e.g., because a child process could not be created, or program could not be found), then nscreate exits with the status 1.
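One consequence of this convention is that a status of 1 is ambiguous: it may come from program itself or from a failure to execute it. The sketch below mimics the convention with a hypothetical stand-in function, run_like_nscreate, which creates no namespaces and is not part of nscreate; it only shows how the statuses would look to a calling script.

```shell
#!/bin/sh
# run_like_nscreate PROGRAM [ARGS...]: hypothetical stand-in that follows the
# exit-status convention described above, without creating any namespaces.
run_like_nscreate() {
    # Status 1 if PROGRAM cannot be found or executed.
    command -v "$1" >/dev/null 2>&1 || return 1
    # Otherwise the exit status of PROGRAM becomes our exit status.
    "$@"
}

run_like_nscreate sh -c 'exit 3'    || echo "status: $?"   # status: 3
run_like_nscreate /nonexistent/prog || echo "status: $?"   # status: 1
run_like_nscreate sh -c 'exit 1'    || echo "status: $?"   # status: 1
```

A script that needs to tell the last two cases apart would have to rely on something other than the exit status, such as the diagnostic output.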
NOTES
   Command-line interface design philosophy
       nscreate shares much functionality with unshare(1) while at the same time adding various functionality that is absent from the latter command. Aside from the functional differences, there are also some differences in design philosophy, including the following:

       · The use of single-letter options is generally reserved for frequently used options that either do not take arguments, or take arguments that are short. This design choice reflects the fact that there is a limited set of possible single-letter options, and therefore the option letters should be consumed only in cases where it is obviously useful to do so. This conservative approach leaves single-letter options available for options that may be added in the future. By contrast, the liberal use of single-letter options in unshare(1) has led to some common options using surprising single-letter options because the more obvious letter choice had already been used by a previously added option that is in some cases rarely used.

       · "Minor" options generally do not imply "major" options. For example (and unlike unshare(1)), the -r option does not imply the -U option. The rationale here is that requiring the use of the necessary major option leads to commands and scripts that are more readable, and less prone to surprising errors. (For example, it is quite easy to overlook that unshare -r creates a user namespace, especially if the -r is interspersed with other options.)
EXAMPLES
       In the following shell session, nscreate is used to run a shell in a new user namespace (with root credential mappings) and a new UTS namespace. In the new user namespace, the shell has a full set of permitted and effective capabilities (=ep), which allows the user to change the hostname in the new UTS namespace.

           $ hostname         # Show hostname in initial UTS namespace
           bienne
           $ nscreate -Uur bash --norc
           # getpcaps $$
           Capabilities for `24893' =ep
           # hostname orinoco
           # hostname
           orinoco

       The following command is equivalent to nscreate -Ur:

           $ nscreate --user --uid-map="0 $(id -u) 1" --gid-map="0 $(id -g) 1"

   Creating a persistent mount namespace
       The following shell session demonstrates the creation of a mount namespace that is made persistent via a bind mount on the pathname /mnt/ns/mp. We first observe that initially the nearest ancestor mount point is the root mount (/), which has shared propagation. As noted above, this would lead to a failure when creating the bind mount on /mnt/ns/mp. To avoid this, we create an intermediate bind mount point whose propagation is made private. We then create the mount point pathname and use nscreate to create a persistent mount namespace.

           # findmnt -n -o target,propagation --target /mnt/ns
           /      shared
           # mount --bind /mnt /mnt
           # mount --make-private /mnt
           # touch /mnt/ns/mp
           # nscreate --mount=/mnt/ns/mp sh

   Creating a time namespace
       The following example demonstrates the creation of a time namespace where the boot-time clock is set to a point several years in the past:

           # uptime -p        # Show uptime in initial time namespace
           up 30 minutes
           # nscreate --unshare --fork --time --boottime=200000000 uptime -p
           up 6 years, 18 weeks, 4 days, 20 hours, 4 minutes

   Creating a PID namespace with a properly mounted /proc filesystem
       In the following example, a shell is executed in new user, PID, and mount namespaces.
       The --mount-proc option ensures a correctly mounted /proc filesystem for the new PID namespace, so that ps(1) shows correct output for the PID namespace.

           # PS1='ns2# ' nscreate -Urpm --mount-proc bash --norc
           ns2# ps a
             PID TTY      STAT   TIME COMMAND
               1 pts/4    S      0:00 bash --norc
               2 pts/4    R+     0:00 ps a

   Creating arbitrary UID and GID maps in a user namespace
       In the following commands, we take a local copy of the nscreate executable and then assign suitable capabilities to that copy so that an unprivileged user can create a user namespace with arbitrary UID and GID maps.

           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ id -u; id -g
           1000
           1000
           $ ./nscreate -U --uid-map="0 1000 10, 10 2000 10" \
                   --gid-map="0 1000 10" bash --norc
           # id -u; id -g
           0
           0
           # cat /proc/self/uid_map
           0 1000 10
           10 2000 10

   Demonstration of the effect of UID transitions on capabilities
       In the following shell session, which is executed as an unprivileged user (UID 1000), we make a local copy of the nscreate executable and assign it the necessary capabilities that allow an unprivileged user to create a user namespace with UID and GID maps that map a range of IDs. We then twice use the local nscreate executable to create a user namespace that maps UIDs (and GIDs) 1000 to 1009 to 0 to 9 inside the namespace. Within the user namespace, we run the getpcaps(8) program, asking it to show its own capabilities.

           $ id -u
           1000
           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                   getpcaps 0
           Capabilities for `0': =ep
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                   --setuid 1 getpcaps 0
           Capabilities for `0': =

       Looking at the results of the two executions of nscreate, we see that in the first case getpcaps(8) shows that it has all permitted and effective capabilities.
       In the second case, we used the --setuid option to switch the process's UIDs from 0 to 1 inside the user namespace, and as a result getpcaps(8) shows that it has no capabilities (because switching UIDs from 0 to nonzero causes the process's permitted and effective capabilities to be cleared).

   Capabilities, UID transitions, and securebits
       The kernel automatically assigns the first process in a new user namespace a full set of permitted and effective capabilities. However, if the process changes its user IDs to nonzero values (inside the user namespace), it loses all capabilities (see capabilities(7)).

       We can illustrate this by once again making a local copy of the nscreate program and assigning capabilities to the copy which allow the creation of a user namespace with arbitrary UID and GID mappings. Initially, the process has UID 0 inside the user namespace, but we then use the --setuid option to switch the process's user IDs to a nonzero value. The --dump option shows us that the process has consequently lost all capabilities:

           $ id -u
           1000
           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                   --setuid 1 --dump /bin/true
           eUID = 1; eGID = 0
           capabilities: =

       A process can enable the SECBIT_NO_SETUID_FIXUP securebits flag so that its capabilities do not change when its user IDs transition between 0 and nonzero values. We can illustrate this with the addition of the --secbits option to the preceding command:

           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                   --secbits=no_setuid_fixup \
                   --setuid 1 --dump /bin/true
           eUID = 1; eGID = 0
           capabilities: =ep

   Ambient capabilities
       When a process that has nonzero UIDs executes a new program, its capabilities are transformed according to the rules described in capabilities(7). In particular, if the process has capabilities, then in the usual case it will lose those capabilities during the execve(2).

       Using a local copy of nscreate with capabilities attached, we can see this behavior. We create a user namespace that allows a range of UIDs. The process in the user namespace switches its UID from 0 to 1, after first setting the SECBIT_NO_SETUID_FIXUP securebits flag, so that the UID transition does not cause the process to lose capabilities. The --dump option allows us to verify that the process still has capabilities before it executes getpcaps(8). However, after the execve(2), getpcaps(8) shows us that the process no longer has any capabilities.

           $ id -u
           1000
           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                   --secbits=no_setuid_fixup --setuid 1 --dump getpcaps 0
           eUID = 1; eGID = 0
           capabilities: =ep
           Capabilities for `0': =

       The ambient capability set can be used to prevent a process losing its capabilities in the above scenario. Before executing a new program, the process can place any of its permitted capabilities into the ambient set, and those capabilities will (as per the transformation rules described in capabilities(7)) be assigned to the process's permitted and effective capabilities after the execve(2). Adding the --make-caps-ambient option, which copies the process's existing permitted capabilities to the inheritable and ambient sets, can be used to demonstrate this behavior:

           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                   --make-caps-ambient \
                   --secbits=no_setuid_fixup --setuid 1 --dump getpcaps 0
           eUID = 1; eGID = 0
           capabilities: =eip
           Capabilities for `0': =eip

       The last line, output by getpcaps(8), shows us that the process has all capabilities in its permitted, effective, and inheritable sets.

   Capabilities, exec, and securebits
       As noted above, the kernel automatically assigns the first process in a new user namespace a full set of permitted and effective capabilities.
       Suppose that the process then drops capabilities, while retaining user ID 0 (inside the namespace). If the process subsequently executes a new program, it once more regains all capabilities because it did an execve(2) with UID 0 (see capabilities(7)). The following command illustrates this behavior:

           $ id -u
           1000
           $ nscreate -Ur --set-caps = --dump getpcaps 0
           eUID = 0; eGID = 0
           capabilities: =
           Capabilities for `0': =ep

       The --dump option shows us that just before getpcaps(8) was executed, the process had no capabilities. The last line of output, produced by getpcaps(8), shows us that the process once again has all capabilities.

       A process with user ID 0 can enable the SECBIT_NOROOT securebits flag so that its capabilities do not change when it executes a new program. To illustrate this, compare the output of the previous command with the output of the following command:

           $ nscreate -Ur --secbits=noroot --set-caps = --dump getpcaps 0
           eUID = 0; eGID = 0
           capabilities: =
           Capabilities for `0': =

   User namespace "set-user-ID-root" programs
       Here, we demonstrate the operation of a per-user-namespace "set-user-ID-root" program (see capabilities(7)). Again, we use a local copy of the nscreate executable that has capabilities assigned. We then (as unprivileged user ID 1000) make a local copy of the getpcaps(8) program and make it a set-user-ID program (i.e., the program will cause the effective user ID of the executing process to switch to 1000). Finally, we use the local nscreate executable to create a user namespace that maps UIDs (and GIDs) 1000 to 1009 to 0 to 9 inside the namespace, switch user ID to 1 inside the user namespace (which causes the process to lose all capabilities), and then execute the local copy of getpcaps(8).

           $ id -u
           1000
           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ getcap nscreate  # Verify capabilities of local nscreate
           nscreate = cap_setgid,cap_setuid+ep
           $ cp $(which getpcaps) .
           $ chmod u+s getpcaps
           $ ls -ln getpcaps  # Verify ownership and permissions
           -rwsr-xr-x. 1 1000 1000 15992 Jun 10 08:24 getpcaps
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                   --setuid 1 ./getpcaps 0
           Capabilities for `0': =ep

       Looking at the above output, we see that the process running getpcaps has all capabilities (inside the user namespace). This happened because the set-user-ID getpcaps executable caused the process's user ID to switch to 1000, which is equivalent to 0 inside the user namespace, and as a result the process gained all permitted and effective capabilities.
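The statement that user ID 1000 "is equivalent to 0 inside the user namespace" follows from the arithmetic of the map tuples: an outside ID that falls in the range starting at ID-outside-NS and spanning length IDs corresponds to ID-inside-NS plus its offset into that range. The following sketch works through that arithmetic. The function name outside_to_inside is hypothetical and is not an nscreate feature; the kernel performs the real translation as described in user_namespaces(7).

```shell
#!/bin/sh
# outside_to_inside MAP ID: print the ID inside the namespace corresponding to
# ID outside it, where MAP holds "inside outside length" tuples separated by
# commas or newlines. Returns 1 if no tuple covers ID.
# Hypothetical helper for illustration only.
outside_to_inside() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -v id="$2" '
        # A tuple covers id when outside <= id < outside + length.
        NF == 3 && id >= $2 && id < $2 + $3 {
            printf "%d\n", $1 + (id - $2)
            found = 1
            exit
        }
        END { exit (found ? 0 : 1) }
    '
}

outside_to_inside '0 1000 10' 1000              # prints 0
outside_to_inside '0 1000 10, 10 2000 10' 2003  # prints 13
```

With the map '0 1000 10' used in the session above, outside UID 1000 translates to 0, which is why the set-user-ID getpcaps executable owned by UID 1000 behaves as a set-user-ID-root program within the namespace.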
AUTHORS
       Michael Kerrisk ⟨mtk.manpages@gmail.com⟩
SEE ALSO
       findmnt(1), nsenter(1), setpriv(1), unshare(1), clone(2), unshare(2), namespaces(7), mount(8)
COLOPHON
This page is part of the util-linux (a random collection of Linux
utilities) project. Information about the project can be found at
⟨https://www.kernel.org/pub/linux/utils/util-linux/⟩. If you have a
bug report for this manual page, send it to
util-linux@vger.kernel.org. This page was obtained from the
project's upstream Git repository
⟨git://git.kernel.org/pub/scm/utils/util-linux/util-linux.git⟩ on
2020-07-14. (At that time, the date of the most recent commit that
was found in the repository was 2020-07-14.) If you discover any
rendering problems in this HTML version of the page, or you believe
there is a better or more up-to-date source for the page, or you have
corrections or improvements to the information in this COLOPHON
(which is not part of the original manual page), send a mail to
man-pages@man7.org
secisol-tools 2020-07-01 NSCREATE(1)