NSCREATE(1) User Commands NSCREATE(1)
NAME
nscreate - run program in new namespaces
SYNOPSIS
nscreate [options] [program [arguments]]
DESCRIPTION
The nscreate command creates new namespaces (as specified by the
command-line options described below) and then executes the specified
program with arguments.
nscreate provides two modes of operation. The default mode uses the
clone(2) system call to create a child process that is placed in the
new namespaces and which executes program. The other mode, employed
when the --unshare option is specified, uses unshare(2) to create the
new namespaces and then directly executes program.
By default, a new namespace remains in existence only as long as it
has at least one member process. A namespace can be made persistent—
that is, pinned into existence even when it has no member processes—
by bind mounting the corresponding /proc/PID/ns/ns-type file.
nscreate provides command-line options to simplify the creation of
such bind mounts. A persistent namespace can later be entered using
nsenter(1), even after program has terminated. A persistent
namespace can be unpinned by unmounting the bind mount.
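As a rough sketch of the manual procedure that these options automate, the following helper functions print (but do not run) the commands that would pin and unpin a namespace. The function names and the example pathname are hypothetical, and the real commands would need to be run with sufficient privilege.

```shell
# pin_ns_cmds PID NS_TYPE TARGET: print (do not run) the commands that
# would pin the NS_TYPE namespace of process PID onto the file TARGET.
pin_ns_cmds() {
    pid=$1 ns_type=$2 target=$3
    # The bind-mount target must be an existing file.
    printf 'touch %s\n' "$target"
    printf 'mount --bind /proc/%s/ns/%s %s\n' "$pid" "$ns_type" "$target"
}

# unpin_ns_cmds TARGET: print the command that would unpin the namespace.
unpin_ns_cmds() {
    printf 'umount %s\n' "$1"
}
```

For example, pin_ns_cmds 1234 uts /run/ns-uts prints a touch command followed by a mount --bind command for /proc/1234/ns/uts.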
If program is not supplied, then the program identified by the SHELL
environment variable is run; if SHELL is not defined, then /bin/sh is
executed.
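The fallback rule can be approximated in shell as follows. This is only a sketch: the function name is hypothetical, and it assumes that an empty (but set) SHELL counts as defined, which the text above does not specify.

```shell
# default_program: print the program that would be run when none is given.
# ${SHELL-/bin/sh} falls back only when SHELL is unset (not merely empty).
default_program() {
    printf '%s\n' "${SHELL-/bin/sh}"
}
```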
The following types of namespaces can be created using nscreate:
cgroup namespace
Cgroup namespaces virtualize the view of cgroups seen in
/proc/[pid]/cgroup and /proc/[pid]/mountinfo. For further
details, see cgroup_namespaces(7) and cgroups(7).
IPC namespace
Processes within an IPC namespace have private instances of
certain interprocess communication resources, namely System V
IPC objects (message queues, semaphores, shared memory) and
POSIX message queues. For further details, see
ipc_namespaces(7).
mount namespace
Processes within a mount namespace share a set of mount
points. Processes in different mount namespaces thus see
distinct single directory hierarchies. Mounting and
unmounting filesystems in one mount namespace will not affect
processes in other mount namespaces, except where a mount
point has shared propagation. For further details, see
mount_namespaces(7), mount(2), mount(8), and the kernel source
file Documentation/filesystems/sharedsubtree.txt.
network namespace
Processes in a network namespace share private instances of
various networking resources, such as networking devices, IPv4
and IPv6 protocol stacks, routing tables, firewall rules, and
socket port numbers. Thus, for example, each network
namespace can have its own (virtual) network device with its
own IP address, and each network namespace can have a web
server running on port 80. For further details, see
network_namespaces(7).
PID namespace
PID namespaces isolate the PID number space, meaning that the
PIDs of processes within a PID namespace are private to that
namespace. For further details, see pid_namespaces(7).
time namespace
Time namespaces virtualize the values of certain system
clocks, namely the boot-time and monotonic clocks. Thus, the
processes within a time namespace share the same values for
these clocks, but the values of the clocks may be different in
other time namespaces. For further details, see
time_namespaces(7).
user namespace
User namespaces virtualize certain security-related
identifiers and attributes, such as user IDs, group IDs, and
capabilities. Practically speaking, this means that a process
may have certain credentials—for example, UID and GID 0, and
all capabilities (i.e., superuser powers)—inside a user
namespace, while at the same time having nonzero credentials
and no capabilities outside that user namespace. For further
details, see user_namespaces(7), capabilities(7), and
credentials(7).
UTS namespace
Processes within a UTS namespace share a private instance of
two system identifiers: the hostname and the NIS domain name.
For further details, see uts_namespaces(7).
OPTIONS
Options for creating namespaces
The following options can be used to create new namespaces. The
short-form options take no argument. The long-form options take an
optional argument, which is the pathname of an existing file that
will be used as the target when creating a bind mount in order to
make the namespace persistent.
-c, --cgroup[=pathname]
Create a new cgroup namespace. If pathname is specified, then
the namespace is made persistent by bind mounting the
corresponding /proc/PID/ns/cgroup magic link on the regular
file specified by pathname.
-i, --ipc[=pathname]
Create a new IPC namespace. If pathname is specified, then
the namespace is made persistent by bind mounting the
corresponding /proc/PID/ns/ipc magic link on the regular file
specified by pathname.
-m, --mount[=pathname]
Create a new mount namespace. If pathname is specified, then
the namespace is made persistent by bind mounting the
corresponding /proc/PID/ns/mnt magic link on the regular file
specified by pathname. Note that creating this bind mount
will fail if the propagation type of the parent mount of
pathname is shared. (The kernel disallows creation of the
bind mount in this scenario because propagation of the mount
point might lead to a circular dependency that would mean that
the mount namespace could never be freed.) See EXAMPLES for
an example of how to ensure that the parent mount does not
have shared propagation.
-p, --pid[=pathname]
Create a new PID namespace. If pathname is specified, then
the namespace is made persistent by bind mounting the
corresponding /proc/PID/ns/pid_for_children magic link on the
regular file specified by pathname. Note, however, that even
if a PID namespace is made persistent, it will no longer be
usable (e.g., it can't be entered with nsenter(1)) if its init
process has terminated.
If the --unshare option is also employed, then the --fork
option must additionally be employed in order to create the
bind mount.
-n, --net[=pathname]
Create a new network namespace. If pathname is specified,
then the namespace is made persistent by bind mounting the
corresponding /proc/PID/ns/net magic link on the regular file
specified by pathname.
-t, --time[=pathname]
Create a new time namespace. If pathname is specified, then
the namespace is made persistent by bind mounting the
corresponding /proc/PID/ns/time_for_children magic link on the
regular file specified by pathname.
In order to create a new time namespace, the --unshare option
must also be specified (or an error results). Typically, you
will also want to specify the --fork option, so that program
is run in a process in the new namespace; without the --fork
option, only the child processes created by program will
reside in the new namespace.
See also --boottime and --monotonic.
-u, --uts[=pathname]
Create a new UTS namespace. If pathname is specified, then
the namespace is made persistent by bind mounting the
corresponding /proc/PID/ns/uts magic link on the regular file
specified by pathname.
-U, --user[=pathname]
Create a new user namespace. If pathname is specified, then
the namespace is made persistent by bind mounting the
corresponding /proc/PID/ns/user magic link on the regular file
specified by pathname.
Other options
-r, --map-root-user
When creating a new user namespace, create the so-called root
credential mappings: the user's UID and GID (i.e., the
effective UID and GID under which nscreate is being run) are
mapped to 0 (root) inside the new user namespace, before
program is executed. This means that the process that
executes program will maintain root privileges (i.e., all
capabilities) in the user namespace. (Without this option,
the process's capabilities will be cleared during execve(2),
as described in capabilities(7).)
This option can be employed only when creating a user
namespace (--user).
--uid-map=map
When creating a new user namespace, this option can be used
(subject to permissions rules described in user_namespaces(7))
to define an arbitrary UID map for the new namespace.
The map string consists of a series of numeric three-tuples of
the form:
<ID-inside-NS> <ID-outside-NS> <length>
The tuples must be separated either by newline characters or
by commas (which are replaced by newline characters before the
strings are written to the map files). For a description of
the meaning of the three numbers in each tuple, see
user_namespaces(7); see also EXAMPLES, below.
This option can be employed only when creating a user names‐
pace (--user).
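The comma-to-newline conversion described above can be sketched in shell as follows. The function name is hypothetical, and the trimming of surrounding whitespace is an assumption made for readability rather than documented behavior.

```shell
# normalize_map MAP: convert a comma-separated map string into the
# one-tuple-per-line form that is written to a map file.
normalize_map() {
    printf '%s\n' "$1" |
        tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//'   # cosmetic trimming (assumption)
}
```

Thus, normalize_map '0 1000 10, 10 2000 10' yields two lines, one per tuple.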
--gid-map=map
When creating a new user namespace, this option can be used
(subject to permissions rules) to define an arbitrary GID map
for the new namespace.
The syntax of map is as for --uid-map.
This option can be employed only when creating a user names‐
pace (--user).
--boottime=offset
When creating a new time namespace, this option can be used to
specify the offset of the boot-time (CLOCK_BOOTTIME) clock, in
seconds.
This option can be specified only when creating a time names‐
pace (--time).
--monotonic=offset
When creating a new time namespace, this option can be used to
specify the offset of the monotonic (CLOCK_MONOTONIC) clock,
in seconds.
This option can be specified only when creating a time names‐
pace (--time).
--no-deny-setgroups
By default, when creating a user namespace, execution of the
setgroups(2) system call is disabled by writing the string
"deny" to the /proc/PID/setgroups file of a process inside the
namespace. (For details of the reasons why, see
user_namespaces(7).) This option can be used to disable the
step of modifying /proc/PID/setgroups in this way.
If you are not superuser (more precisely, you do not have the
CAP_SETGID capability), then updating the GID map for the user
namespace is likely to fail (and thus nscreate itself will
fail).
This option can be employed only when creating a user names‐
pace (--user).
--unshare
By default, nscreate performs its task by using clone(2) to
create the requested namespaces and create a child that exe‐
cutes program in those namespaces. If the --unshare option
is specified, then the new namespaces are instead created
using unshare(2), and program is executed directly (so that it
replaces the nscreate program).
Uses of the --unshare option include the following:
· This option must be used when creating time namespaces,
since current kernels don't support the creation of time
namespaces using clone().
· This option can be useful in commands of the following form,
where the shell itself is ultimately replaced by program:
$ exec nscreate -Ur --unshare program
Note that when using the --unshare option, the only mappings
that can be defined using --uid-map and --gid-map are mappings
that map just the user's UID and GID (i.e., ID-outside-NS must
be the user's own UID/GID and length must be 1).
-f, --fork
Execute program in a child process created using fork(2).
This option can be employed only in conjunction with --unshare
or --pid.
Using this option when creating a PID namespace (--pid)
ensures that program is executed as PID 1 in the new PID
namespace. This option is likewise useful when creating a
time namespace (--time) in order to ensure that program is run
in the new time namespace.
--propagation=type
This option is provided as a convenience for setting the prop‐
agation type of all the mount points in a new mount namespace.
type is one of the following:
private
Give all mount points "private" propagation. This is
the default, so that mount and unmount operations do
not have unintended side effects in the previous mount
namespace.
shared Give all mount points "shared" propagation.
slave Give all mount points "slave" propagation.
unchanged
Do not change the propagation type of mounts in the new
mount namespace; mount points preserve the propagation
that they had in the namespace from which they were
inherited.
Thus, for example, --propagation=private has the same effect
as issuing the following shell command in a new mount names‐
pace:
# mount --make-rprivate /
For further information on mount propagation, see
mount_namespaces(7), mount(2), mount(8), and the kernel source
file Documentation/filesystems/sharedsubtree.txt.
The --propagation option can be employed only when creating a
new mount namespace (--mount). If both --propagation and
--mount-proc are specified, --propagation is actioned first.
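The correspondence between propagation types and mount(8) commands might be sketched as below. Only the private case is stated in the text above; the shared and slave cases are an assumption by analogy, using the --make-rshared and --make-rslave options of mount(8). The function name is hypothetical.

```shell
# propagation_cmd TYPE: print the mount(8) command assumed to correspond
# to --propagation=TYPE (nothing is printed for "unchanged").
propagation_cmd() {
    case "$1" in
        private)   echo 'mount --make-rprivate /' ;;
        shared)    echo 'mount --make-rshared /' ;;   # assumption, by analogy
        slave)     echo 'mount --make-rslave /' ;;    # assumption, by analogy
        unchanged) : ;;                                # leave propagation as is
        *) echo "unknown propagation type: $1" >&2; return 1 ;;
    esac
}
```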
--mount-proc
Mount a proc(5) filesystem at /proc. This option is provided
as a convenience when creating a new PID namespace (--pid):
the provision of a new /proc mount ensures that tools such as
ps(1) and top(1) work correctly inside the new PID namespace.
The new mount is created by first setting the propagation type
of the existing /proc mount to private and then stacking a new
proc(5) mount at /proc. Since the parent of the stacked mount
is the previous /proc mount (which now has private propaga‐
tion), the stacked mount will not propagate to other mount
namespaces.
This option can be employed only in conjunction with the use
of the --mount/-m option to create a new mount namespace.
--child-exit-sig[=sig]
If nscreate terminates, then send the specified signal to the
child process that is executing program. The signal may be
specified by name (e.g., quit) or number (e.g., 3). If sig is
omitted, the default is kill (SIGKILL).
This option is mainly useful when creating a PID namespace, in
order to ensure that all processes in the namespace are termi‐
nated if nscreate itself terminates. This happens because the
child process running program has PID 1 (i.e., it is the init
process in the PID namespace) and if PID 1 in a PID namespace
terminates, then the kernel terminates all other processes in
the namespace. Note that aside from SIGKILL and SIGSTOP, the
only signals that can be sent to the init process of a PID
namespace are those signals for which the process has estab‐
lished a handler; see pid_namespaces(7).
This option can be employed only in the default "clone" mode
or in combination with both --unshare and --fork.
--no-new-privs
Set the no_new_privs process attribute (see prctl(2)). This
prevents the process from changing credentials or gaining
capabilities if program is a set-user-ID or set-group-ID pro‐
gram, or has file capabilities attached.
-h, --help
Display help text and exit.
Repeatable options
The following options are performed, in the order given on the com‐
mand-line, after all other options are actioned, just before program
is executed. Some of these options can (meaningfully) be repeated.
--make-caps-inheritable
Copy the process's permitted capabilities to the inheritable
set. This allows a process with nonzero UIDs to preserve
those capabilities when executing a new program, so long as
the program being executed also has inheritable capabilities.
This option is useful primarily when creating a user names‐
pace.
--make-caps-ambient
Copy the process's permitted capabilities to the inheritable
and ambient sets. (Copying to the inheritable set is neces‐
sary because a capability can't be made ambient without first
being made inheritable.) This allows a process with nonzero
UIDs to preserve capabilities when executing a new program.
This option is useful primarily when creating a user names‐
pace.
--setuid={uid|ruid,euid,suid}
Set the process user IDs. If just one user ID is specified,
set the real, effective, and saved set user IDs to the speci‐
fied (numeric) value. If three (numeric) UIDs are specified,
then use those values to set, respectively, the process's
real, effective, and saved set user IDs; in this case, -1 can
be specified as a value, meaning leave the corresponding UID
unchanged.
The UID(s) are interpreted relative to the UID map of the user
namespace.
This option is useful primarily when creating a new user
namespace.
--setgid={gid|rgid,egid,sgid}
Set the process group IDs. If just one group ID is specified,
set the real, effective, and saved set group IDs to the speci‐
fied (numeric) value. If three (numeric) GIDs are specified,
then use those values to set, respectively, the process's
real, effective, and saved set group IDs.
The GID(s) are interpreted relative to the GID map of the user
namespace.
This option is useful primarily when creating a new user
namespace.
--clear-groups
Clear the supplementary group list. If the --no-deny-set‐
groups option is not also specified, an error results. An
error likewise results if nscreate is not run as superuser
(or, more precisely, with the CAP_SETGID capability).
This option is useful when creating a user namespace in order
to ensure that the process running program does not inherit a
set of supplementary groups that do not exist in the GID map.
--secbits=[+-]flag[,...]
Set the process securebits flags. The option value is either
0, meaning clear all securebits flags (if possible), or a
comma-separated list of flag names, optionally preceded by a
plus ('+') or minus ('-') sign. If a plus sign is specified, then
the specified flags are enabled, while the remaining flags are
left unchanged. If a minus sign is specified, then the speci‐
fied flags are disabled, while the remaining flags are left
unchanged. If neither a plus nor a minus sign is specified,
then the specified flags are set and the remaining flags are
cleared.
The flags, which can be specified in either long or abbrevi‐
ated form, are as follows:
keep_caps / kc
Set the SECBIT_KEEP_CAPS flag, so that a process with
one or more zero user IDs does not lose permitted capa‐
bilities when it makes all of its user IDs nonzero.
This flag is automatically cleared when a new program
is executed. Thus, even if it is set, this flag will
be cleared when program is executed.
This flag provides a subset of the functionality of the
no_setuid_fixup flag, and is ignored by the kernel if
that flag is also set.
keep_caps_locked / kcl
Set the SECBIT_KEEP_CAPS_LOCKED flag.
Note: setting this flag does not prevent the
SECBIT_KEEP_CAPS flag from being cleared when a new
program is executed.
no_setuid_fixup / nsf
Set the SECBIT_NO_SETUID_FIXUP flag, so that switching
the process's user IDs between zero and nonzero
values does not cause any changes to the process's
capabilities.
no_setuid_fixup_locked / nsfl
Set the SECBIT_NO_SETUID_FIXUP_LOCKED flag.
noroot / nr
Set the SECBIT_NOROOT flag, so that the kernel does not
grant any capabilities to the process if it executes a
set-user-ID-root program or if the process executes a
program while having a real or effective user ID of
zero.
noroot_locked / nrl
Set the SECBIT_NOROOT_LOCKED flag.
no_cap_ambient_raise / ncar
Set the SECBIT_NO_CAP_AMBIENT_RAISE flag, so that the
kernel no longer permits the process to raise capabili‐
ties in its ambient set.
no_cap_ambient_raise_locked / ncarl
Set the SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED flag.
As can be seen from the above list, the flags are organized in
pairs: a "base" flag, and a corresponding "locked" flag. Set‐
ting a "locked" flag makes the corresponding "base" flag
immutable, and once set, a "locked" flag can't be cleared.
In order to modify the securebits settings, nscreate must be
run as superuser (or, more precisely, with the CAP_SETPCAP
capability).
For further details on the securebits feature, see
capabilities(7).
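The plus, minus, and unsigned forms of the option value can be modeled with simple bit arithmetic. The following is a sketch only: the function names are hypothetical, the bit values are those defined in <linux/securebits.h>, and the model ignores the effect of "locked" flags, which in practice prevent some changes.

```shell
# secbit_value FLAG: print the bit value of a securebits flag name.
secbit_value() {
    case "$1" in
        noroot|nr)                         echo 1 ;;
        noroot_locked|nrl)                 echo 2 ;;
        no_setuid_fixup|nsf)               echo 4 ;;
        no_setuid_fixup_locked|nsfl)       echo 8 ;;
        keep_caps|kc)                      echo 16 ;;
        keep_caps_locked|kcl)              echo 32 ;;
        no_cap_ambient_raise|ncar)         echo 64 ;;
        no_cap_ambient_raise_locked|ncarl) echo 128 ;;
        *) echo "unknown flag: $1" >&2; return 1 ;;
    esac
}

# apply_secbits CURRENT SPEC: print the securebits value that SPEC would
# yield when applied to CURRENT (locking is not modeled).
apply_secbits() {
    cur=$1 spec=$2
    if [ "$spec" = 0 ]; then echo 0; return 0; fi
    case "$spec" in
        +*) mode=add; spec=${spec#+} ;;   # enable listed flags, keep the rest
        -*) mode=del; spec=${spec#-} ;;   # disable listed flags, keep the rest
        *)  mode=set ;;                   # set listed flags, clear the rest
    esac
    mask=0
    old_ifs=$IFS; IFS=,
    for f in $spec; do
        v=$(secbit_value "$f") || { IFS=$old_ifs; return 1; }
        mask=$((mask | v))
    done
    IFS=$old_ifs
    case "$mode" in
        add) echo $((cur | mask)) ;;
        del) echo $((cur & ~mask)) ;;
        set) echo "$mask" ;;
    esac
}
```

For instance, starting from a value of 1 (noroot), the value +nsf gives 5, whereas the unsigned value nsf would give 4.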
--set-caps=cap-spec
Set process permitted, effective, and inheritable capabili‐
ties. cap-spec is as per the argument of cap_from_text(3).
--adj-caps=spec
Adjust the process capability sets. The argument has one of
the following forms:
<flags><op>all
<flags><op>[~]<cap>,...
flags specifies the capability sets to be modified and is a
list of one or more of the following letters: p (permitted); e
(effective); i (inheritable); a (ambient); and b (bounding).
op is either '+' or '-', indicating that the specified
capabilities are to be added to or removed from the sets specified by
flags.
Following op there may be either the word all, meaning all
capabilities, or a list of capabilities optionally preceded by
a tilde (~). In the latter form, capabilities can be speci‐
fied either symbolically (e.g., cap_kill) or as numbers. The
optional tilde causes the set to be inverted; thus,
~cap_sys_admin means all capabilities except CAP_SYS_ADMIN.
The capability sets are modified as individual operations, in
the order given in flags.
Note that there are many rules governing the changes that can
be made to capability sets; see capabilities(7) for details.
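To make the syntax concrete, the following sketch splits a spec into its parts. The function name and output format are hypothetical; the sketch checks only the shape of the argument and does not validate capability names.

```shell
# parse_adj_caps SPEC: split SPEC into its parts and print them as
# "flags=<sets> op=<+|-> invert=<yes|no> caps=<list>".
parse_adj_caps() {
    spec=$1
    flags=${spec%%[+-]*}      # letters before the first '+' or '-'
    rest=${spec#"$flags"}     # operator and everything after it
    op=${rest%"${rest#?}"}    # first character of rest
    caps=${rest#?}            # capability list (or "all")
    case "$flags" in
        ''|*[!peiab]*) echo "bad flags in: $spec" >&2; return 1 ;;
    esac
    case "$op" in
        +|-) ;;
        *) echo "missing operator in: $spec" >&2; return 1 ;;
    esac
    invert=no
    case "$caps" in
        '~'*) invert=yes; caps=${caps#'~'} ;;   # leading tilde inverts the set
    esac
    [ -n "$caps" ] || { echo "no capabilities in: $spec" >&2; return 1; }
    printf 'flags=%s op=%s invert=%s caps=%s\n' "$flags" "$op" "$invert" "$caps"
}
```

Thus, the spec b-~cap_sys_admin is read as: remove from the bounding set all capabilities except CAP_SYS_ADMIN.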
--dump[=opts]
Dump the current state of security-related information for the
process. This may be useful when debugging your understanding
of the effect of various nscreate command-line options.
The optional argument, opts, is a comma-separated list of one
or more of the following:
eids Dump process effective user ID and effective group ID.
This option is ignored if creds is also specified.
creds Dump all (i.e., real, effective, saved set) user and
group IDs.
groups Dump the process supplementary group list.
caps Dump process permitted, effective, and inheritable
capabilities in the format produced by cap_to_text(3).
secbits
Dump the process securebits flags.
If opts is omitted, the default is eids,caps.
--wait=num-secs
Pause execution for num-secs seconds. This allows you to per‐
form other actions (e.g., inspecting the state of the process
from another terminal) during the pause. As with --dump, this
option may be useful when debugging your understanding of the
effect of various nscreate command-line options.
EXIT STATUS
If program is successfully executed, then the exit status of nscreate
is the exit status of program. If program could not be executed
(e.g., because a child process could not be created, or program could
not be found), then nscreate exits with the status 1.
NOTES
Command-line interface design philosophy
nscreate shares much functionality with unshare(1) while at the same
time adding various functionality that is absent from the latter
command. Aside from the functional differences, there are also some
differences in design philosophy, including the following:
· The use of single-letter options is generally reserved for
frequently used options that either do not take arguments, or take
arguments that are short. This design choice reflects the fact
that there is a limited set of possible single-letter options, and
therefore the option letters should be consumed only in cases where
it is obviously useful to do so. This conservative approach leaves
single-letter options available for options that may be added in
the future. By contrast, the liberal use of single-letter options
in unshare(1) has led to some common options using surprising
single-letter options because the more obvious letter choice had
already been used by a previously added option that is in some
cases rarely used.
· "Minor" options generally do not imply "major" options. For
example (and unlike unshare(1)), the -r option does not imply the
-U option. The rationale here is that requiring the use of the
necessary major option leads to commands and scripts that are more
readable, and less prone to surprising errors. (For example, it is
quite easy to overlook that unshare -r creates a user namespace,
especially if the -r is interspersed with other options.)
EXAMPLES
In the following shell session, nscreate is used to run a shell in a
new user namespace (with root credential mappings) and a new UTS
namespace. In the new user namespace, the shell has a full set of
permitted and effective capabilities (=ep), which allows the user to
change the hostname in the new UTS namespace.
$ hostname # Show hostname in initial UTS namespace
bienne
$ nscreate -Uur bash --norc
# getpcaps $$
Capabilities for `24893': =ep
# hostname orinoco
# hostname
orinoco
The following command is equivalent to nscreate -Ur:
$ nscreate --user --uid-map="0 $(id -u) 1" --gid-map="0 $(id -g) 1"
Creating a persistent mount namespace
The following shell session demonstrates the creation of a mount
namespace that is made persistent via a bind mount on the pathname
/mnt/ns/mp. We first observe that initially the nearest ancestor
mount point is the root mount (/), which has shared propagation. As
noted above, this would lead to a failure when creating the bind
mount on /mnt/ns/mp. To avoid this, we create an intermediate bind
mount point whose propagation is made private. We then create the
mount point pathname and use nscreate to create a persistent mount
namespace.
# findmnt -n -o target,propagation --target /mnt/ns
/ shared
# mount --bind /mnt /mnt
# mount --make-private /mnt
# touch /mnt/ns/mp
# nscreate --mount=/mnt/ns/mp sh
Creating a time namespace
The following example demonstrates the creation of a time namespace
where the boottime clock is set to a point several years in the past:
# uptime -p # Show uptime in initial time namespace
up 30 minutes
# nscreate --unshare --fork --time --boottime=200000000 uptime -p
up 6 years, 18 weeks, 4 days, 20 hours, 4 minutes
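The relationship between the offset and the displayed uptime can be checked with a little arithmetic. The sketch below assumes that the display uses 365-day years, with weeks counted modulo 52 and days modulo 7; this assumption reproduces the output shown above. The function name is hypothetical, and the argument used in the usage note assumes a real uptime of 30 minutes and 40 seconds at the time the command was run.

```shell
# pretty_uptime SECONDS: break a boot-time clock value into the units
# shown above (365-day years; weeks modulo 52; days modulo 7).
pretty_uptime() {
    s=$1
    printf 'up %d years, %d weeks, %d days, %d hours, %d minutes\n' \
        $(( s / 31536000 )) \
        $(( s / 604800 % 52 )) \
        $(( s / 86400 % 7 )) \
        $(( s / 3600 % 24 )) \
        $(( s / 60 % 60 ))
}
```

Running pretty_uptime 200001840 (the 200000000-second offset plus 1840 seconds of real uptime) prints "up 6 years, 18 weeks, 4 days, 20 hours, 4 minutes".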
Creating a PID namespace with a properly mounted /proc filesystem
In the following example, a shell is executed in new user, PID, and
mount namespaces. The --mount-proc option ensures a correctly
mounted /proc filesystem for the new PID namespace, so that ps(1)
shows correct output for the PID namespace.
# PS1='ns2# ' nscreate -Urpm --mount-proc bash --norc
ns2# ps a
PID TTY STAT TIME COMMAND
1 pts/4 S 0:00 bash --norc
2 pts/4 R+ 0:00 ps a
Creating arbitrary UID and GID maps in a user namespace
In the following commands, we take a local copy of the nscreate exe‐
cutable and then assign suitable capabilities to that copy so that an
unprivileged user can create a user namespace with arbitrary UID and
GID maps.
$ cp $(which nscreate) .
$ sudo setcap cap_setuid,cap_setgid=pe nscreate
$ id -u; id -g
1000
1000
$ ./nscreate -U --uid-map="0 1000 10, 10 2000 10" \
--gid-map="0 1000 10" bash --norc
# id -u; id -g
0
0
# cat /proc/self/uid_map
0 1000 10
10 2000 10
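To see how such a map translates IDs, the following sketch looks up the outside ID that corresponds to a given inside ID. The function name is hypothetical; it reads map lines in the uid_map format from standard input.

```shell
# map_uid INSIDE_ID: read "<inside> <outside> <length>" lines on stdin and
# print the ID outside the namespace that INSIDE_ID maps to.
map_uid() {
    id=$1
    while read -r inside outside length; do
        # An ID is covered by a tuple if inside <= id < inside + length.
        if [ "$id" -ge "$inside" ] && [ "$id" -lt $((inside + length)) ]; then
            echo $((outside + id - inside))
            return 0
        fi
    done
    echo "unmapped"
    return 1
}
```

With the map shown above, printf '0 1000 10\n10 2000 10\n' | map_uid 15 prints 2005, while an ID of 25 is reported as unmapped.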
Demonstration of the effect of UID transitions on capabilities
In the following shell session, which is executed as an unprivileged
user (UID 1000), we make a local copy of the nscreate executable and
assign it the necessary capabilities that allow an unprivileged user
to create a user namespace with UID and GID maps that map a range of
IDs. We then twice use the local nscreate executable to create a
user namespace that maps UIDs (and GIDs) 1000 to 1009 to 0 to 9
inside the namespace. Within the user namespace, we run the
getpcaps(8) program, asking it to show its own capabilities.
$ id -u
1000
$ cp $(which nscreate) .
$ sudo setcap cap_setuid,cap_setgid=pe nscreate
$ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
getpcaps 0
Capabilities for `0': =ep
$ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
--setuid 1 getpcaps 0
Capabilities for `0': =
Looking at the results of the two executions of nscreate, we see that
in the first case getpcaps(8) shows that it has all permitted and
effective capabilities. In the second case, we used the --setuid
option to switch the process's UIDs from 0 to 1 inside the user
namespace, and as a result getpcaps(8) shows that it has no capabili‐
ties (because switching UIDs from 0 to nonzero causes the process's
permitted and effective capabilities to be cleared).
Capabilities, UID transitions, and securebits
The kernel automatically assigns the first process in a new user
namespace a full set of permitted and effective capabilities. How‐
ever, if the process changes its user IDs to nonzero values (inside
the user namespace), it loses all capabilities (see capabilities(7)).
We can illustrate this by once again making a local copy of the
nscreate program and assigning capabilities to the copy which allow
the creation of a user namespace with arbitrary UID and GID mappings.
Initially, the process has UID 0 inside the user namespace, but we
then use the --setuid option to switch the process's user IDs to a
nonzero value. The --dump option shows us that the process has con‐
sequently lost all capabilities:
$ id -u
1000
$ cp $(which nscreate) .
$ sudo setcap cap_setuid,cap_setgid=pe nscreate
$ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
--setuid 1 --dump /bin/true
eUID = 1; eGID = 0
capabilities: =
A process can enable the SECBIT_NO_SETUID_FIXUP securebits flag so
that its capabilities do not change when its user IDs transition
between 0 and nonzero values. We can illustrate this with the addi‐
tion of the --secbits option to the preceding command:
$ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
--secbits=no_setuid_fixup \
--setuid 1 --dump /bin/true
eUID = 1; eGID = 0
capabilities: =ep
Ambient capabilities
When a process that has nonzero UIDs executes a new program, its
capabilities are transformed according to the rules described in
capabilities(7). In particular, if the process has capabilities,
then in the usual case it will lose those capabilities during the
execve(2).
Using a local copy of nscreate with capabilities attached, we can see
this behavior. We create a user namespace that allows a range of
UIDs. The process in the user namespace switches its UID from 0 to
1, after first setting the SECBIT_NO_SETUID_FIXUP securebits flag, so
that the UID transition does not cause the process to lose capabili‐
ties. The --dump option allows us to verify that the process still
has capabilities before it executes getpcaps(8). However, after the
execve(2), getpcaps(8) shows us that the process no longer has any
capabilities.
$ id -u
1000
$ cp $(which nscreate) .
$ sudo setcap cap_setuid,cap_setgid=pe nscreate
$ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
--secbits=no_setuid_fixup --setuid 1 --dump getpcaps 0
eUID = 1; eGID = 0
capabilities: =ep
Capabilities for `0': =
The ambient capability set can be used to prevent a process losing
its capabilities in the above scenario. Before executing a new pro‐
gram, the process can place any of its permitted capabilities into
the ambient set, and those capabilities will (as per the transforma‐
tion rules described in capabilities(7)) be assigned to the process's
permitted and effective capabilities after the execve(2). Adding the
--make-caps-ambient option, which copies the process's existing per‐
mitted capabilities to the inheritable and ambient sets, can be used
to demonstrate this behavior:
$ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
--make-caps-ambient \
--secbits=no_setuid_fixup --setuid 1 --dump getpcaps 0
eUID = 1; eGID = 0
capabilities: =eip
Capabilities for `0': =eip
The last line, output by getpcaps(8), shows us that the process has
all capabilities in its permitted, effective, and inheritable sets.
Capabilities, exec, and securebits
As noted above, the kernel automatically assigns the first process in
a new user namespace a full set of permitted and effective capabili‐
ties. Suppose that the process then drops capabilities, while
retaining user ID 0 (inside the namespace). If the process subse‐
quently executes a new program, it once more regains all capabilities
because it did an execve(2) with UID 0 (see capabilities(7)). The
following command illustrates this behavior:
$ id -u
1000
$ nscreate -Ur --set-caps = --dump getpcaps 0
eUID = 0; eGID = 0
capabilities: =
Capabilities for `0': =ep
The --dump option shows us that just before getpcaps(8) was exe‐
cuted, the process had no capabilities. The last line of output,
produced by getpcaps(8), shows us that the process once again has all
capabilities.
A process with user ID 0 can enable the SECBIT_NOROOT securebits flag
so that its capabilities do not change when it executes a new pro‐
gram. To illustrate this, compare the output of the previous command
with the output of the following command:
$ nscreate -Ur --secbits=noroot --set-caps = --dump getpcaps 0
eUID = 0; eGID = 0
capabilities: =
Capabilities for `0': =
User namespace "set-user-ID-root" programs
Here, we demonstrate the operation of a per-user-namespace "set-user-
ID-root" program (see capabilities(7)). Again, we use a local copy
of the nscreate executable that has capabilities assigned. We then
(as unprivileged user ID 1000) make a local copy of the getpcaps(8)
program and make it a set-user-ID program (i.e., the program will
cause the effective user ID of the executing process to switch to
1000). Finally, we use the local nscreate executable to create a
user namespace that maps UIDs (and GIDs) 1000 to 1009 to 0 to 9
inside the namespace, switch user ID to 1 inside the user namespace
(which causes the process to lose all capabilities), and then execute
the local copy of getpcaps(8).
$ id -u
1000
$ cp $(which nscreate) .
$ sudo setcap cap_setuid,cap_setgid=pe nscreate
$ getcap nscreate # Verify capabilities of local nscreate
nscreate = cap_setgid,cap_setuid+ep
$ cp $(which getpcaps) .
$ chmod u+s getpcaps
$ ls -ln getpcaps # Verify ownership and permissions
-rwsr-xr-x. 1 1000 1000 15992 Jun 10 08:24 getpcaps
$ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
--setuid 1 ./getpcaps 0
Capabilities for `0': =ep
Looking at the above output, we see that the process running getpcaps
has all capabilities (inside the user namespace). This happened
because the set-user-ID getpcaps executable caused the process's user
ID to switch to 1000, which is equivalent to 0 inside the user names‐
pace, and as a result the process gained all permitted and effective
capabilities.
AUTHORS
Michael Kerrisk ⟨mtk.manpages@gmail.com⟩
SEE ALSO
findmnt(1), nsenter(1), setpriv(1), unshare(1), clone(2), unshare(2),
namespaces(7), mount(8)
COLOPHON
This page is part of the util-linux (a random collection of Linux
utilities) project. Information about the project can be found at
⟨https://www.kernel.org/pub/linux/utils/util-linux/⟩. If you have a
bug report for this manual page, send it to
util-linux@vger.kernel.org. This page was obtained from the
project's upstream Git repository
⟨git://git.kernel.org/pub/scm/utils/util-linux/util-linux.git⟩ on
2020-07-14. (At that time, the date of the most recent commit that
was found in the repository was 2020-07-14.) If you discover any
rendering problems in this HTML version of the page, or you believe
there is a better or more up-to-date source for the page, or you have
corrections or improvements to the information in this COLOPHON
(which is not part of the original manual page), send a mail to
man-pages@man7.org
secisol-tools 2020-07-01 NSCREATE(1)