nscreate(1) - Linux manual page

NSCREATE(1)                     User Commands                    NSCREATE(1)

NAME

       nscreate - run program in new namespaces

SYNOPSIS

       nscreate [options] [program [arguments]]

DESCRIPTION

       The nscreate command creates new namespaces (as specified by the
       command-line options described below) and then executes the specified
       program with arguments.

       nscreate provides two modes of operation.  The default mode uses the
       clone(2) system call to create a child process that is placed in the
       new namespaces and which executes program.  The other mode, employed
       when the --unshare option is specified, uses unshare(2) to create the
       new namespaces and then directly executes program.

       By default, a new namespace remains in existence only as long as it
       has at least one member process.  A namespace can be made persistent
       (that is, pinned into existence even when it has no member processes)
       by bind mounting the corresponding /proc/PID/ns/ns-type file.
       nscreate provides command-line options to simplify the creation of
       such bind mounts.  A persistent namespace can later be entered using
       nsenter(1), even after program has terminated.  A persistent
       namespace can be unpinned by unmounting the bind mount.

       If program is not supplied, then the program identified by the SHELL
       environment variable is run; if SHELL is not defined, then /bin/sh is
       executed.

       The following types of namespaces can be created using nscreate:

       cgroup namespace
              Cgroup namespaces virtualize the view of cgroups seen in
              /proc/[pid]/cgroup and /proc/[pid]/mountinfo.  For further
              details, see cgroup_namespaces(7) and cgroups(7).

       IPC namespace
              Processes within an IPC namespace have private instances of
              certain interprocess communication resources, namely System V
              IPC objects (message queues, semaphores, shared memory) and
              POSIX message queues.  For further details, see
              ipc_namespaces(7).

       mount namespace
              Processes within a mount namespace share a set of mount
              points.  Processes in different mount namespaces thus see
              distinct single directory hierarchies.  Mounting and
              unmounting filesystems in one mount namespace will not affect
              processes in other mount namespaces, except where a mount
              point has shared propagation.  For further details, see
              mount_namespaces(7), mount(2), mount(8), and the kernel source
              file Documentation/filesystems/sharedsubtree.txt.

       network namespace
              Processes in a network namespace share private instances of
              various networking resources, such as networking devices, IPv4
              and IPv6 protocol stacks, routing tables, firewall rules, and
              socket port numbers.  Thus, for example, each network
              namespace can have its own (virtual) network device with its
              own IP address, and each network namespace can have a web
              server running on port 80.  For further details, see
              network_namespaces(7).

       PID namespace
              PID namespaces isolate the PID number space, meaning that the
              PIDs of processes within a PID namespace are private to that
              namespace.  For further details, see pid_namespaces(7).

       time namespace
              Time namespaces virtualize the values of certain system
              clocks, namely the boot-time and monotonic clocks.  Thus, the
              processes within a time namespace share the same values for
              these clocks, but the values of the clocks may be different in
              other time namespaces.  For further details, see
              time_namespaces(7).

       user namespace
              User namespaces virtualize certain security-related
              identifiers and attributes, such as user IDs, group IDs, and
              capabilities.  Practically speaking, this means that a process
              may have certain credentials—for example, UID and GID 0, and
              all capabilities (i.e., superuser powers)—inside a user
              namespace, while at the same time having nonzero credentials
              and no capabilities outside that user namespace.  For further
              details, see user_namespaces(7), capabilities(7), and
              credentials(7).

       UTS namespace
              Processes within a UTS namespace share a private instance of
              two system identifiers: the hostname and the NIS domain name.
              For further details, see uts_namespaces(7).

OPTIONS

   Options for creating namespaces
       The following options can be used to create new namespaces.  The
       short-form options take no argument.  The long-form options take an
       optional argument, which is the pathname of an existing file that
       will be used as the target when creating a bind mount in order to
       make the namespace persistent.

       -c, --cgroup[=pathname]
              Create a new cgroup namespace.  If pathname is specified, then
              the namespace is made persistent by bind mounting the
              corresponding /proc/PID/ns/cgroup magic link on the regular
              file specified by pathname.

       -i, --ipc[=pathname]
              Create a new IPC namespace.  If pathname is specified, then
              the namespace is made persistent by bind mounting the
              corresponding /proc/PID/ns/ipc magic link on the regular file
              specified by pathname.

       -m, --mount[=pathname]
              Create a new mount namespace.  If pathname is specified, then
              the namespace is made persistent by bind mounting the
              corresponding /proc/PID/ns/mnt magic link on the regular file
              specified by pathname.  Note that creating this bind mount
              will fail if the propagation type of the parent mount of
              pathname is shared.  (The kernel disallows creation of the
              bind mount in this scenario because propagation of the mount
              point might lead to a circular dependency that would mean that
              the mount namespace could never be freed.)  See EXAMPLES for
              an example of how to ensure that the parent mount does not
              have shared propagation.
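
              The propagation type that this restriction depends on is
              recorded in the optional fields of /proc/PID/mountinfo (see
              proc(5)).  The following Python sketch is illustrative only and
              is not part of nscreate; the function name is hypothetical.  It
              classifies a single mountinfo line, treating a line with no
              propagation tags as private.

```python
def propagation_from_mountinfo(line: str) -> str:
    """Classify the propagation type encoded in one mountinfo line."""
    fields = line.split()
    # Fields 0-5 are fixed; optional fields follow and end at a lone "-".
    separator = fields.index("-", 6)
    tags = [field.split(":")[0] for field in fields[6:separator]]

    kinds = []
    if "shared" in tags:
        kinds.append("shared")
    if "master" in tags:          # "master:N" marks a slave mount
        kinds.append("slave")
    if "unbindable" in tags:
        kinds.append("unbindable")
    # No propagation tags at all means the mount is private.
    return ",".join(kinds) if kinds else "private"
```

              A pathname whose parent mount is classified as "shared" would
              need the intermediate private bind mount shown in EXAMPLES
              before it can be used with --mount=pathname.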

       -p, --pid[=pathname]
              Create a new PID namespace.  If pathname is specified, then
              the namespace is made persistent by bind mounting the
              corresponding /proc/PID/ns/pid_for_children magic link on the
              regular file specified by pathname.  Note, however, that even
              if a PID namespace is made persistent, it will no longer be
              usable (e.g., it can't be entered with nsenter(1)) if its init
              process has terminated.

              If the --unshare option is also employed, then the --fork
              option must additionally be employed in order to create the
              bind mount.

       -n, --net[=pathname]
              Create a new network namespace.  If pathname is specified,
              then the namespace is made persistent by bind mounting the
              corresponding /proc/PID/ns/net magic link on the regular file
              specified by pathname.

       -t, --time[=pathname]
              Create a new time namespace.  If pathname is specified, then
              the namespace is made persistent by bind mounting the
              corresponding /proc/PID/ns/time_for_children magic link on the
              regular file specified by pathname.

              In order to create a new time namespace, the --unshare option
              must also be specified (or an error results).  Typically, you
              will also want to specify the --fork option, so that program
              is run in a process in the new namespace; without the --fork
              option, only the child processes created by program will
              reside in the new namespace.

              See also --boottime and --monotonic.

       -u, --uts[=pathname]
              Create a new UTS namespace.  If pathname is specified, then
              the namespace is made persistent by bind mounting the
              corresponding /proc/PID/ns/uts magic link on the regular file
              specified by pathname.

       -U, --user[=pathname]
              Create a new user namespace.  If pathname is specified, then
              the namespace is made persistent by bind mounting the
              corresponding /proc/PID/ns/user magic link on the regular file
              specified by pathname.

   Other options
       -r, --map-root-user
              When creating a new user namespace, create the so-called root
              credential mappings: the user's UID and GID (i.e., the
              effective UID and GID under which nscreate is being run) are
              mapped to 0 (root) inside the new user namespace, before
              program is executed.  This means that the process that
              executes program will maintain root privileges (i.e., all
              capabilities) in the user namespace.  (Without this option,
              the process's capabilities will be cleared during execve(2),
              as described in capabilities(7).)

              This option can be employed only when creating a user
              namespace (--user).

       --uid-map=map
              When creating a new user namespace, this option can be used
              (subject to permissions rules described in user_namespaces(7))
              to define an arbitrary UID map for the new namespace.

              The map string consists of a series of numeric three-tuples of
              the form:

                  <ID-inside-NS> <ID-outside-NS> <length>

              The tuples must be separated either by newline characters or
              by commas (which are replaced by newline characters before the
              strings are written to the map files).  For a description of
              the meaning of the three numbers in each tuple, see
              user_namespaces(7); see also EXAMPLES, below.

              This option can be employed only when creating a user names‐
              pace (--user).
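
              The map syntax can be illustrated with a short Python sketch.
              It is not nscreate's implementation: the function names are
              hypothetical, and normalizing the whitespace inside each tuple
              is an assumption made here for readability.  The sketch treats
              commas as newlines and checks that each tuple holds three
              numbers.

```python
def parse_id_map(map_str: str) -> list[tuple[int, int, int]]:
    """Parse a map string into (ID-inside-NS, ID-outside-NS, length) tuples."""
    entries = []
    # Commas are equivalent to newlines as tuple separators.
    for line in map_str.replace(",", "\n").split("\n"):
        if not line.strip():
            continue  # ignore empty segments
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"expected three numbers, got: {line!r}")
        inside, outside, length = (int(part) for part in parts)
        entries.append((inside, outside, length))
    return entries


def map_file_text(map_str: str) -> str:
    """Return the one-tuple-per-line text that would go into a map file."""
    return "".join(f"{i} {o} {n}\n" for i, o, n in parse_id_map(map_str))
```

              For example, the string "0 1000 10, 10 2000 10" yields two
              tuples: IDs 0 to 9 inside the namespace map to 1000 to 1009
              outside it, and IDs 10 to 19 map to 2000 to 2009.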

       --gid-map=map
              When creating a new user namespace, this option can be used
              (subject to permissions rules) to define an arbitrary GID map
              for the new namespace.

              The syntax of map is as for --uid-map.

              This option can be employed only when creating a user names‐
              pace (--user).

       --boottime=offset
              When creating a new time namespace, this option can be used to
              specify the offset of the boot-time (CLOCK_BOOTTIME) clock, in
              seconds.

              This option can be specified only when creating a time names‐
              pace (--time).

       --monotonic=offset
              When creating a new time namespace, this option can be used to
              specify the offset of the monotonic (CLOCK_MONOTONIC) clock,
              in seconds.

              This option can be specified only when creating a time names‐
              pace (--time).

       --no-deny-setgroups
              By default, when creating a user namespace, execution of the
              setgroups(2) system call is disabled by writing the string
              "deny" to the /proc/PID/setgroups file of a process inside the
              namespace.  (For details of the reasons why, see
              user_namespaces(7).)  This option can be used to disable the
              step of modifying /proc/PID/setgroups in this way.

              If this option is specified and you are not superuser (more
              precisely, you do not have the CAP_SETGID capability), then
              updating the GID map for the user namespace is likely to fail
              (and thus nscreate itself will fail).

              This option can be employed only when creating a user names‐
              pace (--user).

       --unshare
              By default, nscreate performs its task by using clone(2) to
              create the requested namespaces and create a child that
              executes program in those namespaces.  If the --unshare option
              is specified, then the new namespaces are instead created
              using unshare(2), and program is executed directly (so that it
              replaces the nscreate program).

              Uses of the --unshare option include the following:

              · This option must be used when creating time namespaces,
                since current kernels don't support the creation of time
                namespaces using clone(2).

              · This option can be useful in commands of the following form,
                where the shell itself is ultimately replaced by program:

                    $ exec nscreate -Ur --unshare program

              Note that when using the --unshare option, the only mappings
              that can be defined using --uid-map and --gid-map are mappings
              that map just the user's UID and GID (i.e., ID-outside-NS must
              be the user's own UID/GID and length must be 1).

       -f, --fork
              Execute program in a child process created using fork(2).
              This option can be employed only in conjunction with --unshare
              or --pid.

              Using this option when creating a PID namespace (--pid)
              ensures that program is executed as PID 1 in the new PID
              namespace.  This option is likewise useful when creating a
              time namespace (--time) in order to ensure that program is run
              in the new time namespace.

       --propagation=type
              This option is provided as a convenience for setting the prop‐
              agation type of all the mount points in a new mount namespace.
              type is one of the following:

              private
                     Give all mount points "private" propagation.  This is
                     the default, so that mount and unmount operations do
                     not have unintended side effects in the previous mount
                     namespace.

              shared Give all mount points "shared" propagation.

              slave  Give all mount points "slave" propagation.

              unchanged
                     Do not change the propagation type of mounts in the new
                     mount namespace; mount points preserve the propagation
                     that they had in the namespace from which they were
                     inherited.

              Thus, for example, --propagation=private has the same effect
              as issuing the following shell command in a new mount names‐
              pace:

                  # mount --make-rprivate /

              For further information on mount propagation, see
              mount_namespaces(7), mount(2), mount(8), and the kernel source
              file Documentation/filesystems/sharedsubtree.txt.

              The --propagation option can be employed only when creating a
              new mount namespace (--mount).  If both --propagation and
              --mount-proc are specified, --propagation is actioned first.

       --mount-proc
              Mount a proc(5) filesystem at /proc.  This option is provided
              as a convenience when creating a new PID namespace (--pid):
              the provision of a new /proc mount ensures that tools such as
              ps(1) and top(1) work correctly inside the new PID namespace.

              The new mount is created by first setting the propagation type
              of the existing /proc mount to private and then stacking a new
              proc(5) mount at /proc.  Since the parent of the stacked mount
              is the previous /proc mount (which now has private propaga‐
              tion), the stacked mount will not propagate to other mount
              namespaces.

              This option can be employed only in conjunction with the use
              of the --mount/-m option to create a new mount namespace.

       --child-exit-sig[=sig]
              If nscreate terminates, then send the specified signal to the
              child process that is executing program.  The signal may be
              specified by name (e.g., quit) or number (e.g., 3).  If sig is
              omitted, the default is kill (SIGKILL).

              This option is mainly useful when creating a PID namespace, in
              order to ensure that all processes in the namespace are termi‐
              nated if nscreate itself terminates.  This happens because the
              child process running program has PID 1 (i.e., it is the init
              process in the PID namespace) and if PID 1 in a PID namespace
              terminates, then the kernel terminates all other processes in
              the namespace.  Note that aside from SIGKILL and SIGSTOP, the
              only signals that can be sent to the init process of a PID
              namespace are those signals for which the process has estab‐
              lished a handler; see pid_namespaces(7).

              This option can be employed only in the default "clone" mode
              or in combination with both --unshare and --fork.
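
              The two accepted forms of sig (name or number) can be
              illustrated with the following Python sketch.  The function
              name is hypothetical, and accepting an optional "SIG" prefix
              and any letter case is an assumption made here, not documented
              nscreate behavior.

```python
import signal


def parse_signal(spec=None) -> signal.Signals:
    """Translate a --child-exit-sig style argument into a signal."""
    if spec is None:
        return signal.SIGKILL              # documented default when sig is omitted
    if spec.isdigit():
        return signal.Signals(int(spec))   # numeric form, e.g. "3"
    name = spec.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name                # "quit" -> "SIGQUIT"
    return signal.Signals[name]            # KeyError for an unknown name
```

              With this sketch, both "quit" and "3" resolve to SIGQUIT on
              Linux.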

       --no-new-privs
              Set the no_new_privs process attribute (see prctl(2)).  This
              prevents the process from changing credentials or gaining
              capabilities if program is a set-user-ID or set-group-ID pro‐
              gram, or has file capabilities attached.

       -h, --help
              Display help text and exit.

   Repeatable options
       The following options are performed, in the order given on the
       command line, after all other options are actioned, just before
       program is executed.  Some of these options can (meaningfully) be
       repeated.

       --make-caps-inheritable
              Copy the process's permitted capabilities to the inheritable
              set.  This allows a process with nonzero UIDs to preserve
              those capabilities when executing a new program, so long as
              the program being executed also has inheritable capabilities.
              This option is useful primarily when creating a user names‐
              pace.

       --make-caps-ambient
              Copy the process's permitted capabilities to the inheritable
              and ambient sets.  (Copying to the inheritable set is neces‐
              sary because a capability can't be made ambient without first
              being made inheritable.)  This allows a process with nonzero
              UIDs to preserve capabilities when executing a new program.
              This option is useful primarily when creating a user names‐
              pace.

       --setuid={uid|ruid,euid,suid}
              Set the process user IDs.  If just one user ID is specified,
              set the real, effective, and saved set user IDs to the speci‐
              fied (numeric) value.  If three (numeric) UIDs are specified,
              then use those values to set, respectively, the process's
              real, effective, and saved set user IDs; in this case, -1 can
              be specified as a value, meaning leave the corresponding UID
              unchanged.

              The UID(s) are interpreted relative to the UID map of the user
              namespace.

              This option is useful primarily when creating a new user
              namespace.
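
              As a rough illustration of the two argument forms, the
              following Python sketch parses a --setuid value into a (real,
              effective, saved) triple.  The function name is hypothetical
              and the validation rules are assumptions.  The triple has the
              shape expected by setresuid(2), which also uses -1 to mean
              "leave unchanged".

```python
def parse_setuid(value: str) -> tuple[int, int, int]:
    """Parse 'uid' or 'ruid,euid,suid' into a (real, effective, saved) triple."""
    ids = [int(part) for part in value.split(",")]
    if len(ids) == 1:
        if ids[0] < 0:
            raise ValueError("a single UID must be nonnegative")
        return (ids[0], ids[0], ids[0])    # one value sets all three IDs
    if len(ids) == 3:
        if any(uid < -1 for uid in ids):
            raise ValueError("UIDs must be nonnegative, or -1 for unchanged")
        return (ids[0], ids[1], ids[2])    # -1 leaves that ID unchanged
    raise ValueError("expected one UID or three comma-separated UIDs")
```

              Thus "1000" sets all three IDs to 1000, while "-1,0,-1" changes
              only the effective UID.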

       --setgid={gid|rgid,egid,sgid}
              Set the process group IDs.  If just one group ID is specified,
              set the real, effective, and saved set group IDs to the speci‐
              fied (numeric) value.  If three (numeric) GIDs are specified,
              then use those values to set, respectively, the process's
              real, effective, and saved set group IDs.

              The GID(s) are interpreted relative to the GID map of the user
              namespace.

              This option is useful primarily when creating a new user
              namespace.

       --clear-groups
              Clear the supplementary group list.  If the
              --no-deny-setgroups option is not also specified, an error
              results.  An error likewise results if nscreate is not run as
              superuser (or, more precisely, with the CAP_SETGID
              capability).

              This option is useful when creating a user namespace in order
              to ensure that the process running program does not inherit a
              set of supplementary groups that do not exist in the GID map.

       --secbits=[+-]flag[,...]
              Set the process securebits flags.  The option value is either
              0, meaning clear all securebits flags (if possible), or a
              comma-separated list of flag names, optionally preceded by a
              plus ('+') or minus ('-') sign.  If a plus sign is specified,
              then the specified flags are enabled, while the remaining
              flags are left unchanged.  If a minus sign is specified, then
              the specified flags are disabled, while the remaining flags
              are left unchanged.  If neither a plus nor a minus sign is
              specified, then the specified flags are set and the remaining
              flags are cleared.

              The flags, which can be specified in either long or abbrevi‐
              ated form, are as follows:

              keep_caps / kc
                     Set the SECBIT_KEEP_CAPS flag, so that a process with
                     one or more zero user IDs does not lose permitted capa‐
                     bilities when it makes all of its user IDs nonzero.

                     This flag is automatically cleared when a new program
                     is executed.  Thus, even if it is set, this flag will
                     be cleared when program is executed.

                     This flag provides a subset of the functionality of the
                     no_setuid_fixup flag, and is ignored by the kernel if
                     that flag is also set.

              keep_caps_locked / kcl
                     Set the SECBIT_KEEP_CAPS_LOCKED flag.

                     Note: setting this flag does not prevent the
                     SECBIT_KEEP_CAPS flag from being cleared when a new
                     program is executed.

              no_setuid_fixup / nsf
                     Set the SECBIT_NO_SETUID_FIXUP flag, so that switching
                     the process's user IDs between zero and nonzero values
                     does not cause any changes to the process's
                     capabilities.

              no_setuid_fixup_locked / nsfl
                     Set the SECBIT_NO_SETUID_FIXUP_LOCKED flag.

              noroot / nr
                     Set the SECBIT_NOROOT flag, so that the kernel does not
                     grant any capabilities to the process if it executes a
                     set-user-ID-root program or if the process executes a
                     program while having a real or effective user ID of
                     zero.

              noroot_locked / nrl
                     Set the SECBIT_NOROOT_LOCKED flag.

              no_cap_ambient_raise / ncar
                     Set the SECBIT_NO_CAP_AMBIENT_RAISE flag, so that the
                     kernel no longer permits the process to raise capabili‐
                     ties in its ambient set.

              no_cap_ambient_raise_locked / ncarl
                     Set the SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED flag.

              As can be seen from the above list, the flags are organized in
              pairs: a "base" flag, and a corresponding "locked" flag.  Set‐
              ting a "locked" flag makes the corresponding "base" flag
              immutable, and once set, a "locked" flag can't be cleared.

              In order to modify the securebits settings, nscreate must be
              run as superuser (or, more precisely, with the CAP_SETPCAP
              capability).

              For further details on the securebits feature, see
              capabilities(7).
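
              The plus, minus, and unsigned forms described above can be
              modeled with a small Python sketch.  It only computes the
              requested flag value; it does not model the kernel's refusal to
              change locked flags, and the function name is hypothetical.
              The bit positions are those defined in the kernel header
              <linux/securebits.h>.

```python
# Bit values as defined in <linux/securebits.h>.
SECBITS = {
    "noroot": 0x01, "nr": 0x01,
    "noroot_locked": 0x02, "nrl": 0x02,
    "no_setuid_fixup": 0x04, "nsf": 0x04,
    "no_setuid_fixup_locked": 0x08, "nsfl": 0x08,
    "keep_caps": 0x10, "kc": 0x10,
    "keep_caps_locked": 0x20, "kcl": 0x20,
    "no_cap_ambient_raise": 0x40, "ncar": 0x40,
    "no_cap_ambient_raise_locked": 0x80, "ncarl": 0x80,
}


def requested_secbits(current: int, spec: str) -> int:
    """Return the securebits value requested by a --secbits style argument."""
    if spec == "0":
        return 0                          # clear everything
    sign = spec[0] if spec[0] in "+-" else ""
    mask = 0
    for name in spec[len(sign):].split(","):
        mask |= SECBITS[name]             # KeyError for an unknown flag name
    if sign == "+":
        return current | mask             # enable listed flags, keep the rest
    if sign == "-":
        return current & ~mask            # disable listed flags, keep the rest
    return mask                           # set listed flags, clear the rest
```

              For instance, starting from keep_caps alone (0x10), the
              argument "+noroot,nrl" requests 0x13, whereas the unsigned
              argument "noroot" requests 0x01.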

       --set-caps=cap-spec
              Set process permitted, effective, and inheritable capabili‐
              ties.  cap-spec is as per the argument of cap_from_text(3).

       --adj-caps=spec
              Adjust the process capability sets.  The argument has one of
              the following forms:

                  <flags><op>all
                  <flags><op>[~]<cap>,...

              flags specifies the capability sets to be modified and is a
              list of one or more of the following letters: p (permitted); e
              (effective); i (inheritable); a (ambient); and b (bounding).

              op is either '+' or '-' indicating that the specified capabil‐
              ities are to be added or removed from the sets specified by
              flags.

              Following op there may be either the word all, meaning all
              capabilities, or a list of capabilities optionally preceded by
              a tilde (~).  In the latter form, capabilities can be speci‐
              fied either symbolically (e.g., cap_kill) or as numbers.  The
              optional tilde causes the set to be inverted; thus,
              ~cap_sys_admin means all capabilities except CAP_SYS_ADMIN.

              The capability sets are modified as individual operations, in
              the order given in flags.

              Note that there are many rules governing the changes that can
              be made to capability sets; see capabilities(7) for details.
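
              The structure of the argument can be illustrated with the
              following Python sketch, which splits a spec into its parts
              without applying it.  The function name and the returned
              dictionary layout are hypothetical, and the sketch assumes that
              a tilde applies to the whole capability list.

```python
import re

# <flags><op><rest>: one or more set letters, then '+' or '-', then the rest.
_SPEC = re.compile(r"^([peiab]+)([+-])(.+)$")


def parse_adj_caps(spec: str) -> dict:
    """Split an --adj-caps style argument into flags, op, and capabilities."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"malformed spec: {spec!r}")
    flags, op, rest = match.groups()
    if rest == "all":
        return {"flags": flags, "op": op, "all": True,
                "invert": False, "caps": []}
    invert = rest.startswith("~")         # tilde inverts the listed set
    names = rest[1:] if invert else rest
    return {"flags": flags, "op": op, "all": False,
            "invert": invert, "caps": names.split(",")}
```

              Under these assumptions, "b-~cap_sys_admin" parses as: sets
              "b", operation "-", inverted list containing cap_sys_admin.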

       --dump[=opts]
              Dump the current state of security-related information for the
              process.  This may be useful when debugging your understanding
              of the effect of various nscreate command-line options.

              The optional argument, opts, is a comma-separated list of one
              or more of the following:

              eids   Dump process effective user ID and effective group ID.
                     This option is ignored if creds is also specified.

              creds  Dump all (i.e., real, effective, saved set) user and
                     group IDs.

              groups Dump the process supplementary group list.

              caps   Dump process permitted, effective, and inheritable
                     capabilities in the format produced by cap_to_text(3).

              secbits
                     Dump the process securebits flags.

              If opts is omitted, the default is eids,caps.

       --wait=num-secs
              Pause execution for num-secs seconds.  This allows you to per‐
              form other actions (e.g., inspecting the state of the process
              from another terminal) during the pause.  As with --dump, this
              option may be useful when debugging your understanding of the
              effect of various nscreate command-line options.

EXIT STATUS

       If program is successfully executed, then the exit status of nscreate
       is the exit status of program.  If program could not be executed
       (e.g., because a child process could not be created, or program could
       not be found), then nscreate exits with the status 1.

NOTES

   Command-line interface design philosophy
       nscreate shares much functionality with unshare(1) while at the same
       time adding various functionality that is absent from the latter
       command.  Aside from the functional differences, there are also some
       differences in design philosophy, including the following:

       · The use of single-letter options is generally reserved for
         frequently used options that either do not take arguments, or take
         arguments that are short.  This design choice reflects the fact
         that there is a limited set of possible single-letter options, and
         therefore the option letters should be consumed only in cases where
         it is obviously useful to do so.  This conservative approach leaves
         single-letter options available for options that may be added in
         the future.  By contrast, the liberal use of single-letter options
         in unshare(1) has meant that some common options were assigned
         surprising letters, because the more obvious letter had already
         been taken by a previously added (and in some cases rarely used)
         option.

       · "Minor" options generally do not imply "major" options.  For
         example (and unlike unshare(1)), the -r option does not imply the
         -U option.  The rationale here is that requiring the use of the
         necessary major option leads to commands and scripts that are more
         readable, and less prone to surprising errors.  (For example, it is
         quite easy to overlook that unshare -r creates a user namespace,
         especially if the -r is interspersed with other options.)
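
       The "minor options do not imply major options" rule can be pictured
       as an explicit dependency check.  The following Python sketch is
       illustrative only: the table collects the requirements stated in
       OPTIONS, the function name is hypothetical, and how nscreate actually
       validates its options is not described here.  Options are assumed to
       have been normalized to their long forms.

```python
# Minor option -> major options that must also be given (from OPTIONS).
REQUIRES = {
    "--map-root-user": ["--user"],
    "--uid-map": ["--user"],
    "--gid-map": ["--user"],
    "--no-deny-setgroups": ["--user"],
    "--boottime": ["--time"],
    "--monotonic": ["--time"],
    "--time": ["--unshare"],
    "--propagation": ["--mount"],
    "--mount-proc": ["--mount"],
}


def missing_major_options(given: set[str]) -> list[str]:
    """Return one message per option whose required option is absent."""
    errors = []
    for option in sorted(given):
        for needed in REQUIRES.get(option, []):
            if needed not in given:
                errors.append(f"{option} requires {needed}")
    # --fork is valid only together with --unshare or --pid.
    if "--fork" in given and not ({"--unshare", "--pid"} & given):
        errors.append("--fork requires --unshare or --pid")
    return errors
```

       In this model, giving --map-root-user without --user is reported as
       an error rather than silently creating a user namespace.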

EXAMPLES

       In the following shell session, nscreate is used to run a shell in a
       new user namespace (with root credential mappings) and a new UTS
       namespace.  In the new user namespace, the shell has a full set of
       permitted and effective capabilities (=ep), which allows the user to
       change the hostname in the new UTS namespace.

           $ hostname              # Show hostname in initial UTS namespace
           bienne
           $ nscreate -Uur bash --norc
           # getpcaps $$
           Capabilities for `24893' =ep
           # hostname orinoco
           # hostname
           orinoco

       The following command is equivalent to nscreate -Ur:

           $ nscreate --user --uid-map="0 $(id -u) 1" --gid-map="0 $(id -g) 1"

   Creating a persistent mount namespace
       The following shell session demonstrates the creation of a mount
       namespace that is made persistent via a bind mount on the pathname
       /mnt/ns/mp.  We first observe that initially the nearest ancestor
       mount point is the root mount (/), which has shared propagation.  As
       noted above, this would lead to a failure when creating the bind
       mount on /mnt/ns/mp.  To avoid this, we create an intermediate bind
       mount point whose propagation is made private.  We then create the
       mount point pathname and use nscreate to create a persistent mount
       namespace.

           # findmnt -n -o target,propagation --target /mnt/ns
           /      shared
           # mount --bind /mnt /mnt
           # mount --make-private /mnt
           # touch /mnt/ns/mp
           # nscreate --mount=/mnt/ns/mp sh

   Creating a time namespace
       The following example demonstrates the creation of a time namespace
       where the boottime clock is set to a point several years in the past:

           # uptime -p             # Show uptime in initial time namespace
           up 30 minutes
           # nscreate --unshare --fork --time --boottime=200000000 uptime -p
           up 6 years, 18 weeks, 4 days, 20 hours, 4 minutes
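
       The arithmetic behind this output can be checked with a short Python
       sketch.  It assumes that uptime -p derives each unit independently
       (years from 365-day years, weeks modulo 52, days modulo 7, and so
       on), and that the second command ran about 31 minutes after boot;
       both are assumptions made for illustration.  The function name is
       hypothetical, and decades are omitted for brevity.

```python
def pretty_uptime(seconds: int) -> str:
    """Format an uptime in seconds in the style of 'uptime -p' (assumed)."""
    units = [
        ("year", seconds // (365 * 86400) % 10),
        ("week", seconds // (7 * 86400) % 52),
        ("day", seconds // 86400 % 7),
        ("hour", seconds // 3600 % 24),
        ("minute", seconds // 60 % 60),
    ]
    # Zero-valued units are skipped; singular and plural are distinguished.
    parts = [f"{count} {name}{'' if count == 1 else 's'}"
             for name, count in units if count]
    return "up " + ", ".join(parts)


offset = 200000000          # value passed to --boottime
real_uptime = 31 * 60       # assumed: about 31 minutes since boot
print(pretty_uptime(real_uptime + offset))
```

       The offset is added to the boot-time clock, so the uptime seen inside
       the namespace is the real uptime plus 200000000 seconds.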

   Creating a PID namespace with a properly mounted /proc filesystem
       In the following example, a shell is executed in new user, PID, and
       mount namespaces.  The --mount-proc option ensures a correctly
       mounted /proc filesystem for the new PID namespace, so that ps(1)
       shows correct output for the PID namespace.

           # PS1='ns2# ' nscreate -Urpm --mount-proc bash --norc
           ns2# ps a
               PID TTY      STAT   TIME COMMAND
                 1 pts/4    S      0:00 bash --norc
                 2 pts/4    R+     0:00 ps a

   Creating arbitrary UID and GID maps in a user namespace
       In the following commands, we take a local copy of the nscreate exe‐
       cutable and then assign suitable capabilities to that copy so that an
       unprivileged user can create a user namespace with arbitrary UID and
       GID maps.

           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ id -u; id -g
           1000
           1000
           $ ./nscreate -U --uid-map="0 1000 10, 10 2000 10" \
                           --gid-map="0 1000 10" bash --norc
           # id -u; id -g
           0
           0
           # cat /proc/self/uid_map
                    0       1000         10
                   10       2000         10
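
       Each line of a UID map holds three fields: the first ID inside the
       namespace, the first ID outside the namespace, and the length of
       the range.  As a minimal sketch of how such a map is read, the
       following helper (map_uid is a hypothetical name, not part of
       nscreate) translates a UID seen inside the namespace to the
       corresponding UID outside it.

```shell
# Hypothetical helper: translate a UID seen inside the namespace to the
# corresponding UID outside it.  Reads uid_map-style lines on stdin:
#   first-inside-ID  first-outside-ID  length
map_uid() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { if (!found) print "unmapped" }
    '
}

# The two-line map from the session above
printf '%s\n' '0 1000 10' '10 2000 10' | map_uid 12
```

       With the map shown, UID 12 falls in the second range and is
       reported as 2002.  Inside the namespace, the same helper could be
       fed from /proc/self/uid_map.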

   Demonstration of the effect of UID transitions on capabilities
       In the following shell session, which is executed as an unprivileged
       user (UID 1000), we make a local copy of the nscreate executable and
       assign it the necessary capabilities that allow an unprivileged user
       to create a user namespace with UID and GID maps that map a range of
       IDs.  We then twice use the local nscreate executable to create a
       user namespace that maps UIDs (and GIDs) 1000 to 1009 to 0 to 9
       inside the namespace.  Within the user namespace, we run the
       getpcaps(8) program, asking it to show its own capabilities.

           $ id -u
           1000
           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                           getpcaps 0
           Capabilities for `0': =ep
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                           --setuid 1 getpcaps 0
           Capabilities for `0': =

       Looking at the results of the two executions of nscreate, we see that
       in the first case getpcaps(8) shows that it has all permitted and
       effective capabilities.  In the second case, we used the --setuid
       option to switch the process's UIDs from 0 to 1 inside the user
       namespace, and as a result getpcaps(8) shows that it has no capabili‐
       ties (because switching UIDs from 0 to nonzero causes the process's
       permitted and effective capabilities to be cleared).

   Capabilities, UID transitions, and securebits
       The kernel automatically assigns the first process in a new user
       namespace a full set of permitted and effective capabilities.  How‐
       ever, if the process changes its user IDs to nonzero values (inside
       the user namespace), it loses all capabilities (see capabilities(7)).

       We can illustrate this by once again making a local copy of the
       nscreate program and assigning capabilities to the copy which allow
       the creation of a user namespace with arbitrary UID and GID mappings.
       Initially, the process has UID 0 inside the user namespace, but we
       then use the --setuid option to switch the process's user IDs to a
       nonzero value.  The --dump option shows us that the process has con‐
       sequently lost all capabilities:

           $ id -u
           1000
           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                           --setuid 1 --dump /bin/true
           eUID = 1;  eGID = 0
           capabilities: =

       A process can enable the SECBIT_NO_SETUID_FIXUP securebits flag so
       that its capabilities do not change when its user IDs transition
       between 0 and nonzero values.  We can illustrate this with the addi‐
       tion of the --secbits option to the preceding command:

           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                           --secbits=no_setuid_fixup \
                           --setuid 1 --dump /bin/true
           eUID = 1;  eGID = 0
           capabilities: =ep
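
       The securebits flags are individual bits in a per-thread mask (see
       capabilities(7)).  The following sketch decodes such a mask into
       flag names.  The bit values are those of <linux/securebits.h>; the
       helper name decode_secbits is hypothetical, and the "_LOCKED"
       companion bits are left out for brevity.

```shell
# Hypothetical helper: list the base securebits flags set in a mask.
# Bit values follow <linux/securebits.h>; "_LOCKED" bits are ignored.
decode_secbits() {
    out=
    if [ $(( $1 & 0x01 )) -ne 0 ]; then out="$out SECBIT_NOROOT"; fi
    if [ $(( $1 & 0x04 )) -ne 0 ]; then out="$out SECBIT_NO_SETUID_FIXUP"; fi
    if [ $(( $1 & 0x10 )) -ne 0 ]; then out="$out SECBIT_KEEP_CAPS"; fi
    if [ $(( $1 & 0x40 )) -ne 0 ]; then out="$out SECBIT_NO_CAP_AMBIENT_RAISE"; fi
    echo "${out# }"
}

# Mask with only the flag used in the preceding command
decode_secbits 0x04
```

       A mask could be obtained, for example, from
       prctl(PR_GET_SECUREBITS) in a small test program.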

   Ambient capabilities
       When a process that has nonzero UIDs executes a new program, its
       capabilities are transformed according to the rules described in
       capabilities(7).  In particular, if the process has capabilities,
       then in the usual case it will lose those capabilities during the
       execve(2).

       Using a local copy of nscreate with capabilities attached, we can see
       this behavior.  We create a user namespace that allows a range of
       UIDs.  The process in the user namespace switches its UID from 0 to
       1, after first setting the SECBIT_NO_SETUID_FIXUP securebits flag, so
       that the UID transition does not cause the process to lose capabili‐
       ties.  The --dump option allows us to verify that the process still
       has capabilities before it executes getpcaps(8).  However, after the
       execve(2), getpcaps(8) shows us that the process no longer has any
       capabilities.

           $ id -u
           1000
           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                   --secbits=no_setuid_fixup --setuid 1 --dump getpcaps 0
           eUID = 1;  eGID = 0
           capabilities: =ep
           Capabilities for `0': =

       The ambient capability set can be used to prevent a process losing
       its capabilities in the above scenario.  Before executing a new pro‐
       gram, the process can place any of its permitted capabilities into
       the ambient set, and those capabilities will (as per the transforma‐
       tion rules described in capabilities(7)) be assigned to the process's
       permitted and effective capabilities after the execve(2).  Adding the
       --make-caps-ambient option, which copies the process's existing per‐
       mitted capabilities to the inheritable and ambient sets, can be used
       to demonstrate this behavior:

           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                   --make-caps-ambient \
                   --secbits=no_setuid_fixup --setuid 1 --dump getpcaps 0
           eUID = 1;  eGID = 0
           capabilities: =eip
           Capabilities for `0': =eip

       The last line, output by getpcaps(8), shows us that the process has
       all capabilities in its permitted, effective, and inheritable sets.
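
       The rule for the new permitted set can be tried out with shell
       arithmetic.  The following is a minimal sketch, not part of
       nscreate: capability sets are modelled as small bit masks, ALL is
       an illustrative stand-in for a full capability set, and the
       function name new_permitted is hypothetical.  It assumes a process
       with nonzero UIDs and a file that is neither set-user-ID nor
       set-group-ID.

```shell
# Illustrative stand-in for "every capability"; not the real kernel mask.
ALL=0xff

# Hypothetical helper modelling the capabilities(7) rule for execve(2):
#   P'(permitted) = (P(inheritable) & F(inheritable)) |
#                   (F(permitted) & P(bounding)) | P'(ambient)
# Arguments: P(inheritable) P(ambient) P(bounding) F(inheritable) F(permitted)
# Assumes nonzero UIDs and a file that is not set-user-ID or set-group-ID.
new_permitted() {
    amb=$2
    # A file that carries capabilities is "privileged": ambient is cleared.
    if [ $(( $4 | $5 )) -ne 0 ]; then amb=0; fi
    printf '0x%x\n' $(( ($1 & $4) | ($5 & $3) | $amb ))
}

# getpcaps has no file capabilities, so both file sets are 0.
new_permitted 0    0    $ALL 0 0    # no ambient set: prints 0x0
new_permitted $ALL $ALL $ALL 0 0    # ambient set raised: prints 0xff
```

       The first call corresponds to the case where getpcaps(8) reports
       no capabilities, the second to the case where the permitted
       capabilities were first copied into the ambient set.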

   Capabilities, exec, and securebits
       As noted above, the kernel automatically assigns the first process in
       a new user namespace a full set of permitted and effective capabili‐
       ties.  Suppose that the process then drops capabilities, while
       retaining user ID 0 (inside the namespace).  If the process subse‐
       quently executes a new program, it once more regains all capabilities
       because it did an execve(2) with UID 0 (see capabilities(7)).  The
       following command illustrates this behavior:

           $ id -u
           1000
           $ nscreate -Ur --set-caps = --dump getpcaps 0
           eUID = 0;  eGID = 0
           capabilities: =
           Capabilities for `0': =ep

       The --dump option shows us that just before getpcaps(8) was
       executed, the process had no capabilities.  The last line of output,
       produced by getpcaps(8), shows us that the process once again has all
       capabilities.

       A process with user ID 0 can enable the SECBIT_NOROOT securebits flag
       so that its capabilities do not change when it executes a new pro‐
       gram.  To illustrate this, compare the output of the previous command
       with the output of the following command:

           $ nscreate -Ur --secbits=noroot --set-caps = --dump getpcaps 0
           eUID = 0;  eGID = 0
           capabilities: =
           Capabilities for `0': =

   User namespace "set-user-ID-root" programs
       Here, we demonstrate the operation of a per-user-namespace "set-user-
       ID-root" program (see capabilities(7)).  Again, we use a local copy
       of the nscreate executable that has capabilities assigned.  We then
       (as unprivileged user ID 1000) make a local copy of the getpcaps(8)
       program and make it a set-user-ID program (i.e., the program will
       cause the effective user ID of the executing process to switch to
       1000).  Finally, we use the local nscreate executable to create a
       user namespace that maps UIDs (and GIDs) 1000 to 1009 to 0 to 9
       inside the namespace, switch user ID to 1 inside the user namespace
       (which causes the process to lose all capabilities), and then execute
       the local copy of getpcaps(8).

           $ id -u
           1000
           $ cp $(which nscreate) .
           $ sudo setcap cap_setuid,cap_setgid=pe nscreate
           $ getcap nscreate     # Verify capabilities of local nscreate
           nscreate = cap_setgid,cap_setuid+ep
           $ cp $(which getpcaps) .
           $ chmod u+s getpcaps
           $ ls -ln getpcaps       # Verify ownership and permissions
           -rwsr-xr-x. 1 1000 1000 15992 Jun 10 08:24 getpcaps
           $ ./nscreate -U --uid-map='0 1000 10' --gid-map='0 1000 10' \
                           --setuid 1 ./getpcaps 0
           Capabilities for `0': =ep

       Looking at the above output, we see that the process running getpcaps
       has all capabilities (inside the user namespace).  This happened
       because the set-user-ID getpcaps executable caused the process's
       effective user ID to switch to 1000, which maps to UID 0 inside the
       user namespace, and as a result the process gained all permitted and
       effective capabilities.

AUTHORS

       Michael Kerrisk ⟨mtk.manpages@gmail.com⟩

SEE ALSO

       findmnt(1), nsenter(1), setpriv(1), unshare(1), clone(2), unshare(2),
       namespaces(7), mount(8)

COLOPHON

       This page is part of the util-linux (a random collection of Linux
       utilities) project.  Information about the project can be found at 
       ⟨https://www.kernel.org/pub/linux/utils/util-linux/⟩.  If you have a
       bug report for this manual page, send it to
       util-linux@vger.kernel.org.  This page was obtained from the
       project's upstream Git repository
       ⟨git://git.kernel.org/pub/scm/utils/util-linux/util-linux.git⟩ on
       2020-07-14.  (At that time, the date of the most recent commit that
       was found in the repository was 2020-07-14.)  If you discover any
       rendering problems in this HTML version of the page, or you believe
       there is a better or more up-to-date source for the page, or you have
       corrections or improvements to the information in this COLOPHON
       (which is not part of the original manual page), send a mail to
       man-pages@man7.org.

secisol-tools                    2020-07-01                      NSCREATE(1)