
Kernel Capabilities


To inspect the innards of a Linux system and how they relate to containers in practice, we need to look a little more closely at kernel capabilities. The kernel is important here because, even before other security hardening techniques were introduced in later versions, Docker allowed (and still allows) users to disable certain features and to open up specific, otherwise locked-down, kernel permissions.

You can find out about Linux kernel capabilities by using the command $ man capabilities (or by visiting man7.org/linux/man-pages/man7/capabilities.7.html).

The manual explains that capabilities offer a Linux system the ability to run permission checks against each system call (commonly called a syscall) that is sent to the kernel. Syscalls are used whenever a process requests a system resource from the kernel; that could involve access to a file, memory, or another process, among many other things. The manual explains that during the usual run of events on traditional Unix-like systems, there are two categories of processes: privileged processes (belonging to the root user) and unprivileged processes (which don't belong to the root user). According to the Kernel Development site (lwn.net/1999/1202/kernel.php3), kernel capabilities were introduced in 1999 via the v2.1 kernel. Using kernel capabilities, it is possible to finely tune how much system access a process can get without being the root user.
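You can see the capability sets granted to a running process by reading its status file under /proc. As a quick sketch (assuming the libcap utilities, which provide the capsh command, are installed; the exact bitmask will vary by kernel version), the following commands read the current shell's effective capability set and decode it into human-readable names:

$ grep CapEff /proc/$$/status
CapEff: 0000003fffffffff

$ capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,[…snip…],cap_audit_read

An unprivileged shell would instead report a CapEff value of all zeros, which is the point of the mechanism: capabilities let you grant a process just the slivers of root-level power it actually needs.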

By contrast, cgroups, or control groups, were introduced into the kernel after being designed by Google engineers in 2006 to enforce quotas for system resources including RAM and CPU; such limitations are also of great benefit to the security of a system when it is sliced into smaller pieces to run containers.
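In Docker those cgroup quotas surface as simple runtime flags. As an illustrative sketch (the container name and the limit values here are arbitrary), the first command caps a container at half a CPU core and 256MB of RAM, so a runaway or compromised workload cannot starve its neighbors, and the second command confirms the limits in force:

$ docker run -d --rm --name capped --cpus 0.5 --memory 256m httpd:latest
$ docker stats --no-stream capped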

The problem that kernel capabilities addressed was that privileged processes bypass all kernel permission checks, while all nonroot processes are run through security checks that involve monitoring the user ID (UID), group ID (GID), and any other groups the user is a member of (known as supplementary groups). The checks that are performed on processes are made against what is called the effective UID of the process. In other words, imagine that you have just logged in as a nonroot user chris and then elevated to become the root user with an su - command. Your “real UID” (your login user) remains the same; but after you elevate to become the superuser, your “effective UID” is now 0, the UID for the root user. This is an important concept to understand for security, because security controls need to track both UIDs throughout their lifecycle on a system. Clearly you don't want a security application telling you that the root user is attacking your system; instead, you need to know the “real UID,” the login user chris in this example, that elevated to become the root user. If you are ever doing work within a container for testing and changing the USER instruction in the Dockerfile that created the container image, then the id command is a helpful tool, offering output such as this so you can find out exactly which user you currently are:

uid=0(root) gid=0(root) groups=0(root)
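As a rough illustration of why both identities matter (assuming the user chris has UID 1000 and the distribution records an audit login UID via pam_loginuid, as most do), compare the effective UID with the kernel's record of who actually logged in, before and after elevating with su -:

chris@host:~$ id -u
1000
chris@host:~$ su -
root@host:~# id -u
0
root@host:~# cat /proc/self/loginuid
1000

The effective UID has become 0, but the login UID still points back to chris, which is exactly the trail a security tool needs to follow.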

Even with other security controls used within a Linux system running containers, such as namespaces that segregate access between pods in Kubernetes and OpenShift or containers within a runtime, it is highly advisable never to run a container as the root user. A typical Dockerfile that prevents the root user from running within the container might be created as shown in Listing 1.1.

Listing 1.1: A Simple Example Dockerfile of How to Spawn a Container as Nonroot

FROM debian:stable
USER root
RUN apt-get update && apt-get install -y iftop && apt-get clean
USER nobody
CMD bash

In Listing 1.1, the second line explicitly states that the root user is initially used to install the packages into the container image, and then the nobody user actually executes the final command. The USER root line isn't needed if you build the container image as the root user, but it is added here to demonstrate clearly the change in responsibilities between each USER instruction.

Once an image is built from that Dockerfile, when that image is spawned as a container, it will run as the nobody user, with the predictable UID and GID of 65534 on Debian derivatives or UID/GID 99 on Red Hat Enterprise Linux derivatives. These UIDs or usernames are useful to remember so that you can check that the permissions within your containers are set up to suit your needs. You might need them to mount a storage volume with the correct permissions, for example.
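To check that behavior for yourself, a quick sketch (assuming Listing 1.1 has been saved as a file named Dockerfile in the current directory, and using an arbitrary image tag) looks like this, with output along these lines on a Debian base image:

$ docker build -t nonroot-test .
$ docker run --rm nonroot-test id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)

Any files the resulting container needs to write to, such as a mounted volume, must therefore be accessible to that UID.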

Now that we have covered some of the theory, we'll move on to a more hands-on approach to demonstrate the components of how a container is constructed. In our case we will not use the dreaded --privileged option, which to all intents and purposes gives a container root permissions. Docker offers the following useful security documentation about privileges and kernel capabilities, which is worth a read to help with greater clarity in this area:

docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities

The docs describe Privileged mode as essentially enabling “…access to all devices on the host as well as [having the ability to] set some configuration in AppArmor or SELinux to allow the container nearly all the same access to the host as processes running outside containers on the host.” In other words, you should rarely, if ever, use this switch on your container command line. It is simply the least secure and laziest approach, widely abused when developers cannot get features to work. Avoiding it might mean, for example, that a volume mounted from a container onto a host's directory needs carefully tightened permissions, which takes more effort but achieves a more secure outcome. Rest assured, whichever way you approach the problem, there will be a possible solution using specific kernel capabilities, potentially coupled with other mechanisms, which means that you don't have to open the floodgates and use Privileged mode.
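As a sketch of that more surgical approach (the container name is arbitrary, and the exact capability list a given image needs will vary), you can drop every capability and then add back only what the workload requires. The official httpd image, for instance, needs to bind a low port and switch from root to the daemon user, which suggests something along these lines rather than --privileged:

$ docker run -d --rm --name apache-locked \
    --cap-drop ALL \
    --cap-add NET_BIND_SERVICE --cap-add SETUID --cap-add SETGID \
    -p 443:443 httpd:latest

If the workload later turns out to need one more specific capability, add that single capability rather than reaching for the --privileged switch.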

For our example, we will choose two of the most powerful kernel capabilities to demonstrate what a container looks like, from the inside out. They are CAP_SYS_ADMIN and CAP_NET_ADMIN (commonly abbreviated without CAP_ in Docker and kernel parlance).

The first of these enables a container to run a number of sysadmin commands and control a system in ways that normally only the root user could. The second is similarly powerful but applies to manipulating the host's and the container's network stacks. In the Linux manual page (man7.org/linux/man-pages/man7/capabilities.7.html) you can see exactly what each of these --cap-add settings grants within Docker.

From that web page we can see that Network Admin (CAP_NET_ADMIN) includes the following (a short demonstration follows the list):

 Interface configuration

 Administration of IP firewall

 Modifying routing tables

 Binding to any address for proxying

 Switching on promiscuous mode

 Enabling multicasting
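To see what that list means in practice, here is a small experiment (alpine is used purely because its busybox build ships an ip applet; the exact error text may differ slightly). Without CAP_NET_ADMIN the kernel refuses even a trivial interface change, while adding the capability lets the same command succeed silently:

$ docker run --rm alpine ip link set lo down
ip: RTNETLINK answers: Operation not permitted

$ docker run --rm --cap-add net_admin alpine ip link set lo down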

We will start our look at a container's internal components by running this command:

$ docker run -d --rm --name apache -p443:443 httpd:latest

We can now check that TCP port 443 is available from our Apache container (Apache is also known as httpd) and that the default port, TCP port 80, has been exposed, like so:

$ docker ps
IMAGE     COMMAND              CREATED          STATUS    PORTS                  NAMES
httpd     "httpd-foreground"   36 seconds ago   Up 33s    80/tcp, 443->443/tcp   apache

Having seen the slightly redacted output from that command, we will now use a second container (running Debian Linux) to look inside our first container with the following command, which elevates permissions available to the container using the two kernel capabilities that we just looked at:

$ docker run --rm -it --name debian --pid=container:apache \
    --net=container:apache --cap-add sys_admin debian:latest

We will come back to the contents of that command, which started a Debian container, in a moment. Now that we're running a Bash shell inside our Debian container, let's see what processes the container is running, by installing the procps package (which provides the ps command):

root@0237e1ebcc85:/# apt update; apt install procps -y
root@0237e1ebcc85:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:17 ?        00:00:00 httpd -DFOREGROUND
daemon       9     1  0 15:17 ?        00:00:00 httpd -DFOREGROUND
daemon      10     1  0 15:17 ?        00:00:00 httpd -DFOREGROUND
daemon      11     1  0 15:17 ?        00:00:00 httpd -DFOREGROUND
root        93     0  0 15:45 pts/0    00:00:00 bash
root       670    93  0 15:51 pts/0    00:00:00 ps -ef

We can see from the ps command's output that bash and ps -ef processes are present, but additionally several Apache web server processes are also shown as httpd. Why can we see them when they should be hidden? They are visible thanks to the following switch on the run command for the Debian container:

--pid=container:apache

In other words, we have full access to the apache container's process table from inside the Debian container.
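For contrast, a container started without that switch lives in its own PID namespace and sees almost nothing; a rough comparison (busybox is used here only because it bundles a ps applet) looks like this:

$ docker run --rm busybox ps
PID   USER     TIME  COMMAND
    1 root      0:00 ps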

Now try the following commands to see if we have access to the filesystem of the apache container:

root@0237e1ebcc85:/# cd /proc/1/root
root@0237e1ebcc85:/proc/1/root# ls
bin   boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

There is nothing too unusual in that directory listing. However, you might be surprised to read that what we can see is actually the top level of the Apache container's filesystem and not the Debian container's. Proof of this can be found by using this path in the following ls command:

root@0237e1ebcc85:/proc/1/root# ls usr/local/apache2/htdocs
usr/local/apache2/htdocs/index.html

As suspected, there's an HTML file sitting within the apache2 directory:

root@0237e1ebcc85:/proc/1/root# cat usr/local/apache2/htdocs/index.html
<html><body><h1>It works!</h1></body></html>

We have proven that we have visibility of the Apache container's process table and its filesystem. Next, we will see what access this switch offers us: --net=container:apache.

Still inside the Debian container we will run this command:

root@0237e1ebcc85:/proc/1/root# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

The slightly abbreviated output from the ip a command offers us two network interfaces, lo for the loopback interface and eth0, which has the IP address 172.17.0.2/16.

Let's exit the Debian container by pressing Ctrl+D and return to our normal system prompt to run a quick test. We named the container apache, so using the following inspect command we can view the end of the output to get the IP address for the Apache container:

$ docker inspect apache | tail -20

Listing 1.2 shows slightly abbreviated output from that command, and lo and behold, in the IPAddress field we can see the same IP address we saw from within the Debian container a moment ago: "IPAddress": "172.17.0.2".

Listing 1.2: The External View of the Apache Container's Network Stack

"Networks": { "bridge": { "IPAMConfig": null, "Links": null, "Aliases": null, "NetworkID": […snip…] "Gateway": "172.17.0.1", "IPAddress": "172.17.0.2", "IPPrefixLen": 16, "IPv6Gateway": "", "GlobalIPv6Address": "", "GlobalIPv6PrefixLen": 0, "MacAddress": "02:42:ac:11:00:02", "DriverOpts": null } } } } ]

Head back into the Debian container now with the same command as earlier, shown here:

$ docker run --rm -it --name debian --pid=container:apache \
    --net=container:apache --cap-add sys_admin debian:latest

To prove that the networking is fully passed across to the Debian container from the Apache container, we will install the curl command inside the container:

root@0237e1ebcc85:/# apt update; apt install curl -y

After a little patience (if you've stopped and restarted the Debian container, you'll need to run apt update again before installing curl; otherwise, you can ignore that step), we can now check what the intertwined network stack means from an internal container perspective with this command:

root@0237e1ebcc85:/# curl -v http://localhost:80
<html><body><h1>It works!</h1></body></html>

And, not straight from the filesystem this time but served over the network using TCP port 80, we see the HTML file saying, “It works!”

As we have been able to demonstrate, a Linux system does not need much encouragement to offer visibility between containers and across all the major components of a container. These examples should offer an insight into how containers reside on a host and how easy it is to potentially open security holes between containerized workloads.

Again, because containers are definitely not the same as virtual machines, security differs greatly and needs close attention. If a container is run with excessive privileges or punches holes through the security protection offered by kernel capabilities, then not only are other containers at serious risk but the host machine itself is too. A sample of the key concerns around a “container escape,” where it is possible to “break out” past a host's relatively standard security controls, includes the following:

 Disrupting services on any or all containers on the host, causing outages

 Attacking the underlying host to cause a denial of service by triggering a stress event with a view to exhausting available resources, whether that is RAM, CPU, disk capacity, or I/O, for example

 Deleting data on any locally mounted volumes directly on the host machine, or wiping critical host directories and causing system failure

 Embedding processes on a host that may act as a form of advanced persistent threat (APT), which could lie dormant for a period of time before being taken advantage of at a later date
