LXC 1.0: Security features [6/10]

This is post 6 out of 10 in the LXC 1.0 blog post series.

When talking about container security, most people consider containers either inherently insecure or inherently secure. The reality isn’t so black and white, and LXC supports a variety of technologies to mitigate most security concerns.

One thing to clarify right from the start is that you won’t hear any of the LXC maintainers tell you that LXC is secure so long as you use privileged containers. However, at least in Ubuntu, our default containers ship with what we think is a pretty good configuration: restricted cgroup access combined with an extensive apparmor profile which prevents all the attacks that we are aware of.

Below I’ll be covering the various technologies LXC supports to let you restrict what a container may do. Just keep in mind that unless you are using unprivileged containers, you shouldn’t give root access to a container to someone whom you’d mind having root access to your host.

Capabilities

The first security feature which was added to LXC was Linux capabilities support. With that feature you can set a list of capabilities that you want LXC to drop before starting the container or a full list of capabilities to retain (all others will be dropped).

The two relevant configuration options are:

  • lxc.cap.drop
  • lxc.cap.keep

Both are lists of capability names as listed in capabilities(7).

This may sound like a great way to make containers safe, and for very specific cases it may be. However, if you are running a system container, you’ll soon notice that dropping sys_admin and net_admin isn’t very practical, and short of dropping those you won’t make your container much safer (as root in the container will be able to re-grant itself any dropped capability).

In Ubuntu we use lxc.cap.drop to drop sys_module, mac_admin, mac_override and sys_time, which prevents some known problems at container boot time.
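
As a rough sketch, the Ubuntu defaults described above would look something like this in a container’s configuration (the exact line Ubuntu ships may differ). Capability names are written in lower case without the CAP_ prefix. The commented-out lxc.cap.keep line is a purely hypothetical alternative; its capability list is only an illustration, not a recommended set.

```
# Drop the four capabilities mentioned above before the container starts
lxc.cap.drop = sys_module mac_admin mac_override sys_time

# Hypothetical alternative: keep only the listed capabilities and drop all others
# (illustrative list, unlikely to be enough for a full system container)
# lxc.cap.keep = chown setuid setgid net_bind_service
```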

Control groups

Control groups are interesting because they achieve multiple things which, while interconnected, are still pretty different:

  • Resource bean counting
  • Resource quotas
  • Access restrictions

The first two aren’t really security related, though resource quotas will let you avoid some obvious DoS of the host (by setting memory, cpu and I/O limits).

The last is mostly about the devices cgroup which lets you define which character and block devices a container may access and what it can do with them (you can restrict creation, read access and write access for each major/minor combination).

In LXC, configuring cgroups is done with the “lxc.cgroup.*” options which can roughly be defined as: lxc.cgroup.<controller>.<key> = <value>

For example, to set a memory limit on p1 you’d add the following to its configuration:

lxc.cgroup.memory.limit_in_bytes = 134217728

This sets a memory limit of 128MB (the value is in bytes: 128 × 1024 × 1024) and is equivalent to writing that same value to /sys/fs/cgroup/memory/lxc/p1/memory.limit_in_bytes
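
The same pattern applies to the other controllers. The sketch below assumes the cpu, cpuset and blkio controllers are available on the host; the values are arbitrary examples rather than recommendations.

```
# Relative CPU weight (the kernel default is 1024)
lxc.cgroup.cpu.shares = 512
# Only allow the container to run on CPUs 0 and 1
lxc.cgroup.cpuset.cpus = 0-1
# Relative block I/O weight (only honoured by some I/O schedulers)
lxc.cgroup.blkio.weight = 500
```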

Most LXC templates only set a few devices controller entries by default:

# Default cgroup limits
lxc.cgroup.devices.deny = a
## Allow any mknod (but not using the node)
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = b *:* m
## /dev/null and zero
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
## consoles
lxc.cgroup.devices.allow = c 5:0 rwm
lxc.cgroup.devices.allow = c 5:1 rwm
## /dev/{,u}random
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 1:9 rwm
## /dev/pts/*
lxc.cgroup.devices.allow = c 5:2 rwm
lxc.cgroup.devices.allow = c 136:* rwm
## rtc
lxc.cgroup.devices.allow = c 254:0 rm
## fuse
lxc.cgroup.devices.allow = c 10:229 rwm
## tun
lxc.cgroup.devices.allow = c 10:200 rwm
## full
lxc.cgroup.devices.allow = c 1:7 rwm
## hpet
lxc.cgroup.devices.allow = c 10:228 rwm
## kvm
lxc.cgroup.devices.allow = c 10:232 rwm

This configuration allows the container (usually udev) to create any device node it wishes (that’s the wildcard “m” entries above) but blocks everything else (the “a” deny entry) unless it’s listed in one of the allow entries that follow. This covers everything a container will typically need to function.
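
If a container needs a device that isn’t covered by these defaults, another allow entry can be appended after them. As an illustration only, assuming the host’s first SCSI/SATA disk shows up as block device 8:0 (check the major and minor numbers with “ls -l /dev/sda” on your own host), read-only access could be granted like this:

```
## first SCSI/SATA disk, read-only (major:minor is an assumption about the host)
lxc.cgroup.devices.allow = b 8:0 r
```

The device node itself still has to exist inside the container, and handing a raw host disk to a container is of course something to do with care.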

You will find reasonably up to date documentation about the available controllers, control files and supported values at:
https://www.kernel.org/doc/Documentation/cgroups/

Apparmor

A little while back we added Apparmor profiles support to LXC.
The Apparmor support is rather simple: there’s one configuration option, “lxc.aa_profile”, which sets which apparmor profile to use for the container.

LXC will then set up the container and ask apparmor to switch it to that profile right before starting the container. Ubuntu’s LXC profile is rather complex as it aims to prevent any of the known ways of escaping a container or causing harm to the host.

As things are today, Ubuntu ships with 3 apparmor profiles which, together with the special value “unconfined”, means that the supported values for lxc.aa_profile are:

  • lxc-container-default (default value if lxc.aa_profile isn’t set)
  • lxc-container-default-with-nesting (same as default but allows some needed bits for nested containers)
  • lxc-container-default-with-mounting (same as default but allows mounting ext*, xfs and btrfs file systems)
  • unconfined (a special value which will disable apparmor support for the container)

You can also define your own profile by copying one of the profiles in /etc/apparmor.d/lxc/, adding the bits you want, giving it a unique name, then reloading apparmor with “sudo /etc/init.d/apparmor reload” and finally setting lxc.aa_profile to the new profile’s name.
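
For instance, a container that is meant to run nested containers could select the second profile from the list above. The commented-out line shows the same thing for a custom profile; “lxc-container-custom” is a made-up name used purely for illustration.

```
# Use the Ubuntu profile that allows nesting
lxc.aa_profile = lxc-container-default-with-nesting

# Hypothetical custom profile, defined in a file under /etc/apparmor.d/lxc/
# lxc.aa_profile = lxc-container-custom
```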

SELinux

The SELinux support is very similar to Apparmor’s. An SELinux context can be set using “lxc.se_context”.

An example would be:

lxc.se_context = unconfined_u:unconfined_r:lxc_t:s0-s0:c0.c1023

Similarly to Apparmor, LXC will switch to the new SELinux context right before starting init in the container. As far as I know, no distributions are setting a default SELinux context at this time, however most distributions build LXC with SELinux support (including Ubuntu, should someone choose to boot their host with SELinux rather than Apparmor).

Seccomp

Seccomp is a fairly recent kernel mechanism which allows for filtering of system calls.
As a user you can write a seccomp policy file and set it using “lxc.seccomp” in the container’s configuration. As always, this policy will only be applied to the running container and will either allow a syscall or reject it with a pre-defined return value.

An example (though limited and useless) of a seccomp policy file would be:

1
whitelist
103

This would only allow syscall #103 (syslog) in the container and reject everything else.

Note that seccomp is a rather low level feature and only useful for some very specific use cases. All syscalls have to be referred to by their ID instead of their name, and those IDs may change between architectures. Also, as things are today, if your host is 64bit and you load a seccomp policy file, all 32bit syscalls will be rejected. We’d need per-personality seccomp profiles to solve that but it’s not been a high priority so far.
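
Putting this together, and assuming the three-line policy above was saved as /var/lib/lxc/p1/seccomp.policy (a hypothetical path; any file that LXC can read should do), the container’s configuration would reference it like this:

```
# Load the seccomp policy file when the container starts
lxc.seccomp = /var/lib/lxc/p1/seccomp.policy
```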

User namespace

And last but not least comes what’s probably the only way of making a container actually safe: user namespaces, which LXC now supports. I’ll go into more detail on how to use that feature in a later blog post but, simply put, LXC is no longer running as root, so even if an attacker manages to escape the container, he’d find himself with the privileges of a regular user on the host.

All this is achieved by assigning ranges of uids and gids to existing users. Those users on the host will then be allowed to clone a new user namespace in which all uids/gids are mapped to uids/gids that are part of the user’s range.

This obviously means that you need to allocate a rather silly amount of uids and gids to each user who’ll be using LXC in that way. In a perfect world, you’d allocate 65536 uids and gids per container and per user. As this would likely exhaust the whole uid/gid range rather quickly on some systems, I tend to go with “just” 65536 uids and gids per user that’ll use LXC and then have the same range shared by all containers.
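
As a small preview, and only as a sketch, a shared range of 65536 ids could be expressed with the lxc.id_map key. The host-side starting id of 100000 is an assumption for illustration; the real value depends on the range allocated to your user.

```
# Map container uids 0-65535 to host uids 100000-165535
lxc.id_map = u 0 100000 65536
# Map container gids 0-65535 to host gids 100000-165535
lxc.id_map = g 0 100000 65536
```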

Anyway, that’s enough detail about user namespaces for now. I’ll cover how to actually set that up and use those unprivileged containers in the next post.

About Stéphane Graber

Project leader of Linux Containers, Linux hacker, Ubuntu core developer, conference organizer and speaker.

20 Responses to LXC 1.0: Security features [6/10]

  1. Eric Du says:

    A question about security: I have already configured unattended security updates on the host (Ubuntu 12.04 LTS, will transition to 14.04). Is it still necessary to do security updates for every LXC container on the host?

    Thanks very much for the LXC blog post series, it’s really awesome.

    1. @Eric Du: Yes, updating LXC containers is just as important if you do not want the containers compromised.

  2. kahamedr says:

    There is no rule as such that says which device types may be accessed in a container, right?
    According to the lxc.cgroup.devices.allow syntax, I can even add a block device and have it accessed directly from my container, for example a cdrom or a storage device? Thanks!

  3. esokolov says:

    How can I put an LXC container’s cgroup into another cgroup?
    I want to set a group limits for several lxc-containers.
    Is it possible?

    1. It’s not possible right now. It’s something we’d first need to add support for in liblxc before we can have LXD make use of it.
      We will most likely look into this as we work on multi-tenancy in LXD in the near future.

  4. Abhijit Sahu says:

    As per above given example (copied again below), If “Seccomp” is enabled for a container:

    “An example (though limited and useless) of a seccomp policy file would be:
    1
    whitelist
    103
    Which would only allow syscall #103 (syslog) in the container and reject everything else.”

    What does this apply to?
    Consider two containers called C1 and C2.
    Container C1 has processes P1 and P2.
    Container C2 has processes P3 and P4.

    Assumption:
    In C1 I have enabled seccomp with a policy that allows only syscall #103 (syslog). Which option is valid/true, and why?

    Option A:
    P1 and P2 are allowed to perform only syscall #103 (syslog), and cannot make any other system call?

    Option B:
    When P3 and P4 (which belong to C2) try to access container C1, are P3 and P4 then allowed to perform only syscall #103 (syslog), and unable to make any other system call?

  5. Emmanuel Deloget says:

    Hi,

    I’m trying to understand the underlying reason why sys_time and other capabilities are dropped. You say

    “In Ubuntu we use lxc.cap.drop to drop sys_module, mac_admin, mac_override, sys_time which prevent some known problems at container boot time.”

    Yet I cannot find any reference on the exact problem it causes. Do you have any pointer on that subject please ?

    Thanks !

    1. If sys_module was kept, a privileged container would be able to load kernel modules, escaping confinement.

      If sys_time was kept, the container would be able to alter the system clock of the host.

      If mac_override or mac_admin are kept, the container would be able to modify AppArmor and SELinux profiles to disable its own confinement or modify the confinement of processes on the host.

      1. Emmanuel Deloget says:

        Ok. Thanks for the response 🙂

        In turn, that means that any ntpclient has to run outside a container, meaning that if some day it’s abused it might give access to the whole server.

        Is this a use case (a container whose role is to sync the date/time of the hardware) that has been overlooked? Or is there a good, like really good security reason to avoid that? (this is part of a better discussion about sensible defaults and configuration possibilities :))

  6. Ganesh Sathyanarayanan says:

    I have a privileged container. I created the container as root on my device and start it as root.
    I have been trying to ‘expose’ the /dev/mem device to my container because some of the applications I run there need them.
    However, I am unable to do so. I always end up with an “Operation not permitted” error when I try to open /dev/mem. The following are the different things I tried:
    1) lxc-cgroup.devices.allow = c 1 1 (and doing a mknod /dev/mem c 1 1) on the container
    2) lxc-device -n — add /dev/mem (this causes /dev/mem to appear in the container without having to run any extra commands such as mknod in the container. But opening still fails)
    3) lxc.aa_profile = unconfined (along with steps 1 & 2)
    Please advise what I can do to making /dev/mem accessible in lxc

  7. Michal says:

    Hey,
    I’m trying to run LXC with SELinux on CentOS 7. I installed LXC and created a container with the simple command lxc-create -n test -t centos. After that I just added to
    /var/lib/lxc/test/config
    a line like the one below:
    lxc.selinux.context = system_u:system_r:lxc_t:s0:c22
    or your entry:
    lxc.se_context = unconfined_u:unconfined_r:lxc_t:s0-s0:c0.c1023
    Doesn’t matter which, issue is the same.
    When I try to run it, I’m getting this error:

    [root@Centos test]# lxc-start -n test
    lxc-start: confile.c: parse_line: 1750 unknown key lxc.selinux.context
    lxc-start: parse.c: lxc_file_for_each_line: 57 Failed to parse config: lxc.selinux.context = system_u:system_r:lxc_t:s0:c22

    lxc-start: lxc_start.c: main: 268 Failed to create lxc_container

    I have sent mail to the mailing list, written on GitHub and forums, and asked on freenode, but I haven’t got any answer. There isn’t any good article or manual to help me fix it. Can someone help me resolve this problem?
