
Unable to `mount` overlayfs in Docker container when inside a LXC with a ZFS pool

Summary/Context

I'm currently working to improve performance and throughput of our automation infrastructure, most of which is a combination of Bash/Shell scripts, Python scripts, Docker, Jenkins, etc. We use Yocto to build embedded Linux distributions for specialized hardware and we have a Docker image to define/run our build environment/process.

Because of how our Docker containers work, using the -v option to bind-mount host directories into the container, there are race conditions whenever you want to run parallel jobs. To help remedy this, I'm using a Bash script to automate the setup of an overlay file system. That allows me to transparently present the environment to the Docker containers in the way they expect it to be, without them "realizing" that there are overlays underneath to prevent actual data races.

This was tested on several Linux systems, including my laptop (Ubuntu 20.04) and several virtual machines (Ubuntu 20.04), without issues. However, I noticed that when the Docker containers run inside a LXC-based system container (using Ubuntu 20.04 and an ext4 file system), the mount command executed from inside the Docker container fails. (The Docker image runs Ubuntu 14.04.5 LTS.)

The question boils down to this: How can I successfully run the mount command from within a Docker container, that is running within a LXC-based container, so that the Docker container itself can set up and use the overlay filesystem?

Details

One of the servers I manage hosts several Jenkins nodes inside LXC-based system containers. All of the LXC-based Jenkins nodes are running Ubuntu 20.04 LTS, exist within the same ZFS Pool, and are kept up-to-date. (For environment details, please see the end of this post.)

The overlay setup step was written to execute as part of the "startup" process when the Docker container is launched. The launch command looks basically as follows (with some actual data omitted or replaced by <placeholders>):

$ docker run --rm -it --privileged \
    -e USER_MOUNT_CACHE_OVERLAY=1 \
    -v <host work directory path>:/home/workdir \
    -v <host Git repositories path>:/home/localRepos \
    <image>:<tag> \
    bash

The Bash script uses mktemp to create the (work and read-write) directories that will be used for the overlay. A manual example of the mount command being used is:

$ sudo mount -t overlay sstate-cache \
    -o lowerdir=sstate-cache,upperdir=overlayfs/cache-rw,workdir=overlayfs/cache-work \
    sstate-cache
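The setup step described above could be sketched roughly as follows. This is a minimal sketch, not the actual script: the function names overlay_opts and overlay_mount_cmd and the cache-rw.XXXXXX / cache-work.XXXXXX templates are made up for illustration. It only prints the mount command instead of running it, so it can be inspected without root. Both directories are created under one base directory because overlayfs requires upperdir and workdir to live on the same file system.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and directory templates are hypothetical.

# Print the -o option string for an overlay mount.
overlay_opts() {
    local lower=$1 upper=$2 work=$3
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
}

# Create the read-write (upper) and work directories with mktemp under
# BASE, then print the mount command for TARGET without running it.
overlay_mount_cmd() {
    local target=$1 base=$2 upper work
    mkdir -p "$base" || return 1
    upper=$(mktemp -d "$base/cache-rw.XXXXXX") || return 1
    work=$(mktemp -d "$base/cache-work.XXXXXX") || return 1
    printf 'mount -t overlay %s -o %s %s\n' \
        "$target" "$(overlay_opts "$target" "$upper" "$work")" "$target"
}
```

For example, overlay_mount_cmd sstate-cache overlayfs prints a command of the same shape as the manual one above, with the two temporary directory names filled in. The printed command would still need to be run with sudo inside a --privileged container.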

When I do this on my laptop or any other non-LXC node, everything works fine. However, when the Docker container running the mount command exists inside a LXC node, this error shows up:

mount: wrong fs type, bad option, bad superblock on /home/workdir/sstate-cache,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

The exit code returned is 32, which the mount man page documents simply as "mount failure". Nothing in /var/log/syslog seems relevant. The dmesg command shows these messages:

overlayfs: filesystem on '/var/lib/docker/check-overlayfs-support309123547/upper' not supported as upperdir
overlayfs: filesystem on '/var/lib/docker/check-overlayfs-support422513534/upper' not supported as upperdir
[...]
overlayfs: filesystem on 'overlayfs/cache-rw' not supported as upperdir

I've been trying to fix this since last week, but I still have no idea why this error would show up nor how to fix it. Many search results have not been relevant to my specific case.

Some Things I've Tried

I found that Docker required the --privileged option in order to allow the mount command to work, which is why it's there. This fixed the original mount issue on my laptop and other VMs. (For the LXC nodes, it simply prevents a Docker crash; you'd see Go stack traces otherwise.)

But LXD/LXC has its own security options. I had already set security.nesting to true a few years back to let Docker containers run; this has not been an issue. I tried making the LXC container itself privileged with:

$ lxc config set <node> security.privileged true

where <node> is the name of the LXC node, but it made no difference. Note that replacing/destroying the ZFS Pool and/or LXC itself are not valid options.

Remarks (Could Be Wrong)

While the file system of the LXC-based node is ext4, as can be confirmed by looking at the filesystem table

$ cat /etc/fstab
LABEL=cloudimg-rootfs   /        ext4   defaults        0 0

the entire LXC-based container is stored in a ZFS Pool. A few years ago, I had enabled ZFS Compression in the physical host, which should've been completely transparent not only to the LXCs, but also the Docker containers. However, I observed issues with the du command, where it would calculate incorrect disk usage results, which then caused other parts of our build process to fail.

While I can't be certain (I have no way to test or verify this), and however unlikely it may be, I've wondered whether some other ZFS options could be affecting this. To me, it seems more likely that existing LXC options might do the trick, but I'm not sure which ones those could be.

I already took a look at this question elsewhere, but I've not found any similar error messages.

Environment Details (Host, LXCs, Docker)

Operating System (Physical Host)

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:        20.04
Codename:       focal

ZFS Version (Physical Host)

$ zfs -V
zfs-0.8.3-1ubuntu12.4
zfs-kmod-0.8.3-1ubuntu12.4

LXC Version (Physical Host, Snap Package)

$ lxc --version
4.7

Docker Version (Inside LXC, Jenkins Node)

$ docker --version
Docker version 19.03.12, build 48a66213fe

Operating System (Inside Docker Container)

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty

3 comments

If I set LXC's security.privileged = true, launching a --privileged docker container results in this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:415: setting cgroup config for procHooks process caused \\\"failed to write \\\\\\\"a *:* rwm\\\\\\\" to \\\\\\\"/sys/fs/cgroup/devices/docker/<HASH SNIP>/devices.allow\\\\\\\": write /sys/fs/cgroup/devices/docker/<HASH SNIP>/devices.allow: operation not permitted\\\"\"": unknown. ghost-in-the-zsh‭ 14 days ago

Also, the LXC configs already had security.nesting set to true. ghost-in-the-zsh 14 days ago

1 answer

Summary

The TL;DR is that, as long as ZFS is the underlying file system, overlay mounts on top of it will not work. It's simply not supported. I was also able to confirm this over email with Ubuntu/LXD developer Stéphane Graber, who said, in part:

overlay doesn't work on top of ZFS. This isn't a permission or a container issue, it's a filesystem issue.

I've outlined two possible solutions below in more detail. Which one works for you may depend on your setup, as was my case, so read carefully.

Jenkins Workaround (My Case)

This "solution" is really more of a workaround than anything else. TL;DR: Don't let jobs that need overlays get scheduled in nodes that live within a ZFS pool.

My server setup boils down to a RAIDed NVMe OS installation drive and a set of larger data drives. The OS drive lacks the capacity for the work that needs to be done and all of the data drives in this setup are VDEVs in the ZFS Pool. This means that there's no other place where the more ideal case of using LXD to add a new non-ZFS pool (see later) could be implemented. (In fact, LXD itself lives within the ZFS pool I had set up.)

Therefore, the workaround here was to re-arrange the Jenkins labels for both the jobs and the nodes. (If you're not familiar with Jenkins, it relies on these labels to determine which nodes can service which jobs/tasks.) The labels are arranged in such a way that Jenkins will never schedule a job that requires mounting an overlay to a LXC-based node that cannot support it.

As the system administrator, you should know how your nodes are set up. The process of re-labeling jobs and nodes is done manually in the node/job configuration pages from Jenkins itself. In this case, you simply make sure that nodes hosted within a ZFS pool never have labels that match those of jobs that need overlays.
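As a complement to the labels, the job's own startup script could refuse to continue when the directory meant for the overlay sits on ZFS, so that a mislabeled node fails with a readable message rather than mount's generic exit code 32. The following is a minimal sketch under stated assumptions: both function names are hypothetical, only ZFS is rejected because that is the one case confirmed in this post, and it assumes GNU stat is available and reports the type as zfs.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Succeed unless the given file system type is the one this post found
# to be unusable as an overlay upperdir (ZFS, in the setup described).
fs_ok_for_overlay_upper() {
    case "$1" in
        zfs) return 1 ;;
        *)   return 0 ;;
    esac
}

# Look up the file system type backing DIR (GNU stat, assumed to print
# "zfs" for ZFS) and fail early with a readable message.
require_overlay_capable_dir() {
    local dir=$1 fstype
    fstype=$(stat -f -c %T "$dir") || return 1
    if ! fs_ok_for_overlay_upper "$fstype"; then
        echo "error: $dir is on $fstype; overlay upperdir not supported" >&2
        return 1
    fi
}
```

A call such as require_overlay_capable_dir /home/workdir || exit 1 would go before the mount step. It does not replace the label arrangement, since the job has already been scheduled by the time it runs.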

Note that your Docker containers must still be launched with the --privileged option for the mount commands to work. This is independent from the ZFS-specific issue described.

Adding Non-ZFS Pools (Ideal)

Note that this solution assumes you have extra drives and/or locations that are outside of the existing ZFS pool. Also note that, as I had said before, I was not able to confirm this myself due to my particular setup. Make sure you understand these steps before trying to apply them, as you do run the risk of destroying your own data, or at least accidentally "hiding" it, if you fail to understand what you're doing.

The TL;DR is from Stéphane Graber:

Your only way out of this is to have you /var/lib/docker not be on ZFS. You can do that either by completely changing your storage pool to something else or by creating a second pool, allocate a dedicated volume from that and attach it to /var/lib/docker inside your container.

That's effectively the setup we made for Travis-CI where the instance itself is on a ZFS pool but /var/lib/docker comes from a directory (ext4) pool so overlayfs is happy.

This would've been a more ideal solution, but I don't have the ability to implement this in my setup. I'm including it here for the sake of completeness, but this is untested by me and you may need to do more work to properly adapt this solution to your setup.

Note that, in my case, it is Jenkins that's being used to run the jobs and Docker bind-mounts some directories from the host into itself. Therefore, paths in later steps will focus on /var/lib/jenkins instead of /var/lib/docker.

First, understand that you cannot trust the /etc/fstab info from inside the LXC. For example, my LXC says:

$ cat /etc/fstab
LABEL=cloudimg-rootfs   /        ext4   defaults        0 0

While it clearly claims to be ext4, this ext4 filesystem may still be (and in my case, actually was) on top of ZFS. Therefore, fstab data should be ignored. Stéphane mentions you should rely on /proc/mounts:

Your container is on ZFS, not on ext4, ignore /etc/fstab and look at /proc/mounts instead
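One way to act on that advice is to look up which mount in /proc/mounts actually covers a given path. The sketch below is only an illustration: the function name fstype_for_path is made up, and it does not handle mount points containing spaces (which /proc/mounts escapes as \040). It picks the longest mount point that is a prefix of the path and prints that entry's file system type.

```shell
#!/usr/bin/env bash
# Sketch only: the function name is hypothetical.
# Usage: fstype_for_path PATH [MOUNTS_FILE]
# Prints the file system type of the mount that covers PATH, based on
# a file in /proc/mounts format (fields: device, mount point, type, ...).
fstype_for_path() {
    local path=$1 mounts=${2:-/proc/mounts}
    awk -v p="$path" '
        {
            mp = $2
            # A mount point covers the path if it is "/", equals the
            # path, or is a parent directory of it.
            if (mp == "/" || p == mp || index(p, mp "/") == 1) {
                # Longest match wins; on ties the later entry wins,
                # since later mounts shadow earlier ones.
                if (length(mp) >= best_len) {
                    best_len = length(mp)
                    fs = $3
                }
            }
        }
        END { if (fs != "") print fs; else exit 1 }
    ' "$mounts"
}
```

Running fstype_for_path /var/lib/docker inside the LXC would then show whether that path is really backed by ZFS, regardless of what /etc/fstab claims.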

In this case, you need to use LXD/LXC to set up a pool that is completely independent/separate from the pre-existing ZFS pool. Then create volumes there and attach them to your LXCs. (This likely means you'll need extra drives, because ZFS likes to consume whole drives when adding them to a pool. As an example, if you're using ZFS on Linux, and your installation drive is using an ext4 file system, then you will not be able to include this drive as part of the ZFS pool.)

The steps below are what I used prior to remembering that the extra non-ZFS pool was still on top of the existing ZFS pool, so, had it not been for that detail, this should've worked. Note that these steps assume a pre-existing Jenkins installation that you want to preserve. Otherwise, you can remove steps as needed.

After marking your Jenkins node offline and SSH'ing into it, move the Jenkins home directory to a backup location and create a new empty directory for it:

mv /var/lib/jenkins /var/lib/jenkins.old
mkdir /var/lib/jenkins

From a separate shell, SSH into the system hosting the LXCs. I chose to stop my LXC, but this may not be required:

$ lxc stop jenkins-node-01

Then create a non-ZFS storage pool and a storage volume inside of it. Here, the pool's name is jenkins and its driver is dir:

$ lxc storage create jenkins dir
$ lxc storage list              
+----------+-------------+--------+------------------------------------------------+---------+
|   NAME   | DESCRIPTION | DRIVER |                     SOURCE                     | USED BY |
+----------+-------------+--------+------------------------------------------------+---------+
| jenkins  |             | dir    | /var/snap/lxd/common/lxd/storage-pools/jenkins | 0       |
+----------+-------------+--------+------------------------------------------------+---------+
| lxd-pool |             | zfs    | tank/lxc                                       | 7       |
+----------+-------------+--------+------------------------------------------------+---------+

It's up to you to make sure that your LXD installation is not hosted within the ZFS pool in question, like in my setup. Otherwise, this is where you'd be going back to square one.

Then create the node-specific volumes inside of it. This is what I ran for my #1 node:

$ lxc storage volume create jenkins jenkins-node-01
$ lxc storage list                                 
+----------+-------------+--------+------------------------------------------------+---------+
|   NAME   | DESCRIPTION | DRIVER |                     SOURCE                     | USED BY |
+----------+-------------+--------+------------------------------------------------+---------+
| jenkins  |             | dir    | /var/snap/lxd/common/lxd/storage-pools/jenkins | 1       |
+----------+-------------+--------+------------------------------------------------+---------+
| lxd-pool |             | zfs    | tank/lxc                                       | 7       |
+----------+-------------+--------+------------------------------------------------+---------+

Note that, after creating the new jenkins-node-01 volume inside the jenkins pool, it now shows the pool is hosting 1 volume. Attach the volume to your node in the correct path:

$ lxc storage volume attach jenkins jenkins-node-01 jenkins-node-01 jenkins /var/lib/jenkins

Note that both the volume and the node are named jenkins-node-01. This is not an error. To confirm the volume is attached, run lxc config show <node-name>; the volume should show up under the devices section. (If you decided to skip the "backup" step, you will be mounting the volume on top of the pre-existing directory, and when you go back to your LXC, the directory will appear empty because the volume on top of it is empty. Your data has not been destroyed, only hidden "under" the volume you just mounted. Just detach the volume and don't skip the "backup" step.)
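When repeating this for several nodes, it may help to print the host-side commands for review before running anything. The sketch below only echoes the commands used in this answer with the node name substituted; it executes nothing. The function name print_volume_plan and its defaults are made up for illustration, and, like the steps themselves, this is untested against a setup where the dir pool actually lives outside ZFS.

```shell
#!/usr/bin/env bash
# Sketch only: prints the host-side commands from this answer for one
# node; nothing is executed. Function and variable names are hypothetical.
# Usage: print_volume_plan NODE [POOL] [PATH]
print_volume_plan() {
    local node=$1 pool=${2:-jenkins} path=${3:-/var/lib/jenkins}
    local device=jenkins   # device name used in the attach command above
    cat <<EOF
lxc stop $node
lxc storage create $pool dir
lxc storage volume create $pool $node
lxc storage volume attach $pool $node $node $device $path
lxc start $node
EOF
}
```

For example, print_volume_plan jenkins-node-01 prints the five commands for the first node. The lxc storage create line only applies the first time, since the pool is shared by all nodes.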

If you had stopped your LXC, you may now run lxc start <your-node> and SSH back into it. From within the LXC node, copy the data over from the backup location and then change the file ownership back to the jenkins account (or whatever account you used). Note the trailing /. on the source: with a plain cp -r /var/lib/jenkins.old/ /var/lib/jenkins/, GNU cp would create /var/lib/jenkins/jenkins.old/ instead of restoring the contents in place:

$ cp -a /var/lib/jenkins.old/. /var/lib/jenkins/
$ chown -R jenkins:jenkins /var/lib/jenkins

Your node should now be ready for use and can be brought back online from the Jenkins admin GUI. After you've verified that everything is working as expected, you should be able to remove the /var/lib/jenkins.old/ backup directory.
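Before deleting the backup, it may be worth checking that the two trees actually hold the same files. A minimal sketch, assuming GNU or POSIX diff is available; the function name trees_match is made up, and the check compares file names and contents only, not ownership or permissions.

```shell
#!/usr/bin/env bash
# Sketch only: the function name is hypothetical.
# Succeeds only if both directory trees contain the same files with the
# same contents (ownership and permissions are not compared).
trees_match() {
    local old=$1 new=$2
    diff -r -q "$old" "$new" >/dev/null
}
```

For example, trees_match /var/lib/jenkins.old /var/lib/jenkins && echo "backup can be removed". This is best run before the node is brought back online, since Jenkins will start changing files in its home directory once it runs.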

Be aware that, from now on, if this volume gets destroyed, your data goes with it. If you have backup processes, such as those that use lxc export ..., you may need to modify your process as the export command only includes the containers, not their volumes as you might assume.
