If anything, it's a problem with the design of UNIX process management, inherited thoughtlessly, which Docker decided not to deal with on its own. Why does there have to be a special, unkillable process whose only job is to call wait(2) in an infinite loop? Because in the original UNIX design, Ken Thompson apparently did not want to do too much work in the kernel during exit(2): if process A calls exit(2) while having 20 already-exited children it never waited for, you either have to reap those 20 processes (which involves reading their PCBs from swap on disk), and then potentially reap their already-exited children, and their grandchildren... or you can just iterate over the process table, set the ppid of A's children to 1, and schedule PID 1 to run and let it deal with reaping one process at a time in wait(2). Essentially, the work is pushed to the scheduler, but the logic itself lives in user space, at the cost of PID-space pollution.
cyphar 21 hours ago
On the other hand, process managers care about the exit status of child processes, and the most straightforward way to preserve it is to keep around a zombie that just holds that information and ensures the only identifier available to userspace at the time continues to reference the same pid. (Of course, zombies on Linux contain some more information, but some of that is sometimes necessary, and pidfds remove some of the need for this.)
The funny thing is that there is a way to opt out of zombie reaping as pid1 or a subreaper -- set sigaction of SIGCHLD to SIG_IGN (and so it really isn't that hard on the kernel side). Unfortunately this opts you out of all child death events, which means process managers can't use it.
If you want to argue that the interaction of zombies with re-parenting is borked, that is a very different discussion (though re-parenting itself is necessary for daemonisation). If there were a way to opt out of zombie reaping only for reparented processes, things would be much nicer.
IMHO the bigger issue with Docker and pid1 is that pid1 signal semantics (for instance, most signals are effectively SIG_IGN by default) differ from those of other processes, and lots of programs didn't deal with that properly back then. Nowadays it might be a bit better, but Docker has also had a built-in minimal init for many years (just use --init), so the problem is basically solved these days.
Joker_vD 19 hours ago
> the most straightforward way is to keep around a zombie that just contains that information and ensures the only identifier available to userspace at the time continues to reference the same pid.
Again, that's only really needed if the zombie process's parent has not yet died itself; in fact, this is also how the Windows API operates: unless you call CloseHandle on the process handle obtained from CreateProcess, the child process will not go away and will hang around. However, if there are no open handles to a process, it will disappear entirely on its own the moment it calls ExitProcess; and if that process held the last open handles to some other zombie processes, those too will go away. All of that with no need for a dedicated PID 1 which shall not be killed.
Reparenting, of course, is not needed for long-running services/daemons, as demonstrated by daemontools, runit, s6, Upstart, systemd, etc.
cyphar 5 hours ago
The handle solution is definitely better (which is why I mentioned pidfds -- I actually think it might be possible to do this today with SIG_IGN and PIDFD_GET_INFO but it's a little hacky) but Unix only had pids and most descendants only have pids too. In that paradigm the zombie solution is kind of inevitable (as with most other Unix hacks). My point was that it wasn't as simple as "just doing some more work in exit(2)" -- you would need to redesign the process API.
There are loads of other related issues to this of course -- libraries cannot really spawn subprocesses as part of their implementation because programs using the library could see the SIGCHLD by accident. With the right design, the handle approach is better here.
> Reparenting, of course, is not needed for long-running services/daemons, as demonstrated by daemontools, runit, s6, Upstart, systemd, etc.
Because they are spawned as children of the daemon. Reparenting is needed for standard Unix utilities that want to run in the background (especially in an interactive shell). Of course, you can redesign that too if you have a time machine, but it would require more work than you originally intimated.
stevefan1999 1 day ago
They tried to kill Kubernetes, now they try to kill Docker. Fascinating
trilogic 1 day ago
Docker is like infecting your PC on purpose, then running everything in slow motion. Do the math!
akerouanton 1 day ago
> Every time the Docker daemon starts, it changes iptables’s FORWARD chain’s policy to DROP for no reason.
Prior to v28, iptables rules were written in such a way that they depended on the default `FORWARD` policy. To get proper container isolation, that default policy had to be set to `DROP`.
That's no longer the case. The iptables rules have been rewritten not to depend on that default policy, but we still set it because users might (un)knowingly depend on it to secure their systems; we thought it wasn't worth the trouble to change that after so many years. However, we added an escape hatch in the form of a new daemon parameter (named `ip-forward-no-drop`) so that users who don't want that default policy aren't forced to disable the iptables integration altogether.
We published a blog post about that and the other security hardening measures we took in v28: https://www.docker.com/blog/docker-engine-28-hardening-conta...
v29.0 will have support for nftables. It'll be marked as experimental in the first few releases to allow us to change anything without worrying about backward compatibility. However, it already provides the same feature coverage as iptables. Things will be a bit different with this firewall backend though - the Engine will refuse to start if the sysctl `net.ipv4.ip_forward` is not set to 1. Users will have to set it on their own, consider the security implications, and take the necessary measures to block forwarding between non-Docker interfaces. Our rules will be isolated in their own nft table, so hopefully it'll feel less like "Docker owns the system".
> Docker’s lack of UID isolation by default
This is not my area of expertise but this is omitting that user namespaces tend to drastically increase the attack surface (despite what some vendors say). For instance: https://blog.qualys.com/vulnerabilities-threat-research/2025....
> Docker makes it quite difficult to deploy IPv6 properly in containers, [...] since Docker relies on NAT [...] The only way around this is to… write your own firewall rules
This is no longer true. We added a network-level parameter to use IPv6 without NAT while keeping the semantics of `-p` (the port-publishing flag).
For instance, you can create a non-NAT / "routed" network with: `docker network create -o com.docker.network.bridge.gateway_mode_ipv6=routed --ipv6 testnet`. That network will get a ULA subnet assigned if no IPv6 `--subnet` was provided.
If you run a container with a published port, e.g. `docker run --network testnet -p 80/tcp …`, your container's port 80 will be accessible, but other ports will not be.
The downside of that approach is that some or all of the routers in your local network need to learn about this subnet to route it correctly to the Docker host.
cyphar 21 hours ago
> This is not my area of expertise but this is omitting that user namespaces tend to drastically increase the attack surface (despite what some vendors say).
Configuring user namespaces for the container to improve containment = very good idea. Enabling CLONE_NEWUSER inside a container = (usually) a very bad idea.
You can do one without the other, and the built-in user namespaces support in Docker (and Podman) does exactly that.
As one of the runc maintainers, I can say without reservation that user namespaces would have blocked the vast majority of container breakout attacks in the past decade and you absolutely should use them. The only technology with a similar track record for improving container security is seccomp. (SELinux folks will argue that SELinux deserves mention or maybe even top billing, but I have somewhat mixed opinions on that.)
This is not even an unusual opinion. LXC doesn't even consider containers with user namespaces disabled part of their threat model, precisely because it's so insecure to not use them[1]. Also, in my experience, most kernel developers generally assume (incorrectly) that most users use user namespaces when isolating containers and so make some security design decisions around that assumption. In every talk I've given on container security in the past few years I have urged people to use user namespaces.
It is even better for each container to have its own uid/gid block. Podman, LXC and runc all support this but Docker doesn't really (though I think there was some work on this recently?). The main impediment to proper user namespaces support for most users was the lack of support for transparent uid/gid remapping of mount points, but that has been a solved problem for a few years now (MOUNT_ATTR_IDMAP).
[1]: https://linuxcontainers.org/lxc/security/
> v29.0 will have support for nftables. It'll be marked as experimental in the first few releases to allow us to change anything without worrying about backward compatibility.
It would've been nice to at least link to the EPIC[1] to provide some evidence that it is actually being worked on. Sorry to be a bit snarky, but this has been a known issue for 9 years now[2,3,4] with no development despite claims to the contrary by Docker Inc (Mirantis now, I guess?).
Out of interest, is this being merged into libnetwork or implemented separately? (A quick look gave me the impression it was being implemented separately, but libnetwork was re-merged into Docker a while ago.) Also, I guess your comment in [5] is outdated and it will actually be in v29?
[1]: https://github.com/moby/moby/issues/49634 [2]: https://github.com/moby/libnetwork/issues/1998 [3]: https://github.com/moby/moby/issues/26824 [4]: https://github.com/docker/for-linux/issues/1472 [5]: https://github.com/docker/for-linux/issues/1472#issuecomment...