
Hacker Remix

Build a Container Image from Scratch

196 points by prakashdanish 2 months ago | 56 comments

godelski 1 month ago

I often wonder, why isn't systemd-nspawn[0] used more often? It's self-described as "chroot on steroids". IME it pretty much lives up to that name. Makes it really easy to containerize things and since it integrates well with systemd you basically don't have to learn new things.

I totally get these are different tools and I don't think nspawn makes docker or podman useless, but I do find it interesting that it isn't more used, especially in things you're using completely locally. Say, your random self-hosted server thing that isn't escaping your LAN (e.g. Jellyfin or anything like this)

[0] https://wiki.archlinux.org/title/Systemd-nspawn

cmeacham98 1 month ago

Because Docker/OCI/etc got the most important part right (or at least much better than the alternatives): distribution.

All you need to start running a Docker container is a location and tag (or hash). To update, all you do is bump the tag (or hash). If a little more complicated setup is necessary (environment variables, volumes, ports, etc) - this can all be easily represented in common formats like Docker compose or Kubernetes manifests.
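For example (everything here is a placeholder: image name, tag, port, volume), a Compose file captures that whole setup declaratively, and updating really is just bumping the tag and re-running `docker compose up -d`:

```yaml
services:
  app:
    image: registry.example.com/myapp:1.2.3   # bump this tag to update
    ports:
      - "8080:8080"                           # host:container
    volumes:
      - appdata:/var/lib/myapp                # named volume for state
    environment:
      - LOG_LEVEL=info
volumes:
  appdata:
```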

How do you start running a systemd-nspawn container? Well, first you bootstrap an entire OS, then deal with that OS's package manager to install the application. You have to manage updates with the package manager yourself (and they likely aren't immutable). There's no easy declarative config - you'll probably end up writing a shell script or using a third-party tool like Ansible.

There have been many container/chroot concepts in the past. Docker's idea was not novel, but they did building and distribution far better than any alternative when it first released, and it still holds up well today.

ranger207 1 month ago

Yeah, this. Docker/container's greatest feature is less the sandboxing than the distribution. The sandboxing is essential to making the distribution work well, but it's a side feature most of the time

placardloop 1 month ago

It’s kind of funny that people think of “sandboxing” as the main feature of containers, or even as a feature at all. The distribution benefits have always been the entire point of Docker.

The logo of Docker is a ship with a bunch of shipping containers on it (the original logo was clearer, but the current logo still shows this). “Containers” has never been about “containment”, but about modularity and portability.

shykes 1 month ago

Docker introduced an ambiguity in the meaning of the word "container". The word existed before Docker, and it was about sandboxing. Docker introduced the analogy of the shipping container, which, as ranger207 says, is about sandboxing in the service of distribution.

The two meanings - sandboxing and distribution - have coexisted ever since, sometimes causing misunderstandings and frustration.

globular-toast 1 month ago

It's not about sandboxing or distribution, it's about having a regular interface. This is why the container analogy works. In the analogy the ship is a computer and the containers are programs. Containers provide a regular interface such that a computer can run any program that is packaged up into a container. That's how things like Kubernetes work. They don't care what's in the container, just give them a container and they can run it.

This is as opposed to the "old world" where computers needed to be specifically provisioned for running said program (like having interpreters and libraries available etc.), which is like shipping prior to containers: ships were more specialised to carrying particular loads.

The analogy should not be extended to the ship moving and transporting stuff. That has nothing to do with it. The internet, URLs and tarballs have existed for decades.

guappa 1 month ago

Docker containers ran as root by default for a great number of years. I'm not even sure if it has now finally been changed.

They provided no sandboxing whatsoever.

ninkendo 1 month ago

That’s a horrendously bad take, running as uid0 in the container doesn’t mean “no sandboxing whatsoever”. You’re still namespaced with respect to pids/network interfaces/filesystem/etc, and it’s not supposed to be possible to escape it, even when running as root in the container.

Is it possible to do container escapes on occasion? Yes, but each of those is a bug in the Linux kernel that is assigned a CVE and fixed.

Running as non-root in the container is an additional layer of security but it’s not all-or-nothing: doing so doesn’t make you perfectly secure (privilege escalation bugs will continue to exist) and not doing so doesn’t constitute “nothing whatsoever”.

guappa 1 month ago

I see you're not aware of `mknod`?

> Is it possible to do container escapes on occasion? Yes, but each of those is a bug in the Linux kernel that is assigned a CVE and fixed.

No bug: if you have permissions to run mknod, it's an escape that's entirely by design, one that docker lets you do :)

I wasn't talking about kernel bugs, of course there have been a lot of those causing escapes. I am talking about the default configuration that does absolutely 0 sandboxing. And it's not a bug, it's as intended.

If you want to run as root and don't even touch capabilities… yeah it's root. 0 protection, the stuff in the container is running as root and can easily escape namespaces.

maple3142 1 month ago

I really wonder how one can escape a container given a root shell created by `docker run --rm -it alpine:3 sh` without using a 0day. Using the latest Docker and a reasonably up-to-date Linux kernel, of course.

With the command above it is still possible to attack network targets, but let's ignore that here. I just wonder how it is possible to obtain code execution outside the namespace without using kernel bugs.

edoceo 1 month ago

Can you show me how? Like, if I'm in a stock debian-slim container, and have mknod, and I've started as root, how can I get from inside the container to the host? Could I create files/run program on the host? Portscan localhost? Do something crazy with the docker socket?

mdaniel 1 month ago

> I see you're not aware of `mknod`?

Try harder, friend, those require granted capabilities

  $ PAGER=cat man 7 capabilities | grep -C1 MKNOD

       CAP_MKNOD (since Linux 2.4)
              Create special files using mknod(2).

  $ docker run --rm -it public.ecr.aws/docker/library/ubuntu:24.04 /usr/bin/mknod fred b 252 4
  /usr/bin/mknod: fred: Operation not permitted

guappa 1 month ago

It'd be interesting to know what's in your /etc/docker configuration :)

ninkendo 1 month ago

Yeah you’re going to need to elaborate and post your sources here. If there’s zero protection at all, show how I can run `docker run -it alpine sh` and break out of the container. Without exploiting any 0days.

No, --privileged doesn’t count. No, --cap-add=<anything> doesn’t count. The claim here is that docker has “zero sandboxing” by default, so you’re going to need to show that you don’t need either of those. Not just moving the goalposts and saying you can break out if you use the command line flag that literally says “privileged”.

godelski 1 month ago

Sorry. I agree, but that's a different question, so I'll circle back to it: why don't technical people make these interfaces, giving the same love to user experience that something like Docker gets? As you said, it is scriptable, and I think -- us all being programmers here -- we all know that means you can just make the interface easier.

prmoustache 1 month ago

Are you implying that docker or podman hasn't been made by _technical people_?

godelski 1 month ago

No? I'm not sure I follow. I wouldn't say Apple wasn't made by technical people either. Saying technical people frequently ignore the importance of design does not mean that anyone who recognizes the importance of design is non-technical.

vaylian 1 month ago

> I often wonder, why isn't systemd-nspawn[0] used more often?

I think most people simply don't know about it. A lot of people also don't know that there are alternatives to Docker.

I use both, systemd-nspawn and podman containers. They serve different purposes:

systemd-nspawn: Run a complete operating system in a container. Updates are applied in-place. The whole system is writeable. I manage this system myself. I also use the -M switch for the systemctl and journalctl commands on the host to peek into my nspawn-containers. I create the system with debootstrap.
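As a sketch of that workflow (the machine name "devbox" and the options are illustrative; see systemd.nspawn(5) for the full set): after a debootstrap into /var/lib/machines/devbox, per-container settings can live in a declarative .nspawn file:

```ini
; /etc/systemd/nspawn/devbox.nspawn (name is made up)
[Exec]
Boot=yes

[Network]
VirtualEthernet=yes
```

Then `machinectl start devbox` boots it, and `systemctl -M devbox status` / `journalctl -M devbox` peek inside as described.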

podman: Run a stripped down operating system or just a contained executable with some supporting files. Most of the system is read-only with some writeable volumes mounted at well-defined locations in the file system tree. I don't manage the container image myself and I have activated auto-updates via the quadlet definition file. I create the container based on an image from a public container registry.
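A quadlet definition of that kind might look roughly like this (unit name, image, and volume are placeholders; AutoUpdate=registry is the line that enables the auto-updates):

```ini
; ~/.config/containers/systemd/myapp.container (rootless quadlet path)
[Container]
Image=docker.io/library/myapp:latest
AutoUpdate=registry
Volume=myapp-data:/data

[Service]
Restart=always

[Install]
WantedBy=default.target
```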

Both solutions have their place. systemd-nspawn is a good choice if you want to create a long-lived linux system with lots of components. podman/docker containers are a good choice if you want to containerize an application with standard requirements.

systemd-nspawn is good for pet containers. podman is good for cattle containers.

fuhsnn 1 month ago

I just started learning to set up containers and found nspawn a total convenience: just create ./usr, throw some statically linked binaries into ./bin, and systemd-nspawn -D would handle the rest, including network pass-through.

wvh 1 month ago

I used this extensively at the time Docker was up and coming. It worked well, much faster than Docker volumes, but required a lot of scripting and clean-up. What Docker got right, apart from distribution, is better separation of host system and whatever mess you are creating. You do not want to make a mistake bootstrapping an OS or forgetting to `chroot` to the right volume.

magicalhippo 1 month ago

> Say, your random self-hosted server thing that isn't escaping your LAN (e.g. Jellyfin or anything like this)

I tried reading your link but I'm none the wiser, so perhaps you could provide the docker-equivalent one-liner to start a Jellyfin instance using systemd-nspawn?

godelski 1 month ago

There isn't a one-liner because no one has built it. Which, to be clear, also had to be done for docker.

I'll admit, the documentation to really anything systemd kinda sucks but awareness can help change that

magicalhippo 1 month ago

Ok, so I misread your question.

You're asking why hasn't anyone made something like Docker but with systemd-nspawn as the runtime or "engine".

edit: Found this article[1], which tries to do just that. Still not as convenient as Docker, but doesn't look terrible either.

[1]: https://benjamintoll.com/2022/02/04/on-running-systemd-nspaw...

godelski 1 month ago

Yeah, there's definitely a big difference between something being technically better (or worse) and the actual usability of a thing. We have a long history of products that are not technically better winning out (for many reasons). I'm confident nspawn doesn't get nearly the same attention, and few people even know about it. Docs definitely suck. But we're also on a very technical forum, not a general audience one, so I kinda assume a context that people here are not as concerned about the user interface.

magicalhippo 1 month ago

> But we're also on a very technical forum, not a general audience one, so I kinda assume a context that people here are not as concerned about the user interface.

I think that's a common mistake. I'm fairly technical compared to your average user, but I don't have a much higher tolerance for friction when it comes to stuff that's not my core concern.

Poor UX is definitely friction, and system administration is seldom my core concern. I'm fairly certain I'm not unique.

godelski 1 month ago

I definitely explained that poorly. I was in the middle of some other work and typed too quickly. I'm sorry.

I more mean that technical people tend to be more willing to slog through a poor UX if the tool is technically better. I mean, we are all programmers here, right? Programming is a terrible UX, but it is the best thing we've got to accomplish the things we want. I'm saying that these people are often the first adopters, more willing to try new things. Of course, this doesn't describe every technical person, but the people willing to do these things are a subset of the technical group.

I definitely see UX as a point of friction and I do advocate for building good interfaces. I actually think it is integral to building things that are also performant and better from a purely technical perspective. I feel that as engineers/developers/researchers we are required to be a bit grumpy. Our goal is to improve things, to make new things, right? One of the greatest means of providing direction to that is being frustrated by existing things lol. Or as Linus recently said: "I'm just fixing potholes." If everything is alright then there's nothing to improve, so you gotta be a little grumpy. It's just about being the right kind of grumpy lol

jmholla 1 month ago

If the author is here, I think there's a typo in this. In section 1.4, you start working from the scratch layer, but the content continues to refer to alpine as the base layer.

    FROM scratch
    
    COPY ./hello /root/
    
    ENTRYPOINT ["./hello"]
> Here, our image contains 2 layers. The first layer comes from the base image, the alpine official docker image i.e. the root filesystem with all the standard shell tools that come along with an alpine distribution. Almost every instruction inside a Containerfile generates another layer. So in the Containerfile above, the COPY instruction creates the second layer which includes filesystem changes to the layer before it. The change here is “adding” a new file—the hello binary—to the existing filesystem i.e. the alpine root filesystem.

prakashdanish 1 month ago

Thanks for pointing that out! I'm curating a PR with all the suggestions from here; should be fixed soon!

psnehanshu 1 month ago

That, and they also added the "time" command in the config with the scratch base image.

mortar 1 month ago

Just learnt about whiteout files from this, thanks! Trying to understand: if you purposely included a file whose name starts with the whiteout prefix “.wh.” in a layer, would it mess with the process that hides that prefix from subsequent layers?

m463 1 month ago

I learned about $_

  echo abc && echo $_
  abc
  abc

except it's used with wget...

  wget URL && tar -xvf $_
does this work? Shouldn't tar take a filename?

hmm... also, it says there is an alpine layer with "FROM scratch"??

godelski 1 month ago

$_ is the last argument of the previous command. Here's a better example to illustrate:

  > echo 'Hello' 'world' 'my' 'name' 'is' 'godelski'
  Hello world my name is godelski
  > echo $_
  godelski
  > !-2:0 !-2:1 !-2:2 "I'm" "$_"
  Hello world I'm godelski
The reference manual is here[0] and here's a more helpful list[1]

One of my favorites is

  > git diff some/file/ugh/hierarchy.cpp
  > git add $_
  ## Alternatively, but this is more cumbersome (but more flexible)
  !!:s^diff^add
So what is happening with wget is

  > wget https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.4-x86_64.tar.gz && tar -xvf $_
  ## Becomes
  > wget https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.4-x86_64.tar.gz
  > tar -xvf https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.4-x86_64.tar.gz
Which you are correct, doesn't work.

It should actually be something like this

  > wget https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.4-x86_64.tar.gz -O alpine.tar.gz && tar xzf $_
This would work, as the last parameter is now correct. I also added `z` to the tar flags and removed the `-` because it isn't needed. Note that `v` often makes untarring files MUCH slower.

[0] https://www.gnu.org/software/bash/manual/html_node/Bash-Vari...

[1] https://www.gnu.org/software/bash/manual/html_node/Variable-...

ryencoke 1 month ago

If you want to add another bash trick, Parameter Expansion[0] lets you parse the filename out of the special variable $_ automatically. Something like:

  > wget https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.4-x86_64.tar.gz && tar xzf ${_##*/}
[0] https://www.gnu.org/software/bash/manual/html_node/Shell-Par...

MyOutfitIsVague 1 month ago

A small note: `$_` is a Bashism (though it is widely supported), but Parameter Expansion is POSIX-standard and will work in all POSIX-compliant shells, not just Bash.
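A quick illustration that this works in plain sh (the URL is just the one from upthread):

```shell
# ${var##pattern}: delete the longest prefix matching pattern (POSIX).
# ${var%%pattern}: delete the longest suffix matching pattern (POSIX).
url="https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.4-x86_64.tar.gz"
echo "${url##*/}"    # filename only: strips everything through the last '/'
echo "${url%%://*}"  # scheme only: strips '://' and everything after it
```

This prints the bare filename, then `https`.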

https://pubs.opengroup.org/onlinepubs/009604499/utilities/xc...

vendiddy 1 month ago

You know I just realized after all these years I still don't quite know what a shell is.

You have iTerm, Terminal, etc. But what do those do? Those are not the shells themselves right?

mdaniel 1 month ago

I wanted to offer that (if you are "C handy") writing your own shell is a super informative exercise. We had to write our own shell in my operating system class at GT and I actually got it working well enough that I could "exec ./myshell" and use it for some day to day stuff. I felt empowered

I tried to dig up the course but naturally things are wwaaaaaaay different now than back in my day. But OCW has something similar https://ocw.mit.edu/courses/6-828-operating-system-engineeri... and does ship the source files https://ocw.mit.edu/courses/6-828-operating-system-engineeri... although I have no idea why that's only present in a graduate level class

kritr 1 month ago

iTerm and Terminal are pieces of software that emulate a physical terminal. They take the characters and control codes that programs and shells output and use them to render text, clear the screen, etc.

The terminal emulator receives keyboard input via your operating system, and passes it to the shell program via stdin.

The shell is responsible for prompting you and handling whatever you type. After printing the “$ ” prompt, for example, it waits for characters from the terminal emulator until you hit newline.

The shell then parses your input, executes any child programs (“ls”, for example) whose output goes to stdout, and prompts you again.
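That read/eval/prompt loop can be sketched in a few lines of bash itself (a toy, of course: real shells do their own parsing, job control, redirection, and so on; `eval` stands in for all of that here):

```shell
# A toy "shell": prompt, read a line, execute it, repeat.
toysh() {
  while IFS= read -r -p '$ ' line; do
    eval "$line"   # a real shell parses and execs instead of eval-ing
  done
}

# Non-interactive demo: pipe commands in instead of typing them.
printf 'echo hello from toysh\n' | toysh   # prints "hello from toysh"
```

Interactively (`toysh` at a bash prompt) it shows its own `$ ` prompt and runs whatever you type until EOF (Ctrl-D).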

godelski 1 month ago

##*/ is probably one of my most used parameter expansions. This was definitely a better solution than the one I proposed lol

alkh 1 month ago

I am surprised that this works, as I always thought variables get expanded only after the full command line is parsed. So I would have assumed $_ referred to the previous command (delimited by a newline), not this one, since there's no newline character here, only an ampersand.

godelski 1 month ago

&& chains commands into a sequence: the command on the right only runs if the one on the left succeeds. So...

  > thisFunctionFails && echo "Hello world" && echo "I SAID $_"
  
  > thisFunctionSucceeds && echo "Hello world" && echo "I SAID $_"
  Hello world
  I SAID Hello world
The left command has to be evaluated before the next one runs, so when $_ is expanded it still refers to the previous command.

fragmede 1 month ago

TIL! I use alt-. for that when running interactively, good to know there's a way to do that in a script

mdaniel 1 month ago

It's not an alpine layer, it's a Dockerfile construct representing basically an empty tar file layer: <https://docs.docker.com/build/building/base-images/#create-a...> and <https://github.com/moby/moby/pull/8827>

m463 1 month ago

He says:

  FROM scratch

  COPY ./hello /root/

  ENTRYPOINT ["./hello"]
> Here, our image contains 2 layers. The first layer comes from the base image, the alpine official docker image i.e. the root filesystem with all the standard shell tools that come along with an alpine distribution.

But I thought "FROM scratch" was an empty container, while "FROM alpine" is a container with alpine libs/executables.

otherwise using "FROM scratch" to populate for example an ubuntu image would pollute the container.
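For what it's worth, a scratch-based Containerfile can work, but only with a statically linked binary, since scratch ships no libc and no shell. A sketch (assuming, say, a Go hello-world; note the absolute path in ENTRYPOINT, since the binary was copied to /root/ while the default working directory is /):

```dockerfile
# Containerfile: a genuinely single-binary image.
# Build the binary statically first, e.g. for a Go program:
#   CGO_ENABLED=0 go build -o hello .
FROM scratch
COPY ./hello /root/
ENTRYPOINT ["/root/hello"]
```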

prakashdanish 1 month ago

You're right, that doesn't work the way it's shown. Thankfully, a reader (I'm not sure if it's you) pointed this out with a solution[1] that I plan to add to the post shortly. Again, thanks for pointing this out!

[1] - https://github.com/danishprakash/danishpraka.sh/issues/30

DeathArrow 1 month ago

By containers here the author seems to mean Docker containers. But there are other types of containers, like Linux/OpenVZ containers, Windows containers, etc.

adminm 1 month ago

Yep. Also containers used in the shipping industry. You might have yet another type in your fridge.

The thing is that because Docker started the craze, the word "container" without further context has, in the IT world, come to mean a docker container.

prakashdanish 1 month ago

Yes, that's what I meant. Though not specifically Docker containers: I meant Linux containers, which are most commonly managed by container engines such as Podman or Docker.