Projects like this and Docker make me seriously wonder where software engineering is going. Don't get me wrong, I don't mean to criticize Docker or Toro in particular. It's the increasing dependency on such approaches that bothers me.
Docker was conceived to solve the problem of things "working on my machine", and not anywhere else. This was generally caused by the differences in the configuration and versions of dependencies. Its approach was simple: bundle both of these together with the application in unified images, and deploy these images as atomic units.
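In Dockerfile terms, that bundling looks something like this (a generic sketch; the base image, package, and file names are illustrative):

```dockerfile
# Pin the OS and library versions the app was tested against
FROM debian:12-slim
RUN apt-get update && apt-get install -y --no-install-recommends libfoo1

# The application and its configuration travel together in the image
COPY myapp /usr/local/bin/myapp
COPY myapp.conf /etc/myapp.conf

CMD ["/usr/local/bin/myapp"]
```

The resulting image is the atomic unit: the same bytes run in dev, CI, and production.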
Somewhere along the line, however, the problem has mutated into "works on my container host". How is that possible? It turns out that with larger modular applications, the configuration and dependencies naturally demand separation. This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.
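The configuration that "moved up a layer" typically ends up looking something like this (service names and images are illustrative):

```yaml
# docker-compose.yml: the dependency graph reappears one level up
services:
  api:
    image: example/api:2.3
    depends_on: [db, cache]
    environment:
      # wiring that used to live in a single app config file
      DB_URL: postgres://db:5432/app
  db:
    image: postgres:16
  cache:
    image: redis:7
```

Each image is self-contained, but the system as a whole only works once this inter-container wiring is assembled correctly on the host.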
Now hardware virtualization. I like how AArch64 generalizes this: there are 4 levels of privilege baked into the architecture. Each has control over the lower and can call up the one immediately above to request a service. Simple. Let's narrow our focus to the lowest three: EL0 (classically the user space), EL1 (the kernel), EL2 (the hypervisor). EL0, in most operating systems, isn't capable of doing much on its own; its sole purpose is to do raw computation and request I/O from EL1. EL1, on the other hand, has the powers to directly talk to the hardware.
Everyone is happy, until the complexity of EL1 grows out of control and becomes a huge attack surface, difficult to secure and easy to exploit from EL0. Not good. The naive solution? Go a level above, and create a layer that will constrain EL1, or actually, run multiple, per-application EL1s, and punch some holes through for them to still be able to do the job—create a hypervisor. But then, as those vaguely defined "holes", also called system calls and hypercalls, grow, won't the attack surface grow with them?
Or in other words, with the user space shifting to EL1, will our hypervisor become the operating system, just like docker-compose became a dynamic linker?
Containers got popular at a time when an increasing number of people were finding it hard to install software on their system locally - especially if you were, for instance, having to juggle multiple versions of Ruby or multiple versions of Python, each linked against various major versions of C libraries.
Unfortunately containers have always had an absolutely horrendous security story and they degrade performance by quite a lot.
The hypervisor is not going away anytime soon - it is what the entire public cloud is built on.
While you are correct that containers do add more layers - unikernels go the opposite direction and actively remove those layers. Also, imo the "attack surface" is by far the smallest security benefit - other architectural concepts, such as the complete lack of an interactive userland, are far more beneficial when you consider what an attacker actually wants to do after landing on your box (e.g., run their software).
When you deploy to AWS you have two layers of linux - one that AWS runs and one that you run - but you don't really need that second layer and you can have much faster/safer software without it.
I can understand the public cloud argument; if the cloud provider insists on you delivering an entire operating system to run your workloads, a unikernel indeed slashes the number of layers you have to care about.
Suppose you control the entire stack though, from the bare metal up. (Correct me if I'm wrong, but) Toro doesn't seem to run on real hardware, you have to run it atop QEMU or Firecracker. In that case, what difference does it make if your application makes I/O requests through paravirtualized interfaces of the hypervisor or talks directly to the host via system calls? Both ultimately lead to the host OS servicing the request. There isn't any notable difference between the kernel/hypervisor and the user/kernel boundary in modern processors either; most of the time, privilege escalations come from errors in the software running in the privileged modes of the processor.
Technically, in the former case, besides exploiting the application, a hypothetical attacker will also have to exploit a flaw in QEMU to start processes or gain further privileges on the host, but that's just due to a layer of indirection. You can accomplish this without resorting to hardware virtualization. Once in QEMU, the entire assortment of your host's system calls and services is exposed, just as if you ran your code as a regular user space process.
This is the level at which you want to block exec() and other functionality your application doesn't need, so that neither QEMU nor your code run directly can do anything out of scope. Adding a layer of indirection while still leaving the user/kernel, or unikernel/hypervisor, junction points unsupervised will only stop unmotivated attackers looking for low-hanging fruit.
I can't speak for all the various projects but imo these aren't made for bare metal - if you want true bare metal (metal you can physically touch) use linux.
One of the things that might not be so apparent is that when you deploy these to something like AWS, all the users/process mgmt/etc. gets shifted up and out of the instance you control and put into the cloud layer - I feel that would be hard to do with physical boxen because it becomes a slippery slope of having certain operations (such as updates) needing auth, for instance.
I always thought of Docker as a "fuck it" solution. It's the epitome of giving up. Instead of some department at a company releasing a libinference.so.3 and a libinference-3.0.0.x86_64.deb they ship some docker image that does inference and call it a microservice. They write that they launched, get a positive performance review, get promoted, and the Docker containers continue to multiply.
>what an attacker actually wants to do after landing on your box.
Aren't there ways of overwriting the existing kernel memory/extending it to contain a new application if an attacker is able to attack the running unikernel?
What protections are provided by the unikernel to prevent this?
To be clear, there are still numerous attacks one might lob at you. For instance, if you are running a Node app and the attacker uploads a new JS file that they can have the interpreter execute, that's still an issue. However, you won't be able to start running random programs, like curling down some cryptominer or something - it'd all need to be contained within that code.
What becomes harder is if you have a binary that forces you to rewrite the program in memory as you suggest. That's where classic page protections come into play, such as not exec'ing .rodata, not writing to .text, not exec'ing the heap/stack, etc. Just to note that not all unikernel projects have this, and even if they do it might be trivial to turn them off. The kernel I'm involved with (Nanos) has other features such as 'exec protection', which prevents the app from exec-mapping anything not already explicitly mapped exec.
Running arbitrary programs, which is what a lot of exploit payloads try to achieve, is pretty different from having to stuff whatever they want to run inside the payload itself. For example, if you look at most malware it's not just one program that gets run - it's more like 30. Droppers exist solely to load third-party programs on compromised systems.
If the stack and heap are non-executable and page tables can't be modified then it's hard to inject code. Whether unikernels actually apply this hardening is another matter.
> This results in them moving up a layer, in this case creating a network of inter-dependent containers that you now have to put together for the whole thing to start... and we're back to square one, with way more bloat in between.
I think you're over-egging the pudding. In reality, you're unlikely to use more than 2 types of container host (local dev and normal deployment maybe), so I think we've moved way beyond square 1. Config is normally very similar, just expressed differently, and being able to encapsulate dependencies removes a ton of headaches.
damn… i am a big fan of bryan and i thought i was a big fan of unikernels… well, i still am, but all the points he makes are absolutely well-founded. i will say, in contraposition to the esteemed (and hilarious) mr. cantrill, that it is quite incredible to get to the end of an article about unikernels without seeing any mention of how the “warmup” time for a unikernel is subsecond whereas the warmup time for, say, containers is… let’s just call it longer than the warmup time for the water i am heating to make some pourover coffee after i finish my silly post. to dismiss this as a profound advantage is to definitely sell the idea more than a little short.
but at the same time i do think it is fair at this juncture to compare the tech to things like wasm, with which unikernels are much more of a direct competitor than containers. it is ironic because i can already hear in my head the hilarious tirade he would unleash about how horrific docker is in every way, debugging especially, but yet somehow this is worse for containers than for unikernels. my view at the present is that unikernels are fantastic for software you trust well enough to compile down to the studs, and the jury is still out on their efficacy beyond that. but holy fuck i swear to god i have spent more time fucking with docker than many professional programmers have spent learning their craft in toto, and i have nothing to show for it. it sucks every time, no matter what. my only gratitude for that experience revolves around (1) saving other peoples’ time on my team (same goes for git, but git is, indisputably, a “good” technology, all things considered, which redeems it entirely), and (2) it motivated me to learn about all the features that linux, systemd, et al. have (chroot jails, namespaces, etc.) in a way that surely exceeds my natural interest level.
Just to save people from wasting their time reading this drivel:
> If this approach seems fringe, things get much further afield with language-specific unikernels like MirageOS that deeply embed a particular language runtime. On the one hand, allowing implementation only in a type-safe language allows for some of the acute reliability problems of unikernels to be circumvented. On the other hand, hope everything you need is in OCaml!
ToroKernel is written in Free Pascal.
All of the text before and after is completely irrelevant.
Cantrill is far smarter and more accomplished than me, but this article feels a bit like a strawman and a hit-and-run?
I think unikernels potentially have their place, but as he points out, they remain mostly untried, so that's fair. We should analyze why that is.
On performance: I think the kernel makes many general assumptions that some specialized domains may want to short-circuit entirely. In particular, I am thinking of how there's a whole body of research on database buffer pool management basically building elaborate workarounds for the kernel's virtual memory subsystem and page management techniques, and I suspect there are wins there in the unikernel world. The same likely goes for inference engines for LLMs.
The Linux kernel is a general purpose utility optimizing for the entire range of "normal things" people do with their Linux machines. It naturally has to make compromises that might impact individual domains.
That and startup times, big world of difference.
Is it going to help people better sling silly old web pages and whatever it is people do with computers conventionally? Yeah, I'd expect not.
On security, I don't think it's unreasonable or pure "security theatre" to go removing an attack surface entirely by simply not having it if you don't need it (no users, no passwords, no filesystem, whatever). I feel like he was a bit dismissive here? That is also the principle behind capability-passing security to some degree.
I would hate to see people close the door on a whole world of potentials based on this kind of summary dismissal. I think people should be encouraged to explore this domain, at least in terms of research.
> On performance: ... In particular I am thinking how there's a whole body of research of database buffer pool management
Why? Hasn't the solution always been to turn off what the kernel does and do these things in userspace, not to move everything into the kernel? Where are these performance gains to be had?
> The Linux kernel is a general purpose utility optimizing for the entire range of "normal things" people do with their Linux machines.
Yeah, like logging. Perhaps you say: "Oh we just add that logging to the blob we run". Well isn't that now another thing that can take down the system, when before it was a separate process?
> That and startup times, big world of difference.
Perhaps in this very narrow instance, this is useful, but what is it useful for? Can't Linux or another OS be optimized for this use case without having to throw the baby out with the bathwater? Can't one snapshot a Firecracker VM and reach even faster startup times?
> On security, I don't think it's unreasonable or pure "security theatre" to go removing an attack surface entirely
Isn't perhaps the most serious problem removing any and all protection domains? Like between apps and the kernel and between the apps themselves?
I mean -- sure remove the filesystem, but not the memory protection? But even then, a filesystem is a useful abstraction. I've never found myself wanting less filesystem. Usually, I want more, like ZFS.
I use LXD + LXC, wondering if this is worth trying or if the overhead of accessing (network, etc) would be too much to deal with/care about.
Also always a little wary of projects that have bad typos or grammar problems in a README - in particular on one of the headings (though it's possible these are on purpose?). But that's just me :\
I've been using unikraft (https://unikraft.org/) unikernels for a while and the startup times are quite impressive (easily sub-second for our Rust application).
I don't want the observability of my applications to be bound to the applications themselves; it's kind of a real pain. I'm all for microVM images without excess dependencies, but coupling the kernel and diagnostic tools to rapidly developing application code can be a real nightmare as soon as the sun stops shining.
My last name is finally on the front page of HN as a project name, look mah!
I was not expecting Pascal, that's an interesting choice. One thing I do like is that Free Pascal has one of the better ways of making GUIs, while every other language has decided that just letting JavaScript build UIs is the way.
Smaller than containers seems unlikely since a container doesn't have any kernel at all, while these microvms have to reproduce at least the amount of kernel they would otherwise need (e.g., a networking stack). I'm sure some will be inclined to compare an optimized microvm to an application binary slapped into an Ubuntu container image, but that's obviously apples/oranges.
Faster might be possible without the context switching between kernel and app? And maybe additional opportunities for the compiler to optimize the entire thing (e.g., LTO)?
Presumably to avoid the cost of context switches or copying between kernel/user address spaces? Looks to be the opposite of userspace networking like DPDK: kernel space application programming.
I think one way unikernels can be different is that they can allow more isolation - perhaps running user-generated code inside the unikernel with proper isolation - whereas I don't think actors can do that.
The story is quite different in HP-UX, AIX, Solaris, BSD, Windows, IBM i, z/OS,...
There are AppContainers. Those have existed for a while and are mostly targeted at developers intending to secure their legacy applications.
https://learn.microsoft.com/en-us/windows/win32/secauthz/app...
There's also Docker for Windows, with native Windows container support. This one is new-ish:
https://learn.microsoft.com/en-us/virtualization/windowscont...
[0]: https://www.tritondatacenter.com/blog/unikernels-are-unfit-f...
Unikernels don't work for him; there are many of us who are very thankful for them.
Why? Can you explain, in light of the article, and for those of us who may not be familiar with qubes-mirage-firewall, why?
Neat.
Now I write Javascript and SQL.
:)
Plus these might be smaller and might run faster than containers too.
file sharing is complex too it seems
would be good to see a benchmark or something showing where it shines