TL;DR: In the Linux kernel, as part of the Kernel Self Protection Project, we are pushing for new lightweight security mechanisms. On top of that, in systemd we are implementing new lightweight container mechanisms that target Embedded Linux and IoT. Our goal is to make it easy to deploy secure Embedded Linux and IoT systems.
Working with Embedded Linux systems, device drivers, and system and kernel security allows me to inspect what will be deployed in production. I have noticed a common pattern on some devices: either they contain simple kernel vulnerabilities in third-party drivers, or the deployed applications run with more privileges than they need and lack common sandbox and security features. In this blog post, I will give a brief introduction to some mechanisms that allow you to improve the security of your Linux-based IoT devices, before discussing new plans that are in development, from systemd to Linux kernel hardening.
Linux Kernel Hardening and security
Linux is one of the most popular operating systems for Embedded and IoT systems. Linux is a beast: it solves most embedded use cases, but supporting so many technologies comes at a cost in code complexity. Security hardening measures make it possible to hide this complexity, reduce the attack surface, contain apps, and defeat some kernel or user space exploitation techniques. The Kernel Self Protection Project tries to take this further by offering kernel self-protection features that aim to protect against known and unknown Linux kernel bugs and vulnerabilities.
Some Embedded Kernel Hardening features:
CONFIG_DEFAULT_MMAP_MIN_ADDR=32768 Disallow allocating the first 32k of memory to protect against kernel null pointer dereference and related exploits.
CONFIG_STRICT_KERNEL_RWX=y Make kernel text and rodata read-only. Kernel version of W^X.
CONFIG_STRICT_DEVMEM=y and CONFIG_IO_STRICT_DEVMEM=y restrict physical memory access.
CONFIG_SECCOMP=y and CONFIG_SECCOMP_FILTER=y allow userspace to reduce the attack surface.
CONFIG_STATIC_USERMODEHELPER=y Force all usermode helper calls through a single binary.
CONFIG_HARDENED_USERCOPY=y performs extra usercopy checks.
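As a minimal sketch, the options listed above can be checked against a kernel build configuration with a few lines of Python. The option names and values are the ones from the list (written with the CONFIG_ prefix, as they appear in a kernel .config file); the helper names parse_kconfig and missing_hardening are made up for this illustration, and the check only compares text, it does not prove that a feature is active on the running kernel.

```python
import re

# Options and values quoted in the list above, as they would appear in a .config file.
EXPECTED = {
    "CONFIG_DEFAULT_MMAP_MIN_ADDR": "32768",
    "CONFIG_STRICT_KERNEL_RWX": "y",
    "CONFIG_STRICT_DEVMEM": "y",
    "CONFIG_IO_STRICT_DEVMEM": "y",
    "CONFIG_SECCOMP": "y",
    "CONFIG_SECCOMP_FILTER": "y",
    "CONFIG_STATIC_USERMODEHELPER": "y",
    "CONFIG_HARDENED_USERCOPY": "y",
}

_OPTION_LINE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")


def parse_kconfig(text):
    """Return {option: value} for each 'CONFIG_X=value' line.

    Lines such as '# CONFIG_X is not set' do not match and are skipped,
    so disabled options are simply absent from the result.
    """
    options = {}
    for line in text.splitlines():
        match = _OPTION_LINE.match(line.strip())
        if match:
            options[match.group(1)] = match.group(2).strip('"')
    return options


def missing_hardening(text, expected=EXPECTED):
    """Return {option: (wanted, found)} for options that differ from the list.

    'found' is None when the option is not set at all.
    """
    found = parse_kconfig(text)
    return {
        name: (wanted, found.get(name))
        for name, wanted in expected.items()
        if found.get(name) != wanted
    }
```

To use it, read the .config of the kernel you ship (or the output of zcat /proc/config.gz, when the kernel exposes it) and pass the text to missing_hardening(); an empty result means all of the listed options have the listed values.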
For more security features, please check the Kernel Self Protection Project guide. There are a lot of features that need to be disabled on production systems. You should also consider trusted boot and TPM support from the beginning; they are not perfect, but they add an extra layer that measures the boot process and allows the TPM to act on those measurements.
Modernization of the proc file system: I have been working on modernizing the Linux proc API. procfs is an old virtual file system that exposes a lot of kernel information, and some of its files can be used as a source of information leaks to obtain kernel addresses and construct complex attacks. We have come to the conclusion that procfs needs to be updated and that it is blocking a lot of other Linux features. Together with Alexey Gladkov, and based on Andy Lutomirski's feedback, we have a branch here: https://github.com/legionus/linux/commits/pidfs-v4 that improves Linux procfs and protects exposed kernel files: only the /proc/<pids>/ files will be available.
It also allows a separate procfs instance per App, which means you are no longer forced to use PID namespaces to hide some processes. Simply mounting a new, separate procfs instance with “hidepid=” hides, to some degree, processes that belong to other users, without using PID namespaces. This is useful for Linux-based IoT, as we may not need multiple PID namespace managers. Another introductory RFC is planned soon to try to make the transition as smooth as possible.
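To see which “hidepid=” value each procfs mount is using, a small Python sketch can read a mount table in the /proc/mounts format. It assumes the usual whitespace-separated fields (device, mount point, file system type, options) and that a missing “hidepid=” option means the default of 0; the function name and the example mount points are hypothetical. Whether separate instances can really carry different values depends on the kernel, which is what the branch above is about.

```python
def procfs_hidepid(mounts_text):
    """Map each procfs mount point to its hidepid= value.

    mounts_text uses the /proc/mounts layout:
    device, mount point, fs type, comma-separated options, ...
    A mount without an explicit hidepid= option is reported as "0".
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "proc":
            continue  # not a procfs mount (or a malformed line)
        mount_point = fields[1]
        value = "0"
        for option in fields[3].split(","):
            if option.startswith("hidepid="):
                value = option[len("hidepid="):]
        result[mount_point] = value
    return result
```

For example, calling procfs_hidepid(open("/proc/mounts").read()) on a device lists every procfs instance, so you can verify that the instance handed to an App is the restricted one.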
Automatic module loading protection: historically, Linux has always been able to transparently load kernel modules to satisfy user functionality when needed, which is crucial for a good user experience. However, this feature can be abused to load vulnerable modules and exploit kernel bugs. Several examples come to mind, for instance:
kernel: Local privilege escalation in XFRM framework, CVE-2017-7184. It was reported that this vulnerability may have been used to break Ubuntu systems in security contests.
Most Embedded Linux systems should not allow module loading at all, or should allow only a subset of signed modules; however, there are products that need unrestricted module support. Inspired by the Grsecurity MODHARDEN feature, we have tried to implement a new generic module autoloading restriction that can be used by sandbox tools. The result is here: https://lkml.org/lkml/2017/5/22/312. Thanks to Kees Cook, Solar Designer and Andy Lutomirski for the feedback; I am planning v5.
Generalizing Yama LSM: Yama LSM is a Linux security module with the clear purpose of protecting processes from being controlled by other processes through ptrace. Recently, I have been experimenting with how to generalize this behaviour to other kernel interfaces and system calls. The aim is to have a simple interface that restricts some operations from acting on arbitrary processes, or on processes that run under a different user ID. This should be a default kernel hardening measure on Embedded Linux.
systemd Sandbox or systemd Lightweight Containers
Sandboxing IoT Apps is another important step: it reduces the exposure from misconfiguration, bugs, or vulnerability exploitation. As a simple example, BrickerBot and similar worms did not use complex 0-day exploits. They used simple attack vectors like unprotected remote shell accounts and, according to internet resources, a lot of IoT devices were affected. The straightforward solution in this case should be a firewall plus a powerful sandbox mechanism for apps.
While we continue to introduce new sandbox mechanisms with every systemd release, all of these mechanisms are currently opt-in. In the future we may add another run-time mode that makes the sandbox opt-out. Meanwhile, the systemd manager now allows you to run your apps from an image, like most other container runtimes. However, systemd does not use any standard image format, since most of the container runtimes that use this scheme are over-engineered, and some of them abuse Linux kernel features to hide other misbehavior. In systemd, right now we support Lightweight Containers: by using only file system mount namespaces to isolate and ship Apps with their dependencies, we avoid, for now, the complexity of container managers. Network namespaces are used sparingly, only to block internet access for Apps by disconnecting network interfaces. We may improve network namespace usage, but only to make it easy to integrate with the “ip” tool, which should handle all possible network cases for Embedded Linux setups.
The following lists some new systemd sandbox options:
1. New file system sandbox option:
RootImage= Takes a path to a block device, loopback file, etc that can be mounted as the new root filesystem for your App.
2. Some User privileges sandbox options:
DynamicUser= If set to yes, runs your App under a different user (Unix UID/GID). The UID is allocated dynamically and released when the App stops, allowing IoT devices to follow the Android model where each App is executed under a different user, separating Apps and their file access permissions.
NoNewPrivileges= If set, ensures that the App and all its children can never gain new privileges through execve().
3. Some Network sandbox options:
PrivateNetwork= If set to yes, will set up a new private network namespace with only loopback interface inside, disconnecting internet access.
IPAddressDeny= Takes an IP address prefix, all traffic from and to this address will be blocked for the App.
IPAddressAllow= The whitelist or permitted IP address/network mask list.
To block raw packets AF_PACKET you should also use:
RestrictAddressFamilies=~AF_PACKET (blacklisting mode).
We are working to make this more user friendly; maybe in the near future we will add an “ACCESS_INTERNET=yes|no” alias for those options to effectively block all inet or internet operations, including constructing raw packets and binding privileged ports.
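How IPAddressAllow= and IPAddressDeny= combine is easier to see in code. The sketch below models the evaluation order as described in the systemd.resource-control documentation, as I read it: an address that matches the allow list is accepted, otherwise an address that matches the deny list is rejected, otherwise it is accepted. It is only a model written with Python's ipaddress module; the real filtering is done by systemd and the kernel, and the function names and the handling of the “any” keyword here are assumptions made for the illustration.

```python
import ipaddress

# "any" is a symbolic name that stands for every IPv4 and IPv6 address.
# Only this one symbolic name is handled in this sketch.
SYMBOLIC = {"any": ["0.0.0.0/0", "::/0"]}


def parse_prefixes(value):
    """Parse a space-separated list of addresses/prefixes into network objects."""
    networks = []
    for item in value.split():
        for prefix in SYMBOLIC.get(item, [item]):
            # strict=False accepts host bits, e.g. "192.168.1.5/24".
            networks.append(ipaddress.ip_network(prefix, strict=False))
    return networks


def is_allowed(address, allow, deny):
    """Apply the allow-first, then deny, then default-accept order."""
    addr = ipaddress.ip_address(address)
    if any(addr in network for network in allow):
        return True
    if any(addr in network for network in deny):
        return False
    return True
```

With allow = parse_prefixes("192.168.1.0/24") and deny = parse_prefixes("any"), is_allowed("192.168.1.7", allow, deny) is True while is_allowed("8.8.8.8", allow, deny) is False, which is the usual "deny everything except the local network" setup.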
4. Kernel attack surface reduction:
RestrictNamespaces= Restricts access to Linux namespaces. Most IoT devices should reduce access to Linux user namespaces, since some vulnerabilities and exploits still target this feature. Use “RestrictNamespaces=yes” or “RestrictNamespaces=~user”.
ProtectKernelTunables= Blocks tuning of kernel parameters by making the kernel variables exposed through /proc/sys, /sys and related files read-only.
ProtectKernelModules= If set, removes the CAP_SYS_MODULE capability and blocks your App from explicitly loading or unloading modules.
SystemCallFilter= Seccomp system call filtering feature. In systemd we have organized Linux system calls into groups, inspired by the Google Chromium browser. You can restrict your App by functionality by blacklisting groups of system calls: prefix the list with “~” to turn it into a blacklist. As an example:
“~@reboot” will block all reboot-related system calls.
“~@module” will block all kernel module system calls.
“~@mount” will block all file system mount and umount system calls.
For more on system call filtering, please refer to the official systemd documentation, systemd.exec. We have a pretty usable system call filtering feature, and we are actively working on improving it. In the near future we are planning to add more usable groups and improve our sandbox model: https://github.com/systemd/systemd/pull/6963.
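To put the sandbox options from this section together, here is a minimal Python sketch that renders a service unit as text. The option names are the ones listed above; the image path, the binary path, the description and the render_service helper are all hypothetical, and whether this exact combination works for your App depends on the systemd version and on what the App needs, so treat it as a starting point to test rather than a recommended profile.

```python
# Sandbox settings taken from the options discussed in this section.
# Paths are placeholders for illustration only.
SANDBOX = [
    ("RootImage", "/var/lib/apps/sensor.img"),   # hypothetical image path
    ("DynamicUser", "yes"),
    ("NoNewPrivileges", "yes"),
    ("PrivateNetwork", "yes"),
    ("RestrictAddressFamilies", "~AF_PACKET"),
    ("RestrictNamespaces", "yes"),
    ("ProtectKernelTunables", "yes"),
    ("ProtectKernelModules", "yes"),
    # A leading "~" turns the whole list into a blacklist.
    ("SystemCallFilter", "~@reboot @module @mount"),
]


def render_service(description, exec_start, sandbox):
    """Return the text of a .service unit with the given (key, value) settings."""
    lines = [
        "[Unit]",
        f"Description={description}",
        "",
        "[Service]",
        f"ExecStart={exec_start}",
    ]
    for key, value in sandbox:
        # Refuse values that would break the one-setting-per-line layout.
        if "\n" in key or "\n" in value or "=" in key:
            raise ValueError(f"invalid setting: {key!r}")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

Calling render_service("Sensor App", "/usr/bin/sensor-app", SANDBOX) returns the unit text, which you can write to a file under /etc/systemd/system/ and then inspect with “systemd-analyze verify” before enabling it.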
Future plans for systemd: as the systemd project continues to evolve to handle new use cases, we have to face reality: we need to 1) reduce our functionality to better handle some IoT requirements, and 2) integrate with software update mechanisms. More generally, we have to support more user-friendly features. In the past, systemd was intended to be used by experienced service developers and SysVinit experts; today the user base is closer to users of the Container and Android App models. This does not mean that we have to copy those models, but we should start with a new, smooth run-time model.
Other Important Security Mechanisms
Nowadays, robust IoT products need to support a software update mechanism, since bugs and vulnerabilities may have a long lifetime. Recent research by Kees Cook suggests that the time between the introduction of a vulnerability and its final fix is “roughly 5 years”; if you are interested, you can read about it on his blog: Security bug lifetime.
IoT system updates are a hot topic, and the sessions at next week's “All systems go 2017” conference confirm this. Make sure to check:
“Software updates for connected Linux devices: key requirements” by Drew Moseley from Mender.io  
“The IoT botnet wars, Linux devices, and the absence of basic security hardening” by Drew Moseley from Mender.io  
“Updating Embedded Systems — Putting it all Together” by Michael Olbrich from pengutronix.de 
Updating full system images, operating on the block layer, and chunk-based updates are all neat features, and several new technologies are emerging; casync, the “Content Addressable Data Synchronizer”, is one example. Others, like Mender.io, already offer an over-the-air software update solution for Embedded Linux.
There are also other specialized solutions that use a container-based approach. The just-released balena, a Moby-based container engine for IoT that should replace the default Docker in resinOS, seems promising. There are use cases where full containers are the solution, and with the new balena, resinOS can be adapted to integrate easily with more resource-constrained embedded systems.
Finally, to conclude, the upcoming All systems go 2017 conference has some other interesting sessions related to Embedded Linux and security:
“Building a secure boot chain to userland” by Matthew Garrett 
“Securing Home Automation with Tor” by Kalyan Dikshit