The Shape of Code

A safety-critical certification of the Linux kernel

May 25, 2025 Derek Jones

This week there was an announcement on the system-safety mailing list that the Red Hat In-Vehicle Operating System (a version of the Linux kernel, plus a few subsystems) had been certified as being “… capable for use in ASIL B applications, …”. The Automotive Safety Integrity Levels (ASIL A is the lowest level, with D the highest; an example of ASIL B is controlling brake lights) are defined by ISO 26262, an international standard for functional safety of electrical and/or electronic systems installed in production road vehicles.

Given all I had heard about the problems that needed to be solved to get a safety certification for something as large and complicated as the Linux kernel, I wanted to know more about how Red Hat had achieved this certification.

The traditional, idealised, approach to certifying software is to check that all requirements are documented and traceable to the design, source code, tests, and test results. This information can be used to ensure that every requirement is implemented and produces the intended behavior, and that no undocumented functionality has been implemented.
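A minimal sketch of the two checks this kind of traceability enables: every requirement has an implementation and a test, and no code exists that is unaccounted for by a requirement. The requirement ids, function names and test names are all hypothetical, invented for illustration; real traceability tooling tracks far more (design items, test results, versions).

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One documented requirement and the artifacts traced to it."""
    req_id: str
    text: str
    code_units: list = field(default_factory=list)  # functions implementing it
    tests: list = field(default_factory=list)       # tests exercising it


def untraced_requirements(reqs):
    """Ids of requirements that lack an implementation or a test."""
    return [r.req_id for r in reqs if not r.code_units or not r.tests]


def undocumented_code(reqs, all_code_units):
    """Code units that no requirement accounts for."""
    traced = {unit for r in reqs for unit in r.code_units}
    return sorted(set(all_code_units) - traced)


# Hypothetical data: REQ-2 has no test, and do_debug_dump has no requirement.
reqs = [
    Requirement("REQ-1", "Opening a missing file fails",
                ["do_open"], ["test_open_missing"]),
    Requirement("REQ-2", "Closing a bad descriptor fails",
                ["do_close"], []),
]
print(untraced_requirements(reqs))   # ['REQ-2']
print(undocumented_code(reqs, ["do_open", "do_close", "do_debug_dump"]))  # ['do_debug_dump']
```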

When this approach is not practical (because of onerous time/cost), a potential get-out-of-jail card is to use a Rigorous Development Process. Certification-friendly development processes appear to revolve around lots of bureaucracy and following established buzzword techniques. The only development processes that have sometimes produced very reliable software all involve spending lots of time/money.

The major problems with certifying Linux are the apparent lack of specification/requirements documents, and a development process that does not claim to be rigorous.

The Red Hat approach is to treat the Linux man pages as the specification, extract the requirements from these pages, and then write the appropriate tests. Traceability looks like it is currently on the to-do list.
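As an illustration of what turning a man page into tests might look like (a sketch, not Red Hat's actual method, which the post does not describe), take one requirement paraphrased from the ERRORS section of open(2): opening a file that does not exist, without O_CREAT, fails with ENOENT. The function names are hypothetical, and Python's os.open is used as a convenient wrapper around the underlying call; a certification suite would presumably be written in C.

```python
import errno
import os
import tempfile


def check_open_existing(path):
    """Positive case: opening an existing file read-only yields a valid descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return fd >= 0
    finally:
        os.close(fd)


def check_open_missing(path):
    """Negative case: without O_CREAT, a missing file must fail with ENOENT."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return exc.errno == errno.ENOENT
    os.close(fd)  # the open unexpectedly succeeded
    return False


def run_open_checks():
    """Run both cases against a scratch directory; returns (positive, negative)."""
    with tempfile.TemporaryDirectory() as tmp:
        present = os.path.join(tmp, "present.txt")
        with open(present, "w") as f:
            f.write("x")
        missing = os.path.join(tmp, "missing.txt")
        return check_open_existing(present), check_open_missing(missing)


print(run_open_checks())
```

Each man page sentence of this kind becomes one requirement with a positive/negative pair of checks; recording which sentence each check came from is the traceability step.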

I have spent a lot of time working to understand specifications and their requirements; first with the C Standard and then with the Microsoft Server Protocols. This is the first time I have encountered man pages being used in a formal setting (sometimes they are used as one of the inputs to a reimplementation of a library).

Very little Open source software has a written specification in the traditional sense of a document cited in a contract that the vendor agrees to implement. Manuals, READMEs and help pages are not written in the formal style of a specification. A common refrain is that the source code is the specification. However, source is a specification of what the program does; it is not a specification of what the program is supposed to do.

The very nature of the Agile development process demands that there not be a complete specification. It’s possible that user stories could be treated as requirements.

There is an ISO Standard with Linux in its title: the Linux Standard Base. The goal of this Standard “… is to develop and promote a set of open standards that will increase compatibility among Linux distributions and enable software applications to run on any compliant system …”, i.e., it is not a specification of an OS kernel.

POSIX is a specification of the behavior of an OS (kernel functionality is specified by POSIX.1, .2 is shell and utilities, plus other .x documents). It’s many years since I tracked POSIX/Linux compliance, which was best described as “highly compatible”. Both Grok3 and ChatGPT o4 agree that “highly compatible” is still true, and list some known incompatibilities.

While they are not written in the form of a specification, the Linux man pages do have a consistent structure and are intended to be up-to-date. A person with a background of working with Linux kernels could probably extract meaningful requirements.

How many requirements are needed to cover the behavior of the Linux kernel?

On my computer running a 6.8.0-51 kernel, the /usr/include/linux directory contains 587 header files. Based on an analysis from 20 years ago (Table 1897.1), most of these headers only declare macros and types (e.g., memory layout), not functions. The total number of function declarations in these headers is probably in the low thousands. POSIX (2008 version) defines 1,177 functions, but the number of system calls is probably around 300-400. Android implements 821 of these functions, of which 343 are system call related.
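Counts like these are easy to recheck on a local machine. The following is a rough sketch, not the method used for the figures above: it walks a directory recursively (so its header count can differ from a count of the top-level directory only) and counts lines that start a #define, which over-counts include guards and says nothing about function declarations.

```python
import os
import re

# Matches the start of a macro definition, allowing "#  define" spacing.
DEFINE_RE = re.compile(r"^[ \t]*#[ \t]*define[ \t]+\w+", re.MULTILINE)


def count_headers_and_macros(root):
    """Return (number of .h files, number of #define lines) found under root."""
    n_headers = 0
    n_defines = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".h"):
                continue
            n_headers += 1
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                n_defines += len(DEFINE_RE.findall(f.read()))
    return n_headers, n_defines


if __name__ == "__main__":
    root = "/usr/include/linux"  # the directory named in the post
    if os.path.isdir(root):
        print(count_headers_and_macros(root))
```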

Let’s assume 2,000 functions. Some of these functions have an argument that specifies one or more optional values, each specifying a different sub-behavior. How many different sub-behaviors are there? If we assume that each kind of behavior is specified using a C macro, then Table 1897.1 suggests there might be around 10k C macros defined in these headers.

With positive/negative tests for each case, in round numbers we get (ignoring explicit testing of the values of struct members): (2,000+10,000)*2 = 24,000 test files.
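The estimate is simple enough to parameterise, which makes it easy to see how sensitive it is to the starting assumptions. In this sketch the first call uses the post's round numbers; the second uses hypothetical smaller figures, chosen only to illustrate a heavily cut-down kernel configuration.

```python
def estimated_test_files(functions, option_macros, cases_per_item=2):
    """One positive and one negative test file per function and per option macro."""
    return (functions + option_macros) * cases_per_item


# Round numbers from the post: 2,000 functions and 10,000 macros.
print(estimated_test_files(2_000, 10_000))  # 24000

# Hypothetical reduced configuration (illustrative values, not measurements).
print(estimated_test_files(400, 2_000))     # 4800
```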

This calculation does not take into account combinations of options. I’m assuming that each test file will loop through various combinations of its kind of sub-behavior.
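A sketch of what one such test file's loop might look like, assuming a Unix-like system. The three open() flags are picked only for illustration because they can be freely combined on an existing file; k independent flags give 2**k combinations, which is where the multiplier on the number of actual tests comes from. The check here is deliberately trivial (the open succeeds); a real test would also verify the behavior each flag is supposed to produce.

```python
import itertools
import os
import tempfile

# Illustrative choice of flags that can be combined freely (Unix-like systems).
OPTIONAL_FLAGS = [os.O_APPEND, os.O_TRUNC, os.O_CLOEXEC]


def flag_combinations(flags):
    """Yield every subset of flags OR-ed into one integer: 2**len(flags) values."""
    for r in range(len(flags) + 1):
        for subset in itertools.combinations(flags, r):
            combined = 0
            for flag in subset:
                combined |= flag
            yield combined


def run_combination_test():
    """Open a scratch file with each flag combination; return how many opens succeeded."""
    passed = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scratch.txt")
        with open(path, "w") as f:
            f.write("data")
        for extra in flag_combinations(OPTIONAL_FLAGS):
            fd = os.open(path, os.O_WRONLY | extra)
            os.close(fd)
            passed += 1
    return passed


print(run_combination_test())  # 8, i.e. 2**3 combinations
```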

The 1990 C compiler validation suite contained around 1k tests. Thirty-five years later, 24k test files for a large OS feels low, but then combination testing should multiply the number of actual tests by at least an order of magnitude.

What is this hand-wavy analysis missing?

I suspect that the kernel is built with most of the optional functionality conditionally compiled out. This could significantly reduce the number of API functions and the supported options.

I have not taken into account any testing of the user-visible kernel data structures (because I don’t have any occurrence data).

Comments from readers with experience in testing OSes are most welcome.

Another source of Linux-specific information is the Linux Kernel documentation project. I don’t have any experience using this documentation, but the API documentation is very minimalist (automatically extracted from the source; the Assessment report lists this document as [D124], but never references it in the text).

Readers familiar with safety standards will be asking about the context in which this certification applies. Safety functions are not generic; they are specific to a safety-related system, i.e., software+hardware. This particular certification is for a Safety Element out of Context (SEooC), where “out of context” here means without the context of a system or knowledge of the safety goals. SEooC supports a bottom-up approach to safety development, i.e., Safety Elements can be combined, along with the appropriate analysis and testing, to create a safety-related system.

This certification is the first of what I think will be many certifications of Linux, some at more rigorous safety levels.

Categories: Uncategorized Tags: API, certify, documentation, Linux, requirements, safety critical, specification, testing

Luca

May 26, 2025 14:58 | #1

Since I’m developing some parts of an IEC61508-certified product, and have followed some of the ISO26262 requirements for a similar integrity level, I have a bit of a personal take on this. However, let me add that I’m no expert; even though I have read all the standards, I remain a newbie. Nonetheless, when I think about what I’ve been through for a humble 10 kLOC application, my first thought about ASIL-B Linux is “it’s a joke”. The second thought is “They are heavyweights and paid what was needed to cheat”. Again, my bad thoughts may well be unfounded.
I’d just like to know, with such a large project (that wasn’t born with safety as a primary objective), how they adhered to a strict MISRA (or similar) safety standard, and how they achieved 100% unit testing, both with an already certified tool. How they dropped all pointer arithmetic and value casting. How they documented each safety-related function in its interaction with the whole system, and for countermeasure actions. How they set up periodic diagnostics on the whole system with a safe state and actions to be performed on failure. And what T3-level certified software tools they employed to generate code and configure the system to IEC61784, as for all the communication channels.
As a novice developer in the safety systems world, with limited knowledge, I can only have a few hundred pages of questions popping into my mind.
It’s a joke.

Derek Jones

May 26, 2025 15:43 | #2

@Luca
I look at the various ASIL levels as steps towards more reliable software. Companies start at the bottom and get something certified to control headlights. Then, hopefully, competition kicks in, another manufacturer does a better job of getting the same certification, and so on. Safety level and third-party opinion of the work undertaken become a sales tool.

In the case of Open source, there is a possibility of many people pitching in to help. It depends on what the various vendors publish; after all, these certifications do offer a competitive edge.

MISRA started out a long way from where it is today. In the early days, many tools were generating warnings that had nothing to do with the source being flagged. Over time, customer demand significantly improved the quality of tools; the poorer-quality ones didn’t sell very well.

Luca

May 27, 2025 09:44 | #3

Hundreds of experts (and many more monkeys typing on keyboards) seem to have taken a few years to make it (with the help of a mentioned certificator) https://elisa.tech/blog/

Still, three years ago, their presentation was wondering “What if we can’t reach 100% Code Coverage?”.
I feel like it’s just one of the many (hundreds) mandatory requirements for any ASIL/SIL level.
I can write code testable for 100% coverage, compliant with any ASIL level, and intended to do nothing useful. Even better if it’s buried in a very large project, a Rube Goldberg safe machine.

I don’t know, I’m doubtful about the concept that just throwing in more programmers of good will can lead to any achievement – as a new era “mythical man-month”.

Derek Jones

May 27, 2025 10:10 | #4

@Luca
Typing monkeys could probably get to 95% coverage, then life becomes very difficult. There is probably dead code that will take some thinking about for people to realise is dead, and there will be code that requires really hard to replicate conditions. Then there will be the code that looks like it could be executed, but nobody knows how.

Any safety-critical Linux kernel is very likely to be a small subset of the distributed kernel.

My experience with creating a highly conformant C compiler is discussed here.
