How indeterminate is an indeterminate value?
One of the unwritten design aims of the C Standard is that it should be possible to fully implement the C Standard library in conforming C. It turned out that this was not possible in C90; the problem was implementing the memcpy function when the object being copied had a struct type containing one or more padding bytes. The memcpy library function copies the bytes in one object to another object. The padding bytes might be uninitialized (they have an indeterminate value), which means accessing them is undefined behavior (in C90), i.e., using memcpy to copy structs containing padding results in a non-conforming program.
struct {
   char c;  // Occupies 1 byte
            // Possible padding bytes here
   int  i;  // A 2/4-byte int sometimes has to be aligned on a 2/4-byte storage boundary
   };
Padding bytes could be set to a known value by, for instance, using memset to zero the storage; requiring this usage was thought to be excessive, and a hefty chunk of new wording was added in C99 (some of the issues raised by this problem also cropped up elsewhere, which contributed to the will to do this).
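A minimal sketch of that workaround (the struct tag and helper name below are invented for illustration): zeroing the whole object before use gives the padding bytes a known value.

#include <string.h>

struct padded {
   char c;   // possible padding bytes follow this member
   int  i;
   };

void init_padded(struct padded *p)   // hypothetical helper
{
   memset(p, 0, sizeof *p);   // padding bytes now hold a known (zero) value
   p->c = 'a';
   p->i = 42;
}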
One consequence of the new wording is that objects having type unsigned char are special: while their uninitialized value is still indeterminate, the possible set of values excludes a trap representation; they have an unspecified value, making accesses unspecified behavior (which conforming programs can contain). The uninitialized value of objects having other types can be a trap representation; it’s the possibility of a value being a trap representation that makes accessing such uninitialized objects undefined behavior.
All well and good: memcpy can now be implemented in conforming C(99) by copying unsigned chars.
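Something along the following lines would do (a sketch, not a tuned library implementation; the name my_memcpy is invented to avoid clashing with the standard function):

#include <stddef.h>

void *my_memcpy(void *dest, const void *src, size_t n)
{
   unsigned char       *d = dest;
   const unsigned char *s = src;

   while (n-- > 0)   // accessing bytes as unsigned char is never undefined behavior
      *d++ = *s++;
   return dest;
}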
Having made it possible for a conforming program to access an uninitialized object (having type unsigned char), questions about its actual value can be asked. Its value is indeterminate, you say; the clue is in the term indeterminate value. Ok, what does the following function return?
unsigned char random(void)
{
   unsigned char x;

   return x ^ x;
}
Exclusive-oring a value with itself always produces zero. An unsigned char takes, say, values 0 to 255; pick one and you always get zero; case closed. But where does it say that an indeterminate value is always the same value? There is no wording preventing an indeterminate value being different every time it is accessed. The sound of people not breathing could be heard when this was pointed out to WG14 (the C Standard’s committee), followed by furious argument on one side or the other.
The following illustrates one situation where the value of padding bytes could change with every access. The volatile qualifier specifies that the value of c could change between two accesses (e.g., it represents the storage layout of some memory-mapped I/O device). Perhaps any padding bytes following it are also effectively volatile-qualified.
struct {
   volatile char c;  // A changeable 1 byte
                     // Possible padding bytes may be volatile
   int i;            // No volatility here
   };
The local object x, above, is not associated with a volatile-qualified object. But, so what? Another unwritten design aim of the C Standard is to keep the wording simple, so edge cases are not called out and the behavior intended to handle padding bytes gets applied to local unsigned chars.
A compiler could decide that calls to random always return zero, based on the assumption that while indeterminate values may not be known, they are not time varying.
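In other words, an optimizer that treats an indeterminate value as stable between reads might effectively reduce the function to the following (a sketch of the reasoning, not the output of any particular compiler):

unsigned char random_as_optimized(void)   // hypothetical name
{
   return 0;   // x ^ x == 0 whenever both reads of x yield the same value
}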
The dance of the dead cat
The flagging of unspecified behavior in C source by static analysis tools (i.e., where the standard specifies two or more possible behaviors for translators to choose from when generating code) is starting to get rather sophisticated; some tools are working out and reporting how many different program outputs are possible given the unspecified behavior in the source (traditionally tools simply point at code containing an instance of unspecified behavior).
A while ago I created a function where the number of possible different values returned, driven by one unspecified behavior in the code, was a Fibonacci number (the value of the function parameter selected the n’th Fibonacci number).
Weird ways of generating Fibonacci numbers are great for entertaining fellow developers, but developers are unlikely to be amused by a tool that flags their code as generating a Fibonacci number of possible values (rather than the one value they are expecting).
The possibility of a Fibonacci number of different values relies on the generated code cycling through all the behaviors allowed by the standard at runtime. This can only happen if program execution uses a Just-in-Time compiler; the compile-once-before-execution implementations used by most developers generate far fewer possible permitted values (i.e., two for this particular example). Who would be crazy enough to write a JIT compiler for C, you ask? Those crazy people who translate C to JavaScript, for one.
To be useful for developers who are not using a JIT compiler to execute their C code, static analysis tools that work out the number of possible values that could be produced are going to have to support a code-compiled-before-execution option. Tools that don’t support this option will have what they report dismissed as weird possibilities that cannot happen in the developer’s world.
In the code below, does setting the code-compiled-before-execution option guarantee that the function Schrödinger always returns 1?
#include <stdbool.h>

int x;

int set_x(void)
{
   x=1;
   return 1;
}

int two_values(void)
{
   x=0;
   return x + set_x(); // different evaluation order -> different value
}

bool Schrödinger(void)
{
   return two_values() == two_values();
}
A compiler is free to generate code that returns a value that depends on whether two_values appears as the left or right operand of a binary operator. Not the kind of behavior an implementor would choose on purpose, but inlining both calls could result in different evaluation orders for the two instances of the code contained in what was the function body.
Our developer-friendly, sophisticated static analysis tool vendor is going to have to support a different_calls_to_same_function_have_same_behavior option.
How can a developer find out whether the compiler they are using always generates code having the same external behavior for different instances of calls, in the source, to the same function? Possibilities include reading the compiler source and asking the compiler vendor (perhaps future releases of gcc and llvm will support a different_calls_to_same_function_have_same_behavior option).
As far as I know (not having spent a lot of time looking), none of the tools doing the sophisticated unspecified behavior analysis support either of the developer-friendly options I am suggesting need to be supported. Tools I know about include: c-semantics (including its go-faster commercial incarnation) and ch2o (which the authors tell me executes the Fibonacci code a lot faster than c-semantics; last time I looked, ch2o needed more installation time for the support tools it uses than I was willing to spend, so I have not tried it); the Cerberus project has not released the source of their system yet (I’m assuming they will).
C code is 90% unspecified behavior: more uninformed scaremongering
Another C coding guidelines document, another clueless blanket ban on the use of code containing unspecified behavior (no link, so its visibility is not increased; the 90% is a back-of-the-envelope calculation, knock yourself out here).
The C Standard defines unspecified behavior as “… provides two or more possibilities and imposes no further requirements on which is chosen in any instance.” Given this one item of information, a ban on using constructs that contain unspecified behavior appears to be a good idea (writing code where the compiler gets to choose among several possible behaviors does not sound like a recipe for consistent program behavior).
What most people lack when thinking about unspecified behavior is an understanding of the design aims for the production of the C Standard; the aim was to be concise. An example of this conciseness is the wording for the order of evaluation of subexpressions: “… the order in which side effects take place are both unspecified.”
Consider the subexpression x+y; should the compiler evaluate x first (putting its value in a register) and then y (putting its value in another register), or should it evaluate y followed by x? In most situations the final result does not depend on the choice of evaluation order, and the Standard gives the compiler the freedom to choose the order that produces the best quality code.
A coding guideline that bans the use of code containing unspecified behavior bans the use of any binary operator (assignment is a binary operator in C, ruling out use of the statement z=0;). The only executable statements that could be written, following this guideline, would be calls of functions having zero or one argument (argument evaluation order is unspecified, which rules out calls containing two or more arguments) or global variables appearing on their own in an expression statement.
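A sketch of what such a literal reading rules out (the function and variable names below are invented for illustration):

int z, a, b;

int f(int p, int q) { return p + q; }
int g(void)         { return 0; }

void guideline_examples(void)   // hypothetical
{
   z = 0;     // banned: assignment is a binary operator, operand evaluation order unspecified
   f(a, b);   // banned: the order in which the two arguments are evaluated is unspecified
   g();       // allowed: no binary operator, zero arguments
}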
One case where operand evaluation order matters is printf("Hello")+printf("World"), which can result in either HelloWorld or WorldHello being printed (printf returns the number of characters written). This is an example of the kind of usage that the authors of coding guidelines want to ban.
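Here it is as a self-contained program (a sketch; the character count is always 10, but which word is printed first depends on the compiler’s choice):

#include <stdio.h>

int main(void)
{
   int total = printf("Hello") + printf("World");   // prints HelloWorld or WorldHello

   printf("\n%d characters written\n", total);      // always 10
   return 0;
}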
Coming up with guideline wording that delineates the undesirable unspecified behaviors from the harmless ones is hard. Requiring that the external behavior of code does not depend on the compiler’s choice of unspecified behavior is one possibility (now that power consumption can be an external behavior of note, this framing could be too narrow). The wording used by MISRA C is “No reliance shall be placed on … unspecified behavior”; this raises the flag that it is possible to rely on unspecified behavior and leaves it up to others to fill in the details.
Fibonacci and JIT compilers
To maximize a compiler’s ability to generate high quality code, many programming language standards do not specify the order in which the operands of a binary operator are evaluated. Most of the time the order of operand evaluation does not matter because all orders produce the same final result. For instance, in x + y the value of x may be obtained followed by the value of y and the two values added; alternatively, the value of y may be obtained first, followed by the value of x, and the two values added. Optimizers make use of local context information (e.g., holding one of the values in a register so it is immediately available for use in the evaluation of multiple instances of x) to work out which order produces the highest quality code.
If one of the operands is a function call that modifies the value of the other operand, the result may depend on the order of evaluation. For instance, in:
int x;

int set_x(void)
{
   x=1;
   return 1;
}

int two_values(void)
{
   x=0;
   return x + set_x();
}
a left-to-right evaluation order of the return expression in two_values produces the value 1, while a right-to-left evaluation order produces the value 2. Every now and again developers accidentally write code that does something like this and are surprised to see program behavior change when they use different compiler options, a new version of the compiler or a different compiler (all things that can result in the order of evaluation changing for a given expression).
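A small self-contained driver (a sketch using the definitions above) makes it easy to see which order a particular compiler, at a particular optimization setting, has chosen:

#include <stdio.h>

int x;

int set_x(void)
{
   x=1;
   return 1;
}

int two_values(void)
{
   x=0;
   return x + set_x();
}

int main(void)
{
   printf("two_values() returned %d\n", two_values());   // 1 (left-to-right) or 2 (right-to-left)
   return 0;
}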
It is generally assumed that two calls to two_values within the same program will always produce the same answer, i.e., the decision on which order of evaluation to use is made at compile time and never changes while a program is running. Neither the C++ nor the C Standard requires that the evaluation order be fixed prior to program execution, and use of a just-in-time compiler could result in the value returned by two_values alternating between 1 and 2. Some languages (e.g., Java) for which JIT compilers are expected to be used specify the exact order in which operands have to be evaluated.
One possible way of finding out whether a JIT compiler is being used for your C/C++ program is to test the values returned by two calls to two_values. A JIT compiler may, but is not required to, return different values on each call. Of course a decide-prior-to-execution C/C++ compiler does have the option of optimizing the following function to return true or false on the basis that the standard permits this behavior, but I would be very surprised to see this occur in practice:
bool Schrödinger(void)
{
   return two_values() == two_values();
}
The runtime variability offered by JIT compilers makes it possible to write a program whose output can be any value between 1 and the n’th Fibonacci number, where n is read from the input:
#include <stdio.h>

int x;

int set_x(int ret_val)
{
   x=1;
   return ret_val;
}

int two_unspec(void)
{
   x=0;
   return x + set_x(1);
}

int add_zero(int val)
{
   x=0;
   return val - x + set_x(0);
}

int fib(int fib_num)
{
   if (fib_num > 3)
      return fib(fib_num-2) + add_zero(fib(fib_num-1));
   else if (fib_num == 3)
      return two_unspec();
   else
      return 1;
}

int main(void)
{
   int n;

   scanf("%d", &n);
   printf("Fibonacci %d = %d\n", n, fib(n));
}
The C-semantics tool will ‘execute’ this program and produce a list of the possible outputs (a PhD project under active development; the upper limit is currently around the 6th Fibonacci number which requires 11 hours and produces a 50G log file).
A decide-prior-to-execution compiler has a maximum choice of four possible outputs for a given input value, and for a given executable the output produced by different input values will be perfectly correlated.