2019 in the programming language standards’ world
Last Tuesday I was at the British Standards Institution for a meeting of IST/5, the committee responsible for programming language standards in the UK.
There has been progress on a few issues discussed last year, and one interesting point came up.
It is starting to look as if there might be another iteration of the Cobol Standard. A handful of people, in various countries, have started to nibble around the edges of various new (in the Cobol sense) features. No, the INCITS Cobol committee (the people who used to do all the heavy lifting) has not been reformed; the work now appears to be driven by people who cannot let go of their involvement in Cobol standards.
ISO/IEC 23360-1:2006, the ISO version of the Linux Base Standard, has been updated and we were asked for a UK position on the document being published. Abstain seemed to be the only sensible option.
Our WG20 representative reported that the ongoing debate over pile of poo emoji has crossed the chasm (he did not exactly phrase it like that). Vendors want to have the freedom to specify code-points for use with their own emoji, e.g., pineapple emoji. The heady days, of a few short years ago, when an encoding for all the world’s character symbols seemed possible, have become a distant memory (the number of unhandled logographs on ancient pots and clay tablets was declining rapidly). Who could have predicted that the dream of a complete encoding of the symbols used by all the world’s languages would be dashed by pile of poo emoji?
The interesting news is from WG9. The document intended to become the Ada20 standard was due to enter the voting process in June, i.e., the committee considered it done. At the end of April the main Ada compiler vendor asked for the schedule to be slipped by a year or two, to enable them to get some implementation experience with the new features; oops. I have been predicting that in the future language ‘standards’ will be decided by the main compiler vendors, and the future is finally starting to arrive. What is the incentive for the GNAT compiler people to pay any attention to proposals written by a bunch of non-customers (ok, some of them might work for customers)? One answer is that Ada users tend to be large bureaucratic organizations (e.g., the DOD), who like to follow standards, and might fund GNAT to implement the new document (perhaps this delay by GNAT is all about funding, or lack thereof).
Right on cue, C++ users have started to notice that C++20 added support for a system header with the name version, which conflicts with much existing practice of using a file called version to contain versioning information; a problem if the header search path used by the compiler includes a project’s top-level directory (which is where the versioning file version often sits). So the WG21 committee decides on what it thinks is a good idea, implementors implement it, and users complain; implementors now have a good reason not to follow a requirement in the standard, to keep users happy. Will WG21 be apologetic, or get all high and mighty? We will have to wait and see.
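To make the clash concrete, here is a minimal sketch (the project layout, file contents and compiler invocation are my assumptions, not taken from any particular bug report) of how a top-level version file can shadow the standard header:

// Hypothetical project layout:
//   myproject/version        <- plain text file containing "1.2.3"
//   myproject/src/main.cpp
//
// Building with the project root on the include search path, e.g.:
//   g++ -std=c++20 -I. src/main.cpp
// makes the project's "version" file a candidate for <version>, so the
// include below may pull in "1.2.3" instead of the standard header and
// produce baffling compile errors.
#include <version>   // intended: the C++20 header defining library feature-test macros

int main() {
#ifdef __cpp_lib_ranges   // one of the macros <version> is supposed to provide
    return 0;
#else
    return 1;
#endif
}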
The first commercially available (claimed) verified compiler
Yesterday, I read a paper containing a new claim by some of those involved with CompCert (yes, they of soap powder advertising fame): “CompCert is the first commercially available optimizing compiler that is formally verified, using machine assisted mathematical proofs, to be exempt from miscompilation”.
First commercially available; really? Surely there are earlier claims of verified compilers being commercially available. Note, I’m saying claims; bits of the CompCert compiler have involved mathematical proofs (i.e., code generation), so I’m willing to consider earlier claims made with at least the level of intellectual honesty used in some CompCert papers (a very low bar).
What does commercially available mean? The CompCert system is open source (but is not free software), so I guess it’s commercially available via free download/licensing from AbsInt (the paper does not define the term).
Computational Logic, Inc is the name that springs to mind, when thinking of commercial and formal verification. They were active from 1983 to 1997, and published some very interesting technical reports about their work (sadly there are gaps in the archive). One project was A Mechanically Verified Code Generator (in 1989) and their Gypsy system (a Pascal-like language+IDE) provided an environment for doing proofs of programs (I cannot find any reports online). Piton was a high-level assembler and there was a mechanically verified implementation (in 1988).
There is the Danish work on the formal specification of the code generators for their Ada compiler (while there was a formal specification of the Ada semantics in VDM, code generators tend to be much simpler beasts, i.e., a lot less work is needed in formal verification). The paper I have is: “Retargeting and rehosting the DDC Ada compiler system: A case study – the Honeywell DPS 6” by Clemmensen, from 1986 (cannot find an online copy). This Ada compiler was used by various hardware manufacturers, so it was definitely commercially available for (lots of) money.
Are there, then, any earlier verified compilers with a commercial connection? There is A PRACTICAL FORMAL SEMANTIC DEFINITION AND VERIFICATION SYSTEM FOR TYPED LISP, from 1976, which claims: “… has proved a number of interesting, non-trivial theorems including the total correctness of an algorithm which sorts by successive merging, the total correctness of the McCarthy-Painter compiler for expressions, …” (which sounds like a code generator, or part of one, to me).
Francis Morris’s thesis, from 1972, proves the correctness of compilers for three languages (each language contained a single feature) and discusses how these features may be combined into a more “realistic” language. No mention of commercial availability, but I cannot see the demand being that great.
The definition of PL/1 was written in VDM, a formal language. PL/1 is a huge language and there were lots of subsets. Were there any claims of formal verification of a subset compiler for PL/1? I have had little contact with the PL/1 world, so am not in a good position to know. Anybody?
Over to you dear reader. Are there any earlier claims of verified compilers and commercial availability?
Microcomputers ‘killed’ Ada
In the mid-70s the US Department of Defense decided it could save lots of money by getting all its contractors to write code in the same programming language. In February 1980 a language was chosen, Ada, but by the end of the decade the DoD had snatched defeat from the jaws of victory; what happened?
I think microcomputers are what happened; these created whole new market ecosystems, some of which were much larger than the ecosystems that mainframes and minicomputers sold into.
These new ecosystems sucked up nearly all the available software developer mind-share; the DoD went from being a major employer of developers to a niche player. Developers did not want a job using Ada because they thought that being type-cast as Ada programmers would overly restrict their future job opportunities; Ada was perceived as a DoD only language (actually there was so little Ada code in the DoD, that only by counting new project starts could it get any serious ranking).
Lots of people were blindsided by the rapid rise (to world domination) of microcomputers. Compilers could profitably be sold (in some cases) for tens or hundreds of dollars/pounds because the markets were large enough for this to be economically viable. In the DoD ecosystems, compilers had to be sold for thousands or hundreds of thousands of dollars/pounds because the markets were small. Micros were everywhere and being programmed in languages other than Ada; cheap Ada compilers arrived after today’s popular languages had taken off. There is no guarantee that cheap compilers would have made Ada a success, but they would have ensured the language was a serious contender in the popularity stakes.
By the start of the 1990s, Ada supporters were reduced to funding studies to produce glowing reports of the advantages of Ada compared to C/C++ and how Ada had many more compilers, tools and training than C++. Even the 1991 mandate “… where cost-effective, all Department of Defense software shall be written in the programming language Ada, in the absence of special exemption by an official designated by the Secretary of Defense.” failed to have an impact and was withdrawn in 1997.
The Ada mandate was cancelled as the rise of the Internet created even bigger markets, which attracted developer mind-share towards even newer languages, further reducing the comparative size of the Ada niche.
Astute readers will notice that I have not said anything about the technical merits of Ada, compared to other languages. Like all languages, Ada has its fanbois; these are essentially much older versions of the millennial fanbois of the latest web languages (e.g., Go and Rust). There is virtually no experimental evidence that any feature of any language is better/worse than any feature in any other language (a few experiments show weak support for stronger typing). To its credit, the DoD did fund a few studies, but these used small samples (there was not yet enough Ada usage to make larger samples possible) that were suspiciously glowing in their support of Ada.
2017 in the programming language standards’ world
Yesterday I was at the British Standards Institution for a meeting of IST/5, the committee responsible for programming languages.
The amount of management control over those wanting to get to the meeting room, from outside the building, has increased. There is now a sensor activated sliding door between the car-park and side-walk from the rear of the building to the front, and there are now two receptions; the ground floor reception gets visitors a pass to the first floor, where a pass to the fifth floor is obtained from another reception (I was totally confused by being told to go to the first floor, which housed the canteen last time I was there, and still does, the second reception is perched just inside the automatic barriers to the canteen {these barriers are also new; the food is reasonable, but not free}).
Visitors are supposed to show proof that they are attending a meeting, such as a meeting calling notice or an agenda. I have always managed to look sufficiently important/knowledgeable/harmless to get in without showing any such documents. I was asked to show them this time, perhaps my image is slipping, but my obvious bafflement at the new setup rescued me.
Why does BSI do this? My theory is that it’s all about image, BSI is the UK’s standard setting body and as such has to be seen to follow these standards. There is probably some security standard for rules to follow to prevent people sneaking into buildings. It could be argued that the name British Standards is enough to put anybody off wanting to enter the building in the first place, but this does not sound like a good rationale for BSI to give. Instead, we have lots of sliding doors/gates, multiple receptions (I suspect this has more to do with a building management cat fight over reception costs), lifts with no buttons ‘inside’ for selecting floors, and proof of reasons to be in the building.
There are also new chairs in the open spaces. The chairs have very high backs and side-baffles that surround the head area, excellent for having secret conversations and in-tune with all the security. These open areas are an image of what people in the 1970s thought the future would look like (BSI is a traditional organization after all).
So what happened in the meeting?
Cobol standard’s work becomes even more dead. PL22.4, the US Cobol group is no more (there were insufficient people willing to pay membership fees, so the group was closed down).
People are continuing to work on Fortran (still the language of choice for supercomputer Apps), Ada (some new people have started attending meetings and support for @ is still being fought over), C, and Internationalization (all about character sets these days). Unprompted, somebody pointed out that the UK C++ panel seemed to be attracting lots of people from the financial industry (I was very professional and did not relay my theory that it’s all about bored consultants wanting an outlet for their creative urges).
SC22, the ISO committee responsible for programming languages, is meeting at BSI next month, and our chairman asked if any of us planned to attend. The chair’s response, to my request to sell the meeting to us, was that his vocabulary was not up to the task; a two-day management meeting (no technical discussions permitted at this level) on programming languages is that exciting (and they are setting up a special reception so that visitors don’t have to go to the first floor to get a pass to attend a meeting on the ground floor).
Ada updated to support undefined behavior
The Ada language has now almost completely disappeared from developer consciousness and the Ada Standards’ committee has decided this desperate situation calls for desperate actions.
Undefined behavior in programming languages has been getting a lot of publicity over the last few years. It might not be good publicity, but the Ada committee has taken to heart the old adage ‘The only thing worse than being talked about is not being talked about.’
While the Ada standard does not use the phrase undefined behavior, it does contain a very close equivalent. The Ada community has bitten the bullet and decreed that constructs previously denoted by the term bounded error shall henceforth be referred to as undefined behavior. Technically, constructs generating a bounded error have effectively always been undefined behavior, but the original positioning of Ada as a sophisticated language suitable for critical applications required something less in-your-face unsafe sounding.
Will this minor change of wording (ISO rules are very strict on what changes can be made to published standards without going through long winded voting procedures) make any difference? I guess it will enable Ada users to jump on the ‘undefined behavior’ bandwagon (their claims that Ada was better because it did not contain undefined behavior always has the effect of annoying people, rather than generating any interest in changing languages).
I think there is a bigger public perception problem than the terminology used to denote undefined behavior. Ada Lovelace was born 200 years ago and gets lots of publicity as the first programmer; somehow this association with Lovelace has transferred to the language named after her, generating a perception of an old, fuddy-duddy language (being strongly typed has not helped, but this feature does appear to be coming back into fashion).
The following are some numbers from a while ago (both standards have been revised since these measurements were made).
In the Ada 2005 Standard there are:
• 36 occurrences of bounded error.
• 53 occurrences of the subsection header Erroneous execution and 116 occurrences of erroneous.
• 343 occurrences of implementation-defined.
• 373 occurrences of may, some of which describe optional behavior.
• 22 occurrences of must, some of which read as-if shall was intended.
• 38 occurrences of optional.
• zero occurrences of processor dependent and processor-dependent.
• 1,018 occurrences of shall, of which 131 have the form shall not.
• 452 occurrences of should.
• 8 occurrences of undefined, one referring to an undefined range, three having the form undefined range and the rest occurring in annexes (also see bounded error).
• 89 occurrences of unspecified.
In the C99 Standard there are:
• zero occurrences of bounded error.
• 3 occurrences of erroneous.
• 163 occurrences of implementation-defined.
• 862 occurrences of may.
• 1 occurrence of must.
• 34 occurrences of optional.
• zero occurrences of processor dependent and processor-dependent.
• 596 occurrences of shall, of which 71 have the form shall not.
• 63 occurrences of should.
• 183 occurrences of undefined.
• 98 occurrences of unspecified.
In the C++ 2003 Standard (which now contains many more pages) there are:
• zero occurrences of bounded error.
• 5 occurrences of erroneous.
• 236 occurrences of implementation-defined.
• 371 occurrences of may.
• 111 occurrences of must.
• 30 occurrences of optional.
• zero occurrences of processor dependent and processor-dependent.
• 779 occurrences of shall, of which 211 have the form shall not.
• 38 occurrences of should.
• 195 occurrences of undefined.
• 113 occurrences of unspecified.
Operator assignment may be in the next Ada standard
WG9, the Ada Standard committee, are looking into adding support for operator assignment in the next revision of the language standard. Ada was designed back in the day when C’s operator assignments, such as +=, were a novelty (all the languages we know today that follow C’s lead, including choice of operator precedence, did not yet exist); developers using serious languages knew that typing everything out in full was good for you.
It’s amazing to read about “… those who have been damaged by terser languages…”, in the proposal comments. Do I only think that My_Package.My_Array(I).Field +:= 1; is so much easier to read than My_Package.My_Array(I).Field := My_Package.My_Array(I).Field + 1; because my brain has been damaged? In the second statement I have to parse two subexpressions and notice that they refer to the same object, while the first statement only requires me to parse one subexpression, and the +:= operator tells me that something is being added to its value. The first statement requires a lot less effort and is likely to be less error prone (I bet Ada programmers make cut and paste errors just like the rest of us).
Did you notice the use of round brackets to specify an array index? Yes, the eighties were the decade when the programming language world was badgered into being politically correct and only requiring the use of characters that appeared on every European country’s keyboard.
The Ada proposal started out following the C convention, e.g., +=, *=, etc, then switched to :+, :*, etc. The assignment token in Ada is := and there is a logic to switching out the = for another operator. However, I would argue that +:, *:, etc would be a less error prone ordering of characters. People pay more attention to the first character in a sequence, and somebody in a hurry may not notice that the colon is not followed by a =; putting the ‘unexpected’ character first is more likely to catch the reader’s attention.
Another proposal is for the @ token to act as a placeholder. The only place I can see this being useful on any regular basis is in bitwise manipulation: My_Package.My_Array(I).Field := (@ << 16) or @; (Ada does not support << as an operator, so reality is a bit more complicated than this). I don't think there are enough situations where placeholders would be useful to warrant using a solution that is more complicated than operator assignment.
One proposed advantage of the new fangled operators is making life easier for non-Ada users (see the discussion in the proposal comments). The ‘I have put lots of effort into Ada, why should others have it easy’ die-hards are fighting a rear-guard action to make sure life is not too easy for newcomers.
C++ vs. Ada: Which language is more strongly typed?
Programming languages are sometimes categorized as being either weakly or strongly typed. I’m not going to join the often rabid debates over which category a particular language belongs to, but rather discuss the relative type strengths of two languages, C++ and Ada, to see if it is possible to claim that one of them is more strongly typed than the other.
Most programming languages support variables having more than one type (e.g., integer and float are two very common types) and have rules specifying which combinations of differently typed values/variables are permitted to occur in a given context, e.g., C++ allows a value of type int to be assigned to a variable of type float (an implicit conversion is performed), but Ada does not perform this implicit conversion and the integer value has to be explicitly converted to float before it can be assigned (otherwise a compile time error will be generated).
The more implicit type conversions a language supports the weaker its type system is said to be.
C++ supports more implicit conversions than Ada (others include boolean/int and char/int) and loses type strength points because of this (there is plenty of scope for debate about whether some implicit conversions are more evil than others, but cost/benefit debates are harder to come by).
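A minimal sketch of these implicit conversions in C++ (the Ada equivalents would each require an explicit conversion, and what warnings a given compiler emits will vary):

// All of these assignments compile in C++ because the conversions are implicit.
int main() {
    int   i = 23;
    float f = i;   // int -> float, performed silently
    bool  b = i;   // int -> bool (non-zero becomes true)
    char  c = i;   // int -> char
    return static_cast<int>(f) + b + c;  // bool and char implicitly promote back to int
}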
While C++/Ada differ in their support for implicit conversions, they are pretty equal in their support for explicit conversions (e.g., in Ada the code float(23) would convert the integer 23 to a float type). In some cases Ada requires that various hoops be jumped through to make the conversion happen (representation clauses are a great topic to bring up when being lectured about how type safe Ada is, a bit like telling somebody being snobbish that they go to the bathroom like everybody else).
The underlying idea is that the compilation errors generated by these ‘undesirable’ attempted implicit conversions alert the developer to a possible mistake in what they have written. These kinds of messages from the compiler have certainly caught errors in my code, but often the error has been a failure to write the required explicit conversion; every now and again a ‘real’ error is flagged. But I digress, this discussion is about what weak/strong typing is, not about what its costs and benefits might be.
Does Ada have any other feature that increases its type strength with respect to C++?
Both languages allow names to be given to existing types: typedef int length_t; in C++ and subtype length_t is integer; in Ada both define length_t to be a synonym for the integer type, but without resulting in any extra type checks occurring. However, Ada supports a kind of type definition mechanism that does result in extra checks being made by the compiler. In the following code:
subtype length_t is integer;
type time_t is new integer;

a : integer;
b : length_t;
c : time_t;

begin
  a := b;          -- OK
  a := c;          -- Error, type mismatch
  a := integer(c); -- OK, explicit conversion
time_t is defined to have a type that is not compatible with integer, even though its underlying representation is the same as the integer type. Mixing variables having types integer and time_t results in a compile time error.
The intended purpose for defining a ‘new’ type and creating variables having that type is to restrict operations on those variables to being with other variables having the same type, e.g., assignment and addition between any variables having type time_t is fine, but involving other types results in a compile time error (there are special rules that allow integer literals to be mixed in). I find that the errors flagged by this kind of checking are mostly very useful.
It is also possible to achieve the same kind of type checking in C++ using template metaprogramming, e.g., the SIunits library. In fact, using this technique enables C++ to support a much more general and user friendly range of ‘strong type’ functionality than is supported by the built-in Ada functionality (it is also possible to use general language functionality in Ada to implement the kind of checking possible in C++; prior to the 2012 Ada standard the checks occurred at runtime, but it now looks like there is a mechanism for doing them at compile time {because it is often possible to switch off runtime checks, some people consider them to be weaker than compile time checks}).
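As a rough illustration (this is a toy sketch of the general technique, not the SIunits library, and the names are made up), a distinct integer type can be built in C++ that behaves much like the Ada time_t above:

#include <cstdint>

// A tag-parameterized wrapper: each Tag produces a distinct type, so the
// compiler rejects accidental mixing, much like Ada's derived types.
template<typename Tag>
struct StrongInt {
    std::int64_t value;
    explicit StrongInt(std::int64_t v) : value(v) {}
    StrongInt operator+(StrongInt other) const { return StrongInt{value + other.value}; }
};

struct TimeTag {};                 // hypothetical tag
using Time = StrongInt<TimeTag>;   // the C++ analogue of "type time_t is new integer;"

int main() {
    Time c{42};
    std::int64_t a = 0;
    // a = c;              // compile time error: no implicit conversion out
    // c = Time{1} + 7;    // compile time error: no implicit conversion in
    a = c.value;           // the explicit escape hatch, like integer(c) in Ada
    c = c + Time{1};       // same-type arithmetic is fine
    return static_cast<int>(a % 128);
}

Dimension libraries such as SIunits generalize this idea, carrying the units (metres, seconds, etc.) in the template parameters so that mixing them incorrectly is also a compile time error.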
Fans of subranges (I dearly miss this feature when using C-like languages) can get their fix here.
Is there a rule that extra type strength points are given if a language contains explicit type creation syntax (Ada contains a menagerie of syntax and semantics for doing this kind of stuff), compared to languages that require the use of constructs having many other uses? I don’t see why there should be. The fact that template metaprogramming makes a lot of C++ developers’ heads hurt means that many will limit themselves to using what others have created, rather than growing project specific libraries; but since when have usability and frequency of use been a major issue in the weak/strong type debate?
The score so far is that C++ has lost points to Ada because of its greater support for implicit type conversions, but has held its ground everywhere else.
Can either language pick up any more points?
There is the culture angle. Ada developers have a culture of making use of the type checking functionality provided by the language; this is a skill that has to be learned, you get some type checking for free but the rest has to be designed into the code. C++ developers have a culture of only making use of the type checking functionality built into the language, i.e., most do not use add-on packages like SIunits.
I am not aware of any studies that have investigated the extent to which developers make use of type checking functionality; pointers to such studies welcome. If there is more ‘strongly typed’ C++ than Ada code out there it is only because there is a lot more C++ code out there.
It is my experience that culture and existing code do color developers’ position on where to draw the line in the weak/strong debate, but don’t affect relative language orderings.
The conclusion is that Ada is more strongly typed than C++, but how much more strongly typed remains an open question. Both languages require effort from the developer to make full use of the typing functionality that is available.
Undefined behavior is a design decision
Every few years or so some group of people in the C/C++ community start writing about the constructs specified as having undefined behavior in those languages. A topic that always seems to be skipped is why a language committee would choose to specify that the behavior in a particular case was undefined.
A quick refresher for readers on the definition of Undefined behavior, from the C Standard: “behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements”. The two key features are that the behavior applies when an error has occurred and any behavior whatsoever is permitted after one of these errors occurs. Examples of constructs that have undefined behavior are divide by zero, the result of an arithmetic operation on a signed value not being representable in its type (i.e., overflowing) and indexing an array outside of its defined bounds.
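A minimal sketch of two of the constructs just listed (what any given compiler actually does with them is, by definition, anybody’s guess):

#include <climits>

int main() {
    int big  = INT_MAX;
    int over = big + 1;   // signed overflow: undefined behavior
    int zero = 0;
    int div  = 1 / zero;  // divide by zero: undefined behavior
    return over + div;
}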
The point to note about all undefined behaviors is that the C/C++ language committee could have chosen to specify the behavior that a conforming implementation is required to support. Some language specifications do attempt to explicitly define the behavior for all constructs, e.g., Java, while other languages have a smaller set of undefined behaviors (e.g., Ada, which uses the term Bounded error instead of undefined behavior; there are 35 of them in Ada 2005). To understand why languages take these different approaches we need to look at the language design aims.
The design aims of C included it being implementable for any processor and for the generated code to be efficient (I’m not sure to what extent these might still be major design aims for C++). Computing systems come in all shapes and sizes, some with hundreds of bytes of memory and others with gigabytes, some raise exceptions when certain operations occur while others set processor status flags and others don’t do anything special.
A willingness to accept whatever behavior happens to occur, in an error situation, is the price that has to be paid for efficient execution on a wide range of disparate processors. The C/C++ designers were willing to pay this price while the Java designers were not, with the Ada designers willing to tolerate less variability than C/C++.
Undefined behavior need not be nasty behavior; an implementation could choose to generate a helpful message or try to recover from it.
There are C tools and compilers (when certain options are specified) that check, at runtime, for various kinds of undefined behavior. I am in the minority in having Boundschecker installed as my default C compiler (as the name suggests, it checks that array and pointer accesses to an object are within the defined bounds); for reasons I don’t understand, few C/C++ developers are willing to use tools like this. For production code I use a non-boundschecking compiler; I don’t know whether Ada programs are tested with the mandated bounds checking switched on and then have it switched off for the production version (this is what Pascal developers do, in my experience). Of course Java developers have no choice but to permanently live with checking turned on.
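For readers without access to Boundschecker, a sketch of the same idea using the sanitizer options available in gcc/clang (the file name and the exact report produced are assumptions; the -fsanitize options themselves are real):

// Compile and run with runtime checking instrumented in, e.g.:
//   g++ -g -fsanitize=address,undefined bounds.cpp -o bounds && ./bounds
// The instrumented binary reports the out-of-bounds read at runtime; a
// production build simply omits the -fsanitize options.
#include <cstdio>

int main(int argc, char **argv) {
    (void)argv;
    int a[4] = {0, 1, 2, 3};
    int i = argc + 3;           // 4 when run with no arguments: outside the bounds of a
    std::printf("%d\n", a[i]);  // flagged by the address sanitizer at runtime
    return 0;
}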
The number of companies that make a living selling runtime checking tools is a small percentage of the number of companies based on selling static analysis tools. There continues to be a steady stream of runtime checking tools appearing and quickly disappearing, but until developers start being sent to jail for faults in their code, I don’t foresee the market growing.
Optimizations that figure out when code need not be generated to perform a bounds check, because the access is known to be within bounds, are an active research area. These days the performance penalty is not so much executing the checking instructions as the disruption to the instruction pipeline caused by the branches that might be taken (if the bounds check fails).
The cost of all the checking required by Java is that the minimal permitted configuration requires at least 256K of memory (Oracle’s K virtual machine, used by the Java Micro Edition which is intended for embedded systems, also makes floating-point optional and allows implementations some freedom in how some constructs are handled). So the Java motto really ought to be “Write once, run anywhere with at least 256K and don’t depend on floating-point”.
I have heard stories of Ada code being liberally scattered with various forms of unchecked (how the developer tells the compiler not to do any runtime checking) but have not seen any empirical analysis (a study of goto usage in Ada did not have any trouble finding plenty of uses to analyze).
Memory capacity and commercial compiler development
When I started out in the compiler business in the 80s, many commercial compilers were originally written by one person. A very good person who dedicated himself (I have never heard of a woman doing this) to the job (i.e., minimal social life) could produce a commercially viable product for a non-huge language (e.g., Fortran, Pascal, C, etc, and not C++, Ada, etc) in 12-18 months. Companies who decide to develop a compiler in-house tend to use a team of people and take longer, because that is how they work; they don’t want to depend on one person, and anyway such a person might not be available to them.
Commercially viable compiler development stayed within the realm of an individual effort until sometime in the early 90s. What put the single individual out of business was the growth in computer memory capacity into the hundreds of megabytes and beyond. Compilers have to be able to run within the limits of developer machines; you won’t sell many compilers that require 100M of memory if most of your customers don’t have machines with 100M+ of memory.
Code optimizers eat memory and this prevented many optimizations that had been known about for years appearing in commercial products. Once the machines used for software development commonly contained 100M+ of memory compiler vendors could start to incorporate these optimizations into their products.
Improvements in processor speed also helped. But developers are usually willing to let the compiler take a long time to optimize the code going into a final build, provided development compiles run at a reasonable speed.
The increase in memory capacity created the opportunity for compilers to improve and when some did improve they made it harder for others to compete. Suddenly an individual had to spend that 12-18 months, plus many more months implementing fancy optimizations; developing a commercially viable compiler outgrew the realm of individual effort.
Some people claim that the open source model was the primary driver in killing off commercial C compiler development. I disagree. My experience of licensing compiler source was that companies preferred a commercial model they were familiar with and reacted strongly against the idea of having to make available any changes they made to the code of an open source program. GCC (and recently llvm) succeeded because many talented people contributed fancy optimizations and these could be incorporated because developer machines contained lots of memory. If storage had not dramatically increased gcc would probably not be the market leader it is today.