Learning a cpu’s instruction set
A few years ago I wrote about the possibility of secret instruction sets making a comeback and the minimum information needed to write a code generator. A paper from the sporadic (i.e., they don’t release umpteen slices of the same overall paper), but always interesting, group at Stanford describes a tool that goes a long way to solving the secret instruction set problem; stratified synthesis learns an instruction set, starting from a small set of known instructions.
After feeding in 51 base instructions and 11 templates, 1,795.42 instruction ‘formulas’ were learned (119.42 of these involved 8-bit constants, with each of the 256 variants counted as 1/256th of an instruction), out of a maximum of 3,683 possible instructions (depending on how you count instructions).
As well as discovering ‘new’ instructions, they also discovered bugs in the Intel 64 and IA-32 Architectures Software Developer Manuals. In my compiler writing days, bugs in cpu documentation were a pet hate (they cause huge amounts of time to be wasted).
The initial starting information used is rather large, from the perspective of understanding the instruction set of an unknown cpu. I’m sure others will be working to reduce the necessary startup information needed to obtain useful results. The Intel Management Engine is an obvious candidate for investigation.
Vendors sometimes add support for instructions without publicizing them, and sometimes particular bit patterns happen to do something sensible in a particular version of a design, rather than being treated as an illegal opcode. Your journey down the rabbit hole starts here.
On a related note, I continue to be amazed that widely used disassemblers fail to correctly handle surprisingly many documented x86 opcodes; see benchmarks from 2010 and 2016.
First use of: software, software engineering and source code
While reading some software related books/reports/articles written during the 1950s, I suddenly realized that the word ‘software’ was not being used. This set me off looking for the earliest use of various computer terms.
My search process consisted of running pdfgrep over my collection of pdfs of documents from the 1950s and 60s, and looking in the index of the few old computer books I still have.
Software: The Oxford English Dictionary (OED) cites an article by John Tukey published in the American Mathematical Monthly during 1958 as the first published use of software: “The ‘software’ comprising … interpretive routines, compilers, and other aspects of automative programming are at least as important to the modern electronic calculator as its ‘hardware’.”
I have a copy of the second edition of “An Introduction to Automatic Computers” by Ned Chapin, published in 1963, which does a great job of defining the various kinds of software. Earlier editions were published in 1955 and 1957. Did these earlier editions also contain various definitions of software? I cannot find any reasonably priced copies on the second-hand book market. Do any readers have a copy?
Update: I now have a copy of the 1957 edition of Chapin’s book. It discusses programming, but there is no mention of software.
Software engineering: The OED cites a 1966 “letter to the ACM membership” by Anthony A. Oettinger, then ACM President: “We must recognize ourselves … as members of an engineering profession, be it hardware engineering or software engineering.”
The June 1965 issue of COMPUTERS and AUTOMATION, in its Roster of organizations in the computer field, has the list of services offered by Abacus Information Management Co.: “systems software engineering”, and by Halbrecht Associates, Inc.: “software engineering”. This pushes the first use of software engineering back by a year.
Source code: The OED cites a 1965 issue of Communications of the ACM: “The PUFFT source language listing provides a cross-reference between the source code and the object code.”
The December 1959 Proceedings of the EASTERN JOINT COMPUTER CONFERENCE contains the article: “SIMCOM – The Simulator Compiler” by Thomas G. Sanborn. On page 140 we have: “The compiler uses this convention to aid in distinguishing between SIMCOM statements and SCAT instructions which may be included in the source code.”
Update: The October 1956 issue of Computers and Automation contains an extensive glossary. It does not have any entries for software or source code.
Running pdfgrep over the archive of documents on bitsavers would probably throw up all manner of early uses of software-related terms.
The paper: What’s in a name? Origins, transpositions and transformations of the triptych Algorithm – Code – Program is a detailed survey of the use of three software terms.
Computer books your great grandfather might have read
I have been reading two very different computer books written for a general readership: Giant Brains, or Machines that Think, published in 1949 (with a retrospective chapter added in 1961) and LET ERMA DO IT, published in 1956.
‘Giant Brains’ by Edmund Berkeley, was very popular in its day.
Berkeley marvels at a computer performing 5,000 additions per second; performing all the calculations in a week that previously required 500 human computers (i.e., people using mechanical adding machines) working 40 hours per week. His mind staggers at the “calculating circuits being developed” that can perform 100,000 additions a second; “A mechanical brain that can do 10,000 additions a second can very easily finish almost all its work at once.”
The chapter discussing the future, “Machines that think, and what they might do for men”, sees Berkeley struggling for non-mathematical applications; a common problem with all new inventions. Automatic translator and automatic stenographer (a typist who transcribes dictation) are listed. There is also a chapter on social control, which is just as applicable today.
This was the first widely read book to promote Shannon‘s idea of using the algebra invented by George Boole to analyze switching circuits symbolically (THE 1940 Master’s thesis).
The ‘ERMA’ book paints a very rosy picture of the future with computer automation removing the drudgery that so many jobs require; it is so upbeat. A year later the USSR launched Sputnik and things suddenly looked a lot less rosy.
Added two more books
Cybernetics Or Communication And Control In The Animal And The Machine by Norbert Wiener
The Organization Of Behavior A Neuropsychological Theory by D. O. Hebb
and another
The Preparation of Programs for an Electronic Digital Computer by Wilkes, Wheeler, Gill (the link is to the 1957 second edition, not the 1951 first edition)
Was a C90, C99, or C11 compiler used?
How can a program figure out whether it has been compiled with a C90, C99 or C11 compiler?
Support for the // style of commenting was added in C99.
Support for Unicode string literals (e.g., U"Hello World") was added in C11.
Putting these together we get the following:
#include <stdio.h>

#define M(U) sizeof(U"s"[0])

int main(void)
{
switch (M("")*2 //**/ 2
       )
   {
   case 1: printf("C90\n"); break;
   case 2: printf("C99\n"); break;
   case 8: printf("C11\n"); break;
   }
}

In C90, // does not start a comment, so the /**/ comment disappears and the controlling expression is M("")*2 / 2; the macro parameter U is replaced by "", giving sizeof("" "s"[0]) = 1, so the result is 1. In C99, everything after // is a comment, leaving M("")*2 = 2. In C11, U"s" is lexed as a single char32_t string-literal token, so the macro parameter U is not substituted; sizeof then yields the size of a char32_t (typically 4), giving 8.
Almost all published analysis of fault data is worthless
Faults are the subject of more published papers than any other subject in empirical software engineering. Unfortunately, over 98.5% of these fault related papers are at best worthless and at worst harmful, i.e., make recommendations whose impact may increase the number of faults.
The reason most fault papers are worthless is the data they use and the data they don’t use.
The data used
Data on faults in programs used to be hard to obtain; a friend in a company that maintained a fault database was needed. Open source changed this. Now public fault tracking systems are available containing tens, or even hundreds, of thousands of reported faults. Anybody can report a fault, and unfortunately anybody does; there is a lot of noise mixed in with the signal. One study found that 43% of reported faults were enhancement requests; the same underlying fault is reported multiple times (most duplicates eventually get marked as such, at the cost of much wasted time); and …
Fault tracking systems don’t always contain all known faults. One study found that the really important faults are handled via email discussion lists, i.e., they are important enough to require involving people directly.
Other problems with fault data include: biased reporting of problems, reported problems caused by a fault in a third-party library, and reported problems that are intermittent or not reproducible.
Data cleaning is the essential first step that many of those who analyse fault data fail to perform.
The data not used
Users cause faults, i.e., if nobody ever used the software, no faults would be reported. This statement is as accurate as saying: “Source code causes faults”.
Reported faults are the result of software being used with a set of inputs that causes the execution of some sequence of tokens in the source code to have an effect that was not intended.
The number and kind of reported faults in a program depends on the variety of the input and the number of faults in the code.
Most fault related studies do not include any user related usage data in their analysis (the few that do really stand out from the crowd), which can lead to very wrong conclusions being drawn.
User usage data is very hard to obtain, but without it many kinds of evidence-based fault analysis are doomed to fail (giving completely misleading answers).
The first compiler was implemented in itself
I have been reading about the world’s first actual compiler (i.e., not a paper exercise), described in Corrado Böhm’s PhD thesis (French version from 1954, an English translation by Peter Sestoft). The thesis, submitted in 1951 to the Federal Technical University in Zurich, takes some untangling; when you are inventing a new field, ideas tend to be expressed using existing concepts and terminology, e.g., computer peripherals are called organs and registers are denoted by the symbol .
Böhm had worked with Konrad Zuse and must have known about his language, Plankalkül. The language also has an APL feel to it (but without the vector operations).
Böhm’s language does not have a name, his thesis is really about translating mathematical expressions to machine code; the expressions are organised by what we today call basic blocks (Böhm calls them groups). The compiler for the unnamed language (along with a loader) is written in itself; a Java implementation is being worked on.
Böhm’s work is discussed in Donald Knuth’s early development of programming languages, but there is nothing like reading the actual work (if only in translation) to get a feel for it.
Update (3 days later): Correspondence with Donald Knuth.
Update (3 days later): A January 1949 memo from Haskell Curry (he of currying fame and, more recently, of Haskell association) also uses the term organ. Might we claim, based on two observations on different continents, that it was in general use?
The shadow of the input distribution
Two things need to occur for a user to experience a fault in a program:
- a fault has to exist in the code,
- the user has to provide input that causes program execution to include the faulty code in a way that exhibits the incorrect behavior.
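The interaction between these two conditions can be sketched with a toy simulation (all numbers below are invented for illustration; the point is that the set of faults users experience depends on the input distribution, not just on the faults present in the code):

```python
import random

random.seed(42)

NUM_SITES = 1000                                   # locations in the code
FAULTY = set(random.sample(range(NUM_SITES), 50))  # sites containing a fault

def faults_experienced(num_runs, hot_fraction):
    """Run the 'program' num_runs times; each run's input reaches one site.
    With probability hot_fraction the input is a common one, exercising only
    the first 100 sites; otherwise it lands anywhere in the code."""
    seen = set()
    for _ in range(num_runs):
        if random.random() < hot_fraction:
            site = random.randrange(100)        # common inputs: small hot region
        else:
            site = random.randrange(NUM_SITES)  # rare inputs: anywhere
        if site in FAULTY:
            seen.add(site)                      # fault exists AND was triggered
    return seen

skewed  = faults_experienced(10_000, hot_fraction=0.9)
uniform = faults_experienced(10_000, hot_fraction=0.0)
print(len(skewed), len(uniform))  # same code, same faults, different experience
```

With a heavily skewed input distribution most runs revisit the same few faults, while a uniform one exposes many more distinct faults; the code and the faults it contains are identical in both cases.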
Data on the distribution of user input values is extremely rare, and we are left having to look for the shadows that the input distribution creates.
Csmith is a well-known tool for generating random C source code. I spotted an interesting plot in a compiler fuzzing paper and Yang Chen kindly sent me a copy of the data. In compiler fuzzing, source code is automatically generated and fed to the compiler; various techniques are used to figure out when the compiler gets things wrong.
The plot below is a count of the number of times each fault in gcc has been triggered (code+data). Multiple occurrences of the same fault are experienced because the necessary input values occur multiple times in the generated source code (usually in different files).
The green line is a fitted regression model, it’s a bi-exponential, i.e., the sum of two exponentials (the straight lines in red and blue).
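For the curious, a bi-exponential is just a model of the form a1*exp(-b1*x) + a2*exp(-b2*x). Here is a minimal sketch of fitting one, using made-up data and a crude grid search (the fit behind the actual plot presumably used a proper nonlinear least-squares routine; every number below is invented):

```python
import math
import random

random.seed(1)

def biexp(x, a1, b1, a2, b2):
    # sum of a fast-decaying and a slow-decaying exponential
    return a1 * math.exp(-b1 * x) + a2 * math.exp(-b2 * x)

TRUE = (100.0, 0.5, 20.0, 0.05)  # invented 'true' parameters
xs = list(range(60))
ys = [biexp(x, *TRUE) * random.uniform(0.9, 1.1) for x in xs]  # noisy data

def sse(params):
    # sum of squared errors of the model against the data
    return sum((biexp(x, *params) - y) ** 2 for x, y in zip(xs, ys))

# crude grid search: b1 ranges over fast decay rates, b2 over slow ones
best = min(
    ((a1, b1, a2, b2)
     for a1 in (60.0, 80.0, 100.0, 120.0)
     for b1 in (0.2, 0.3, 0.4, 0.5, 0.6)
     for a2 in (10.0, 20.0, 30.0)
     for b2 in (0.02, 0.05, 0.08, 0.1)),
    key=sse)
print(best)
```

On a log scale each exponential is a straight line; the fast component dominates the left of the plot and the slow component the right, which is what makes the two straight lines separable in the original figure.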
The obvious explanation for this bi-exponential behavior (explanations invented after seeing the data can have the flavor of just-so stories, which is patently not true here 🙂) is that one exponential is driven by the presence of faults in the code and the other exponential is driven by the way in which Csmith meanders over the possible C source.
So, which exponential is generated by the faults and which by Csmith? I’m still trying to figure this out; suggestions welcome, along with alternative explanations.
Is the same pattern seen in duplicates of user-reported faults? It is in the small amount of data I have; more data welcome.
Christmas books for 2017
Some suggestions for books this Christmas. As always, the timing of books I suggest is based on when they reach the top of my books-to-read pile, not when they were published.
“Life ascending: The ten great inventions of evolution” by Nick Lane. The latest thinking (as of 2010) on the major events in the evolution of life. Full of technical detail, very readable, and full of surprises (at least for me).
“How buildings learn” by Stewart Brand. Yes, I’m very late on this one. So buildings are just like software: people want to change them in ways not planned by their builders, they get put to all kinds of unexpected uses, some of them cannot keep up and get thrown away and rebuilt, while others age gracefully.
“Dead Man Working” by Cederström and Fleming is short and to the point (having an impact on me earlier in the year), while “No-Collar: The humane workplace and its hidden costs” by Andrew Ross is longer (the first half is general, the second a specific instance involving one company). Both present a coherent view of work in the knowledge economy.
If you are into technical books on the knowledge economy, have a look at “Capitalism without capital” by Haskel and Westlake (the second half meanders off, covering alleged social consequences), and “Antitrust law in the new economy” by Mark R. Patterson (existing antitrust thinking is having a very hard time grappling with knowledge-based companies).
If you are into linguistics, then “Constraints on numerical expressions” by Chris Cummins (his PhD thesis is free) provides insight into implicit assumptions contained within numerical expressions (of the human conversation kind). A must read for anybody interested in automated fact checking.
ISO/IEC JTC 1/SC 42 Artificial intelligence
What has been preventing Artificial Intelligence being a success? Yes, you guessed it, until now ISO has not had an SC (Subcommittee) dealing with AI. Well, the votes are in and JTC 1/SC 42 Artificial intelligence is go.
Countries pay ISO to be members of an SC, and the taxpayers of Austria, Canada, Finland, Germany, Ireland, Switzerland and the United States have the pleasure of being founding member countries of SC42.
What standards/technical-reports are those attending SC42 meetings going to be working on?
The two document titles I have seen so far are: “Artificial Intelligence Concepts and Terminology” and “Framework for Artificial Intelligence (AI) Systems Using Machine Learning (ML)”.
I hope the terminology document arrives in plenty of time, before the machines take over. The ISO Standard for Year 2000 terminology arrived in December 1999 (there was a flurry of emails desperately trying to row-back on this document).
Want to join up? Wael William Diab is the chairperson.
Vanity project or real research?
I gave a talk at the CREST Open Workshop yesterday. Many of those attending and speaking were involved in empirical work of one kind or another. My experience is that researchers involved in empirical work, in software engineering, feel the need to justify using this approach to research, because it is different from what many others in the field do. I want to reverse this perception; those not doing empirical work are the ones that should feel the need to justify their approach to research.
Evidence obtained from experiments and measurements is the basis of the scientific method.
I started by contrasting a typical software engineering researcher’s view of their work (both images from Wikipedia under a Creative Commons Attribution-ShareAlike License):
with a common industry view of academic researchers:
The reputation of those doing evidence based research is being completely overshadowed by those who use ego and bluster to promote their claims. We needed an effective label for work that is promoted using ego and bluster, and I proposed vanity projects (the work done by the ego and bluster crowd does not deserve to be referred to as research). Yes, there are plenty of snake-oil salesmen in industry, but that is another issue.
Vanity projects being passed off as research should be named and shamed.
Next time you are in the audience listening to claims made by a speaker about the results of his/her research, that are not backed up by experiments or measurements, ask them why they decided to pursue a vanity project rather than proper research.