Minimum information needed for writing a code generator
If a compiler writer is faced with writing a back-end for an undocumented processor, what is the minimum amount of information that needs to be reverse engineered?
It is possible to implement a universal computer using a cpu that has a single instruction, subtract-and-branch-if-lessthan-or-equal-to-zero
. This is all very well, but processors based on using a single instruction are a bit thin on the ground and the processor to hand is likely to support a larger number of simpler instructions.
A subtract-and-branch-if-lessthan-or-equal-to-zero
instruction could be implemented on a register based machine using the appropriate sequence of load-from-memory
, subtract-two-registers
, store-register-to-memory
and jump-if-subtract-lessthan-or-equal
instructions. Information about other instructions, such as add and multiply, would be useful for code optimization. (The Turing machine model of computation is sufficiently far removed from how most programs and computers operate that it is not considered further.)
Are we done? In theory yes, in practice no. A couple pf practical problems have been glossed over; how do source literals (e.g., "Hello World"
) initially get written to storage, where does the storage used by the program come from and what is the file format of an executable?
Literals that are not created using an instruction (most processors have instructions for loading an integer constant into a register) are written to a part of the executable file that is read into storage by the loader on program startup. All well and good if we know enough about the format of an executable file to be able to correct generate this information and can get the operating system to put in the desired storage location. Otherwise we have to figure out some other solution.
If we know two storage locations containing values that differ by one a sequence of instructions could subtract one value from the other to eventually obtain any desired value. This bootstrap process would speed up as a wider range of know value/location pairs was built up.
How do we go about obtaining a chunk of storage? An executable file usually contains information on the initial amount of storage needed by a program when it is loaded. Calls to the heap manager are another way of obtaining storage. Again we need to know where to write the appropriate information in the executable file.
What minimum amount of storage might be expected to be available? A program executing within a stack/heap based memory model has a default amount of storage allocated for the stack (a minimum of 16k I believe under Mac OS X or iPhone OS). A program could treat the stack as its storage. Ideally what we need is the ability to access storage via an offset from the stack pointer, at worse we would have to adjust the stack pointer to the appropriate offset, pop the value into a register and then reset the stack pointer; storing would involve a push.
Having performed some calculation we probably want to communicate one or more values to the outside world. A call to a library routine, such as printf
, needs information on the parameter passing conventions (e.g., which parameters get stored in which registers or have storage allocated for them {a function returning a structure type usually has the necessary storage allocated by the calling function with the address passed as an extra parameter}) and the location of any return value. If ABI information is not available a bit of lateral thinking might be needed to come up with an alternative output method.
I have not said anything about making use of signals and exception handling. These are hard enough to get right when documentation is available. The Turing machine theory folk usually skip over these real-world issues and I will join them for the time being.
Criteria for knowing a language
What does it mean for somebody to claim to know a computer language? In the commercial world it means the person is claiming to be capable of fluently (i.e., only using knowledge contained in their head and without having to unduly ponder) reading, and writing code in some generally accepted style applicable to that language. The academic world generally sets a much lower standard of competence (perhaps because most of its inhabitants leave before any significant expertise is acquired). If I had a penny for every recent graduate who claimed to know a language and was incapable of writing a program that read in a list of integers and printed their sum (I know companies that set tougher problems, but they do not seem to have higher failure rates), I would be a rich man.
One experiment asked 21 postgraduate and academic staff which of the following individuals they would regard as knowing Java:
The results were:
_ NO YES
A 21 0
B 18 3
C 16 5
D 8 13
E 0 21
These answers reflect the environment from which the subjects were drawn. When I wrote compilers for a living, I did not consider that anybody knew a language unless they had written a compiler for it, a point of view echoed by other compiler writers I knew.
I’m not sure that commercial developers would be happy with answer (E), in fact they could probably expand (E) into five separate questions that tested the degree to which a person was able to combine various elements of the language to create a meaningful whole. In the commercial world, stage (E) is where people are expected to start.
The criteria used to decide whether somebody knows a language depends on which group of people you talk to; academics, professional developers and compiler writers each have their own in-group standards. In a sense the question is irrelevant, a small amount of language knowledge applied well can be used to do a reasonable job of creating a program for most applications.
Recent Comments