Maximizing profit selling C compilers
Upgrades are the lifeblood of established software companies. I recently came across the paper Information Goods Upgrades: Theory and Evidence, and what caught my attention was one of the datasets the author had collected: the first purchase and upgrade prices of various PC C/C++ compilers between 1987 and 1997. What’s more, the author still had the data and was willing to share it, yay!
By the early 1990s I was no longer actively involved in C compilers, but was involved in C static analysis on non-PC platforms. So my view of the 1990s C compiler market is a bit sketchy.
Compiler companies, like other companies, want to maximize their revenue and THE decision that has to be made is the price to charge for a compiler (compiler writers are also developers, and hate paying high prices for compilers; those that failed to charge enough for their product soon went bust). My recollection is that compiler pricing was based around the spending authority of a senior development engineer, and on what other companies were charging. Just under £500 was common, with a few companies failing to make a go of selling around the £100 mark. Zorland (later renamed Zortech) gained huge market share in the mid/late 1980s selling a great C compiler for £29, but a few years later were selling a C++ compiler for a lot more.
To some extent each compiler vendor operates in a monopoly market; developers write code that depends on the features supported by the compiler used and it can be very expensive to port code to a different compiler. How much can vendors charge for a compiler upgrade? Selling the product at a high price provides a rationale for higher priced upgrades (the percentage discount will look good). I wonder how many vendors continued to advertise a high price product just to justify a high upgrade price.
Management always feel an affinity for the OS vendor, and Microsoft sold a C compiler and later a C++ compiler. Both were awful and, product quality wise, easy to compete against. Microsoft had to have their own compiler for strategic internal use, with sales to developers being insignificant compared to sales of Word and Excel (Microsoft compiler people I talked to at the time said they had thought of giving the compiler away for free, and later it was possible to essentially get the compiler for free by joining the various developer programs). Over time Microsoft improved, and compiler companies found easier ways to make money, so the number of compiler vendors dropped to almost one (a company selling C compiler validation suites once told me in the late 1990s that they had sold over 150 copies; someone has to be serious about their compiler to shell out $5,000-$10,000 for software to test it).
By the late 1980s the C compiler market was quite saturated and vendors needed something else to sell. IDEs and debuggers were popular choices. Then along came C++. Yay! A new language meant a new compiler to sell. Compiler vendors’ need for a new compiler to sell is a significantly underestimated factor in C++ gaining traction in developer mind share.
A rarely talked about compiler revenue stream is being paid to port a compiler to a new platform (either because there is an important application that depends on it or because the platform does not yet have a C compiler). This is the market where gcc had its first successes. It’s hard to say whether gcc spread because these niche platforms spread, or because gcc cut off revenue to compiler vendors, making remaining in the compiler market unattractive to them.
I don’t have any sales figures for any ‘mass’ market C compilers or compilers for any languages. Can any readers help out? In fact any data on compiler sales would be most welcome.
The compiler/interpreter distinction
What is the difference between compiled and interpreted programs?
In the good-old-days the distinction was easy to make: compiled code is executed by hardware while interpreted code is executed by software.
These days it can be very difficult to decide whether a program will be executed by hardware or software, and in some cases both may occur. More complicated cpus implement some of their instructions in micro-code (software control of very low level hardware resources) and the virtual machines specified for software execution can be implemented in hardware (an interesting project for a group of talented students in their summer holidays wanting to learn about ASICs).
Some people make a distinction based on the abstraction level of the cpu specification, e.g., a very high level of abstraction means the code must be interpreted. In practice the implementation of a cpu specification in hardware or software is an economic decision (software may be slow, but it’s a lot cheaper to implement).
I think there is a compiler/interpreter distinction, but the difference is not about how code is executed (the hardware/software distinction is a convenient difference that is easy to explain).
The compiler/interpreter distinction is a difference of responsibility. Compilers treat programs like the Spartans treated their children: they are bundled into a file of the appropriate format and left for the operating system to load into memory and point the cpu at the first instruction (a cpu’s one interest is executing the sequence of instructions pointed to by the program counter). Interpreters are more like doting nannies, organizing the provision of memory and on call to provide access to the desired resources.
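To make the difference concrete, below is a minimal sketch, in C, of the dispatch loop at the heart of a bytecode interpreter (the one-byte opcodes are invented for illustration). The ‘program counter’ is a plain C pointer and the stack is memory the interpreter itself provides; nothing is handed over to the operating system and cpu in the way a compiled binary is:

```c
#include <stdio.h>

/* One-byte opcodes for a hypothetical stack machine. */
enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

void interpret(const unsigned char *code)
{
    int stack[64];                  /* memory organized by the interpreter */
    int sp = 0;                     /* stack pointer: a plain C variable */
    const unsigned char *pc = code; /* 'program counter': a plain C pointer */

    for (;;)
        switch (*pc++) {
        case OP_PUSH:  stack[sp++] = *pc++;            break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[sp-1]);    break;
        case OP_HALT:  return;
        }
}

int main(void)
{
    /* Push 2, push 3, add, print; the interpreter, not the cpu,
       decides what each byte means. */
    const unsigned char program[] = { OP_PUSH, 2, OP_PUSH, 3,
                                      OP_ADD, OP_PRINT, OP_HALT };
    interpret(program);
    return 0;
}
```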
Sometimes a language is classified as an interpreted language. There is no such thing as an interpreted language, only languages which are much more easily implemented using an interpreter than a compiler.
The performance of the Spartan approach may be very desirable, but the cost of achieving it can be very high.
Actively maintained production compilers for middle-age languages
The owners of the Borland C++ compiler have stopped maintaining it. So we are now down to, by my counting, four different production quality C++ compilers still being actively maintained (Visual C++ {the command line cl.exe, not the interactive IDE compiler}, GCC, LLVM and EDG); lots of companies repackage EDG and don’t talk about it.
How many production compilers for other middle-age languages are still being actively maintained?
Ada I think is now down to one (GNAT; I’m not sure of the status of what was the Intermetrics compiler).
Cobol has two+ that I know of (Micro Focus and Fujitsu {was ACUCobol}); I’m not sure how many internal compilers IBM has, some of which are really Micro Focus.
Fortran probably needs more than one hand to count its compilers. Nothing like having large engineering applications using the language features supported by your compiler to keep the maintenance fees rolling in.
C still has lots of compilers (a C validation suite vendor told me many years ago that they had over 150 customers). Embedded processors can be a very tough target for the general purpose algorithms used in GCC and LLVM, so vendors with hand crafted compilers can still eke out a living.
Perl has one (which I find surprising).
R has one, but like Cobol it is not a fashionable language in compiler writing circles. Over the last couple of years there have been a few ‘play’ implementations and rumors of people creating a new production quality implementation.
Lisp has one or millions, depending on how you view dialects; or there could be a million people, each with a different view on the identity of the one.
Snobol-4 still has one (yes, I am a fan of this language).
There are lots of languages which have not yet reached middle-age, so it’s too soon to start counting how many actively supported compilers they still have in production use.
Survey of instruction selection
A well written survey of compiler instruction selection has just become available, the first major survey of this topic in 30 years! The academic outlook of the author is given away by the evaluation “…the technique appears to have had very limited impact as the citation count for the paper is low.” and coverage for the last 10 years does tend to thin out (but that could fill another 100 pages). Whatever your interest in compilers, this survey is well worth a read.
Anybody reading a compiler book could be forgiven for thinking that instruction selection was a minor issue; Gabriel Hjort Blindell counted 160 pages devoted to the topic out of 4,600 pages in seven well known compiler books. In a production compiler it is the parsing and semantics that consume 3% of the code, with optimization and code generation making up the other 97%.
A 100 page survey of register allocation is also overdue (20 pages is a bit short).
Instruction selection is one quarter of code generation, another quarter being register allocation and the remaining half being how these two are woven together (Hjort Blindell lists instruction scheduling as a third component, and we could all argue for hours about whether this is another optimization, something that is spread over instruction selection/register allocation, or a distinct component).
For a given choice of registers there are algorithms that will select the optimal code and for a given sequence of code there are algorithms that will select the optimal registers to use. Papers covering the optimal selection of both registers and instructions are thin on the ground; this is something of a black art that is picked up by building a production compiler.
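To give a concrete flavor of the problem, the following is a minimal sketch of greedy (maximal munch) tree-pattern instruction selection; the node kinds, the emitted ‘instructions’ and the virtual register numbering are all invented for illustration. Production selectors (e.g., the tree parsing/BURS approaches covered in the survey) cost and compare overlapping patterns rather than taking the first match:

```c
#include <stdio.h>

enum kind { CONST, ADD, MUL };

struct node {
    enum kind kind;
    int value;                 /* used when kind == CONST */
    struct node *left, *right; /* used when kind == ADD or MUL */
};

static int next_reg = 0;

/* Emit instructions for the tree rooted at n; return the virtual
   register holding the result. */
int select_insns(struct node *n)
{
    int r = next_reg++;

    /* Bigger pattern first: ADD(x, CONST) -> one add-immediate instruction. */
    if (n->kind == ADD && n->right->kind == CONST) {
        int x = select_insns(n->left);
        printf("addi r%d, r%d, %d\n", r, x, n->right->value);
        return r;
    }
    switch (n->kind) {
    case CONST:
        printf("li   r%d, %d\n", r, n->value);
        break;
    case ADD: {
        int a = select_insns(n->left), b = select_insns(n->right);
        printf("add  r%d, r%d, r%d\n", r, a, b);
        break;
    }
    case MUL: {
        int a = select_insns(n->left), b = select_insns(n->right);
        printf("mul  r%d, r%d, r%d\n", r, a, b);
        break;
    }
    }
    return r;
}

int main(void)
{
    /* The tree for (10*20)+1; the +1 gets folded into an addi by the
       pattern above. */
    struct node a   = { CONST, 10, NULL, NULL };
    struct node b   = { CONST, 20, NULL, NULL };
    struct node mul = { MUL,    0, &a,   &b };
    struct node one = { CONST,  1, NULL, NULL };
    struct node add = { ADD,    0, &mul, &one };
    select_insns(&add);
    return 0;
}
```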
Verified compilers and soap powder advertising
There’s a new paper out claiming to be about a formally-verified C compiler; it even states a Theorem about its abilities! If this paper appeared as part of a soap powder advert, the Advertising Standards Authority would probably require clarification of the claims. What clarifications might appear in the small print tucked away at the bottom of the ad?
- C source code is not verified directly, it is first translated to the formal notations used by the verification system; the software that performs this translation is assumed to be correct.
- The CompCert system may successfully translate programs containing undefined behavior. Any proof statements made about such programs may not be valid.
- The support tools are assumed to be correct; primarily the Coq proof assistant, which is written in OCaml.
- The CompCert system makes decisions about implementation dependent behaviors and any proofs only apply in the context of these decisions.
- The CompCert system makes decisions about unspecified behaviors and any proofs only apply in the context of these decisions.
Some notes on the small print:
The C source translator used by CompCert rarely gets mentioned in any of the published papers; what was done to check its accuracy (I have previously discussed some options)? Presumably the developers who wrote it tried very hard to make sure they did a good job, just like the authors of f2c, a Fortran to C translator, did. Connecting f2c as a front-end of the CompCert system gives us a verified Fortran compiler! I think the f2c translator is much more likely to be correct than the CompCert C source translator; it has been used by a lot more people, has processed a lot more source, and has been maintained over a longer period.
When they encounter undefined behavior in source code, production C compilers sometimes generate code that has very unexpected behavior. Using the CompCert system will not avoid unexpected behavior in these situations; CompCert simply washes its hands of this kind of code and says all bets are off.
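A concrete example of the kind of surprise involved (signed integer overflow is undefined behavior in C, so an optimizing compiler is entitled to assume it never happens):

```c
#include <stdio.h>
#include <limits.h>

/* An optimizing compiler may reason that x + 1 cannot wrap (overflow is
   undefined behavior), conclude the test is always false, and quietly
   reduce this function to 'return 0', removing the overflow check. */
int will_overflow(int x)
{
    return x + 1 < x;
}

int main(void)
{
    /* May print 1, or 0, or do something else entirely: all bets are off. */
    printf("%d\n", will_overflow(INT_MAX));
    return 0;
}
```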
Proving the support tools correct would simply move the assumption of correctness to a different set of tools. I am not aware of any major effort to test whether the Coq system behaves as intended, but have not read all the papers describing it (the list of reported faults does not appear to be publicly available); bugs have been found in the OCaml implementation.
Like all compilers that generate code, CompCert has to make implementation dependent decisions and select one of the possible unspecified behaviors. The C-Semantics tool generates all unspecified behaviors, rather than just one.
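A minimal example of such an unspecified behavior: the order in which the operands of + are evaluated is not specified by the C Standard, so a conforming compiler (CompCert included) gets to pick one of the possible orders:

```c
#include <stdio.h>

int g(void) { printf("g "); return 1; }
int h(void) { printf("h "); return 2; }

int main(void)
{
    /* The operand evaluation order of + is unspecified: a conforming
       compiler may print "g h 3" or "h g 3"; C-Semantics reports both. */
    printf("%d\n", g() + h());
    return 0;
}
```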
Only compiler vendor customers, not its users, count
The hardest thing about working on compilers is getting somebody to pay you to do it (it’s a close-run race against having the cpu instructions chop and change under you during initial development, but that’s another story). The major shift of the compiler vendor business model from proprietary to open source has significant implications for users of compilers. Note I said user, not customer; only one of them pays money. Under the commercial model there was usually a very direct connection between compiler user and customer (even in large organizations, users, rather than the manager who makes the purchase decision, are often regarded by vendors as the customer), while under the Open source model most users are not customers (paying money for a distribution does not make you a customer of the people maintaining the compiler, who probably don’t receive any of the money you spent).
Like all good businesses, compiler vendors don’t want to make their customers unhappy. There is one way guaranteed to make all customers so unhappy that they will remember the experience for years: ship a new compiler release that breaks their existing code (this usually happens because there is a previously undetected bug in the code or because use is being made of an implementation defined/undefined part of the language {the compiler gets to decide what to do when it encounters such code}). Not breaking existing customer code is priority ONE in any commercial compiler development group.
Proprietary vendors have so many customers that it’s almost impossible for them to know in advance what changes will break existing code, and the only option is to be ultra conservative about adding new code optimizations (new optimizations can so easily change how source containing undefined behavior is processed). Ultra conservative is the polite term; management paranoia would be more accurate. There is another advantage to vendors in not breaking their customers’ code: they are protected against competition from new market entrants; a new vendor with a shiny go-faster compiler, doing all the optimizations the existing vendor was not willing to do in case they broke existing code, will quickly find out that the performance improvements they offer are rarely big enough to tempt potential customers to switch compilers. Really, the only time companies switch compiler is when they have to port to a new platform to make a sale, or their existing vendor goes bust.
Open source vendors (e.g., those commercially involved in support/maintenance of gcc or llvm) have relatively few customers (e.g., big companies paying them lots of money for specific reasons) and, as always, these customers want existing code to continue to work. If the customer is paying for a code generator for a previously unsupported processor, then there probably isn’t any existing code for that processor; it is a fact of life that porting source to a new processor always involves work. Some Linux distributors (e.g., SUSE and Red Hat) are customers in the sense that they pay the salary of developers who spend a lot of their time on compiler maintenance/upgrades and presumably work to try and ensure that the code in their respective Linux distributions does not get broken.
Compiler users who are not customers don’t count on the code breakage front (well, count for very, very little; if an update broke lots of different developers’ code and enough fuss was made, there might well be an update that unbroke the previous one).
What can a user do if code that used to work ok is broken when compiled with a later version of the compiler? The obvious answer is to continue using the older version that produces the desired behavior; fixing the code causing the problem is a better answer (but might involve a lot of work). There is no point in flaming the compiler developers; you are not contributing towards their upkeep, and Open source does not give users the consideration that a customer enjoys.
The oldest compiler still in production use is?
What is the oldest compiler still in production use?
A CHILL compiler I worked on a long time ago has probably been in production use for 30 years now.
Code gets added to and deleted from production software all the time; how might ‘oldest’ be measured? I propose using the mean age of every line of code, including comments, where the age of a line is reset to zero when it is modified in any way (excluding code formatting).
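As a sketch of how the measure might be computed (the function and the sample per-line years are hypothetical; in practice the ages would come from version control blame output):

```c
#include <stdio.h>

/* Mean age of a source file's lines, given the year each line was last
   modified (comments count, reformatting-only changes do not reset age). */
double mean_line_age(const int *last_modified_year, int num_lines, int current_year)
{
    long total = 0;
    for (int i = 0; i < num_lines; i++)
        total += current_year - last_modified_year[i];
    return (double)total / num_lines;
}

int main(void)
{
    int years[] = { 1985, 1985, 1992, 2001 }; /* hypothetical per-line data */
    printf("mean age: %.2f years\n", mean_line_age(years, 4, 2013));
    return 0;
}
```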
The following are three environmental factors that enable a production compiler to get very old:
- a relatively obscure language: popular languages have new compilers written for them (compiler death through competition) or have new features added to them (requiring new lines of code which could even displace ‘aged’ code),
- a very long-lived application associated with the language: obscure languages tend to be very quickly abandoned in the dust of history unless they have a symbiotic relationship with an important application,
- very long-lived host hardware and target processor: changing either often requires substantial new code or a move to a newer compiler. For ancient, the only candidate is the IBM 370; for just really old, the Intel 80x86 and Zilog Z80.
Virtual machines provide a mechanism for being host hardware independent. The Micro Focus Cobol compiler had a rewrite in the early 1990s (it might have had others since), and I don’t think UCSD Pascal I.5 is still used for production work.
Fortran is an evolving language and very popular in some application domains. I doubt there are any (mean age) old Fortran compilers in production use.
Why do I put forward the ITT (in its International Telephone & Telegraph days; these bits were subsequently sold off) CHILL compiler as potentially the oldest compiler currently in production use?
- Obscure language and long-lived application (telephone switching software),
- host hardware was IBM 370 family, target processor Intel 8086 (later updated to support 80386),
- large development team and very small support team (i.e., lots of old code and small changes over the years),
- single customer, i.e., no push to add features to attract new customers or keep existing ones.
My last conversation with anybody associated with this compiler was a chance meeting over 10 years ago, so I might be a bit out of date.
30+ year old source code for compilers can be downloaded (e.g., the original PDP-11 C compiler), but these compilers are not in production use (forgotten-about military installations, anybody?).
I welcome other proposals for the oldest compiler currently in production use.