Evidence for the benefits of strong typing, where is it?
It is often claimed that writing software using a strongly typed programming language bestows worthwhile benefits. Those making the claims are sometimes rather vague about exactly what the benefits are, and at other times appear willing to claim almost any benefit. What does the empirical evidence have to say (let’s ignore the elephant in the room of deciding which languages count as strongly typed)?
Until recently there had been two empirical studies (plus a couple of language comparison experiments; one of the better ones involved the researcher timing himself implementing various algorithms in various languages: Zislis, “An Experiment in Algorithm Implementation”), while in the last few years a group in Germany has been experimenting away (three-ish published data sets).
Measuring changes in developer performance caused by the use of different programming languages is very hard; some of the problems include:
- every person is different: a way needs to be found to take account of differences in subject ability/knowledge/characteristics,
- every problem is different: it may be easier to write a program to solve a problem using language X than using language Y,
- it is difficult to obtain experimental subjects.
The experimental procedure adopted by all the experiments discussed here is to:
- select two different languages or the same language modified to not support some type constructs,
- get students (mostly upper-undergraduates+graduates) to volunteer as experimental subjects,
- have each subject use one language to solve a problem and then use the other language to solve the same problem. Each subject is randomly assigned to a group using a given language order (the experiments start out with an equal number of subjects in each group, but not all subjects complete every problem),
- in some cases the previous step is repeated for new problems.
Having subjects solve the same problem twice creates the opportunity for learning to occur during the implementation of the first program and for this learning to improve performance during the second implementation. The experimental procedures employed generate information that can be used during the analysis of the data (in my case using a mixed-model in R; download code and all data) to factor this ordering effect into the created model.
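To make the ordering effect concrete, here is a rough sketch of the kind of mixed-effects model involved (my own notation, not the variable names used in the downloadable code), with language and language order as fixed effects and subject as a random effect:

    time_{ij} = \beta_0 + \beta_{lang}\,lang_{ij} + \beta_{order}\,order_i + u_i + \epsilon_{ij}, \qquad u_i \sim N(0, \sigma_u^2), \quad \epsilon_{ij} \sim N(0, \sigma_\epsilon^2)

where time_{ij} is the performance of subject i on implementation j, lang_{ij} is the language used, order_i is which language that subject used first, and u_i soaks up per-subject ability differences; comparing the estimate of \beta_{lang} against \sigma_u is what allows a small language effect to be separated from large subject-to-subject variation.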
So what are the results? In chronological order we have (if you know of any more published work please tell me):
- Gannon “An Experimental Evaluation of Data Type Conventions”: Implemented compilers for two simple languages (think BCPL and BCPL plus a string type and simple structures; by today’s standards one language is not quite as weakly typed as the other). One problem had to be solved, and it was designed to require the use of features available in both languages, e.g., a string-oriented problem (final programs were between 50 and 300 lines). The result data included the number of errors made during development and the number of runs needed to create a working program (this all happened in 1977, well before the era of personal computers, when batch processing was king).
There was a small language difference in the number of errors/batch submissions; the difference was about half the size of that due to the order in which the languages were used by subjects, and both were small in comparison to the variation due to differences in subject performance. While the language effect was small, it exists. To what extent can the difference be attributed to stronger typing, rather than to only one of the languages having built-in support for a string type? Who knows; no more experiments like this were performed for 20 years.
- Prechelt & Tichy “A Controlled Experiment to Assess the Benefits of Procedure Argument Type Checking”: Used two C compilers, one accepting K&R C (i.e., no argument checking of function calls) and the other ANSI C, with subjects solving one problem using both compilers; the output data was the time taken by subjects to solve the problem.
Using the compiler that did not check argument types slowed implementation time by around 10%, about five times smaller than the variation due to subject performance (there was an ordering effect of around 30%); a minimal C sketch of what argument checking catches appears after this list.
- Mayer, Kleinschmager & Hanenberg: Two experiments used different languages (Java and Groovy) and multiple problems; the result data was the time taken by subjects to complete each task (“Do Static Type Systems Improve the Maintainability of Software Systems? An Empirical Study” and “An Empirical Study of the Influence of Static Type Systems on the Usability of Undocumented Software”). Surprisingly, there was no significant difference due to language alone; there were differences due to language order, and big differences due to language/problem interaction, with some problems solved more quickly in Java and others more quickly in Groovy. Again, there was large variation due to subject performance.
Another experiment used a single language (Java) and multiple problems that involved using either Java’s generic types or non-generic types (“Do Developers Benefit from Generic Types?”). Again, the only significant language effects occurred through interaction with other variables in the experiment (e.g., the problem or the language ordering), and again there were large variations in subject performance.
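To show what Prechelt & Tichy’s “argument checking” difference amounts to in practice, here is a minimal C sketch (my own example, not code from the experiment): with only a K&R-style declaration in scope the compiler has no parameter information and silently accepts a bad call, while an ANSI prototype turns the same call into a compile-time diagnostic.

    /* K&R-style declaration: no parameter information, so calls are not checked. */
    int area();

    /* ANSI prototype: uncomment it and the bad call below is rejected at
       compile time (passing a pointer where an int is expected is a
       constraint violation; the 2.5 would merely be converted to int).
    int area(int width, int height);
    */

    int main(void)
    {
        char *label = "oops";
        /* Compiled as C90 with only the K&R declaration above, this builds
           without complaint, even though the argument types are wrong
           (the behaviour is undefined). */
        return area(label, 2.5);
    }

    /* K&R-style definition: it does not provide a prototype, so it gives the
       compiler nothing to check calls against. */
    int area(width, height)
    int width, height;
    {
        return width * height;
    }

In essence, the ANSI compiler in the experiment gave subjects the prototype-checked behaviour, while the K&R compiler left such mistakes to be discovered at run time, if at all.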
In summary, when a language typing/feature effect has been found its contribution to overall developer performance has been small.
I think some reasons that the effects of typing have been so small, or non-existent, include (I should declare my belief that strong typing is useful):
- the use of students as subjects. Most students have very little programming experience relative to professional developers (i.e., under 100 hours vs. thousands of hours). I can easily imagine many student subjects finding the warnings produced by the type system more confusing than helpful. More experienced developers are in a position to make full use of what a type system offers, and researchers should try to use professional developers as subjects (it is not that hard to obtain such volunteers),
- the small size of the problems. Typing comes into its own when used to organize and control large amounts of code; I understand that the constraints of running an experiment limit the amount of code that can be involved.
Hey Derek,
ANSSI ran a study (as part of the LaFoSec project) where competing teams built software (an XML schema validator) in different languages, I think OCaml, C, and Scala. They compared the results and I think have some discussion about the merits of strong typing for security: http://www.ssi.gouv.fr/fr/anssi/publications/publications-scientifiques/autres-publications/lafosec-securite-et-langages-fonctionnels.html
We are also hopeful that builditbreakit.org will give some insight into differences between strong typing vs not in programmer tasks, but it depends entirely on the languages the contestants use!
@Andrew Ruef
Thanks for the reference. Looking through the reports with my almost non-existent French, it looks like a technical discussion of language constructs, but with no hard numbers on subject performance.
I saw the original announcement for builditbreakit and thought the requirements were very ambitious. I rely on users not frowning at the programs I write at Hackathons, otherwise they might crash. If you can measure developer effort (to a reasonable level of accuracy) then the results could be very interesting. On its own, a language/fault count is not that interesting.
“C++ versus Lisp: A Case Study”
Howard Trickey
ACM SIGPLAN Notices 23(2), pages 9-18 (1988)
See in particular the sections “Type Checking” and “Primitives”