Researching programming languages
What useful things might be learned from evidence-based research into programming languages?
A common answer is research into how to design a programming language having a collection of desirable characteristics, with those characteristics including one or more of: supporting the creation of reliable, maintainable, readable code; being easy to learn; or being easy to understand, etc.
Building a theory of, say, code readability is an iterative process. A theory is proposed, experiments are run, results are analysed; rinse and repeat until a theory having a good enough match to human behaviour is found. One iteration will take many years: once a theory is proposed, an implementation has to be built, developers have to learn it and then spend lots of time using it before longer term readability data can be obtained. This iterative process is likely to take many decades.
Running one iteration will require 100+ developers using the language over several years. Why 100+? Lots of subjects are needed to obtain statistically meaningful results, people differ in their characteristics and previous software experience, and some will drop out of the experiment. Just one iteration is going to cost a lot of money.
If researchers do succeed in being funded and eventually discovering some good enough theories, will there be a mass migration of developers to using languages based on the results of the research findings? The huge investment in existing languages (both in terms of existing code and developer know-how) means that to stand any chance of being widely adopted these new language(s) are going to have to deliver a substantial benefit.
I don’t see a high-cost, multi-decade research project being funded, and based on the performance improvements seen in studies of programming constructs I don’t see the benefits being that great (benefits from the use of particular constructs may be large, but I don’t see an overall factor-of-two improvement).
I think that creating new programming languages will continue to be a popular activity (it is vanity research), and I’m sure that the creators of these languages will continue to claim that their language has some collection of desirable characteristics without any evidence.
What programming research might be useful and practical to do?
One potentially practical and useful question is the lifecycle of programming languages, where the components of the lifecycle include developers who can code in the language, source code written in the language, and companies dependent on programs written in the language (who are therefore interested in hiring people fluent in the language).
Many languages come and go without many people noticing, a few become popular for a few years, and a handful continue to be widely used over decades. What are the stages of life for a programming language, what factors have the largest influence on how widely a language is used, and for how long it continues to be used?
Sixty years’ worth of data is waiting to be collected and collated; enough to keep researchers busy for many years.
The uses of a lifecycle model, that I can think of, all involve the future of a language, e.g., how much of a future does it have and how might it be extended.
Some recent work looking at the rate of adoption of new language features includes: On the adoption, usage and evolution of Kotlin Features on Android development, and Understanding the use of lambda expressions in Java; also see section 7.3.1 of Evidence-based software engineering.
I firmly believe the next great step in programming language design will be automated analysis, proving and rewriting.
Current languages, C/C++ being the poster child here…, are horrendous to analyse, horrible to prove and reason about, and terribly error prone to rewrite and refactor.
Modern industrial code bases are far larger than any single person can hope to comprehend. We need assistance. Whilst grep and semgrep enable some weak assertions to be made about an entire code base… they are weak and deeply flawed.
Type systems are useful, but again are weak proxies for what we really want: Design by Contract.
Ideally there should be a strong correspondence, for example, between the class invariants and instances of a class; in practice most language designs provide little or no support.
Ideally, if the programmer declares a precondition, postcondition, or invariant, the compiler should both …
* check and warn if they may be violated,
* and make use of the information given for optimization (a rough sketch of the checking half follows).
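Not from any existing tool, just a hand-written Python illustration of what the checking half could look like; the contract decorator is a made-up stand-in for checks a compiler could generate and reason about automatically.

from functools import wraps

def contract(pre=None, post=None):
    # Hypothetical decorator: check a precondition on the arguments and a
    # postcondition on the result, raising AssertionError on violation.
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if pre is not None:
                assert pre(*args, **kwargs), f"precondition violated in {func.__name__}"
            result = func(*args, **kwargs)
            if post is not None:
                assert post(result), f"postcondition violated in {func.__name__}"
            return result
        return wrapper
    return decorate

@contract(pre=lambda xs: len(xs) > 0, post=lambda r: r >= 0)
def mean_abs(xs):
    return sum(abs(x) for x in xs) / len(xs)

print(mean_abs([1, -2, 3]))   # 2.0
# mean_abs([]) would fail the precondition check at the call boundary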
@John Carter
Providing language feature X, whose use provides various proven benefits, does not mean that developers will use said feature.
My experience with Pascal developers was that many did not make use of the strong typing functionality provided when it required them to spend time thinking about what to do.
For instance, Pascal supported integer subranges, e.g., some_var : 1..10; specifies that some_var holds values between one and ten (the compiler is required to do runtime checking). This kind of check means that some problems are quickly found and localised. But it requires thinking about the range of possible values, which means developers have to stop and think (always a good idea), and they don’t like it (much more fun to write the code and then spend hours debugging it at runtime).
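A rough Python equivalent of the runtime check that such a subrange declaration implies (the SubrangeInt class below is made up for illustration; Python has no subrange types):

class SubrangeInt:
    def __init__(self, low, high, value):
        # the runtime check a Pascal subrange declaration implies
        if not (low <= value <= high):
            raise ValueError(f"{value} is outside the subrange {low}..{high}")
        self.low, self.high, self.value = low, high, value

some_var = SubrangeInt(1, 10, 7)    # accepted
# SubrangeInt(1, 10, 42) raises ValueError at the point of assignment,
# rather than surfacing hours later at some distant use of the value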
The only solution is for there to be more people who can program looking for programming jobs than there are jobs. Once changing jobs becomes difficult, developers will have to start paying attention to things that work.
You have correctly identified the main impediment to switching languages, namely “The huge investment in existing languages”, even when the existing legacy is bit-compatible. I would argue legacy conversion is Herculean. On the other CPU, however, conversion is possible in systems comprising numerous communicating processors.
Just a quick cheap shot: One thing we know about programming languages is that one major source of error (both in programming and reading code) is mis-guessing what operator precedence is going to do. But no one designs languages that don’t use operator precedence.
(While this is a cheap shot, it’s correct. It’s also an easy cheap shot for someone for whom the majority of their programming has been in Lisp. Programming without operator precedence is actually quite wonderful; you never get bitten by it.)
Serious question: is “easy to learn” a good thing? What are the costs/downsides? Does it even work? E.g. I’m doing a tiny amount of linguistics hacking (of Japanese) in Python (which claims to be easy to learn), and every time I need to write a loop, I have to go running to the manual. (I spend much more time running my code than writing it, so not a fair complaint, perhaps.)
(I have good intentions on commenting on your human cognition chapter but (a) things got nuts here, (b) I only recently figured out how to explain my complaints with it, and (c) the dog ate my homework.)
@David J. Littleboy
APL is probably the only ‘widely’ used language where all operators have the same precedence.
Not using operator precedence would solve one problem, but would it create others, e.g., mistakes caused by relying on order of evaluation? Also, school children are taught that multiplication has higher precedence than addition, and it would be unwise to go against something so ingrained.
What is the lifecycle of language use? Learn it (short time), use it (much longer time), not use it (promoted to a level where this skill is not required).
Learnability is not really a big issue in industry. It is an issue in universities because there is demand from students wanting to learn to program, but given the steps taken to simplify the process to maximise throughput, I’m not sure that many of these student ‘programmers’ would be employed in this role (because of their inability to do anything without lots of hand-holding).
@Derek Jones
“would it create others, e.g., mistakes caused by relying on order of evaluation”
Is APL still around? I remember the IBM APL typeballs from back in the day.
Ah, sorry. I’m advocating completely parenthesizing all expressions. You have to say exactly what you mean. No one will ever again think that X OR Y AND A OR Z should mean an AND of two OR-connected conditions. (This is a cognitive issue in programming: when you are thinking about a problem in which two (or more) conditions must be met, you can’t believe that the compiler could be so dumb as to not see that that’s what’s going on.)
(AND (OR X Y)
(OR A Z))
And to add insult to injury, there’s no good way to pretty-print syntaxy language code containing hairy expressions. Really, there’s not, since there’s no good place to put the operators.
((X OR Y)
AND
(A OR Z))
If the two subexpressions are complex (multi-line) expressions of different sizes, it can be hard to find the highest level operator.
(By the way, Japanese programmers prefer postfix notation, since Japanese is verb final. (Well, one Japanese postdoc at Yale did a personal project using postfix notation, and it irritated me no end that I didn’t figure out instantly why he was being so wacky.))
FWIW, the operator precedence taught children in school doesn’t really work: there’s been a recent internet meme in which different people get different values for an arithmetic expression that includes implicit multiplication.
But I do realize that in the real world, languages will continue to use operator precedence, and people will continue to misread and miswrite complex expressions unnecessarily. Lemmings.
Another learnability context: Since Python was the first language I ran across that handles Unicode strings transparently (Go and Julia want you to work with variable-width characters, a seriously insane idea), I’ve been seeing a lot of learnability in the context of folks who don’t want to do Comp. Sci. at uni, but want in on the software game as self-taught programmers. FWIW.
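As a small illustration of what “transparently” means here, Python 3 strings behave as sequences of Unicode code points rather than bytes:

s = "日本語"
print(len(s))                   # 3: length in code points, not bytes
print(s[1])                     # 本: indexing selects a code point
print(len(s.encode("utf-8")))   # 9: the underlying UTF-8 byte count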
@David J. Littleboy
Yes, APL is still around, and there are modern variants like K; all members of what are known as array processing languages.
The book “Mind Bugs” by VanLehn builds various models for the mistakes children make when performing arithmetic; it contains lots of data on arithmetic mistakes.
From the learnability perspective the Inform language is close to English. Whether this makes learning to program easier depends on the kinds of problems being solved.
I believe a lot of potentially very useful language-related research could look at language use in the field as opposed to the properties of the language itself.
For instance, a real instance of X OR Y AND A OR Z is likely more like
(error_count > 0 OR event_count == 0) AND (errors_reported == error_count OR silent)
and the more interesting question is whether the programmer will simplify the understanding of this logic by introducing additional descriptive names by means of intermediate variables.
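A rough sketch of that refactoring, with the function and variable names being hypothetical, chosen only to match the example expression:

def should_flag_run(error_count, event_count, errors_reported, silent):
    # descriptive intermediate variables, in place of one large expression
    run_had_problems = error_count > 0 or event_count == 0
    reporting_accounted_for = errors_reported == error_count or silent
    return run_had_problems and reporting_accounted_for

print(should_flag_run(error_count=2, event_count=5, errors_reported=2, silent=False))   # True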
Are programmers in language A doing this much more often than others in language B?
Why?
Is there generally a style with better understandability in community A?
If so, what have they done to get there?
Can other communities replicate this? Or find other ways that suit their culture?
@Lutz Prechelt
Yes, research on language use has lots of potential benefits, and is the subject of the source code chapter of my book.
Most languages have the same operator precedence as C (Fortran is the furthest away, as is R, because it follows Fortran), at least for the same operators.
I ran experiments in 2006 and 2007 to gain some evidence for the hypotheses that developer beliefs about operator precedence improve with practice (they did), and that naming plays an important role (it did).