Source code will soon need to be radiation hardened
I think I have discovered a new kind of program testing that may soon need to be performed by anybody wanting to create ultra-reliable software.
A previous post discussed the compiler-related work being done to reduce the probability that a random bit-flip in the memory used by an executing program will result in a change of behavior. At the moment, 4GB of RAM is expected to experience one bit-flip every 33 hours due to cosmic rays, and the rate of occurrence is likely to increase.
Random corruption on communications links is guarded against by various kinds of CRC and checksum calculations. But these checks don’t catch every corruption; some errors get through.
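To give a feel for how a corruption can slip through, here is a minimal sketch (in Python, with made-up packet contents) of the 16-bit ones’-complement checksum used in the IP and TCP headers, together with a two bit-flip corruption that it cannot detect:

    def internet_checksum(data):
        # 16-bit ones'-complement checksum (RFC 1071), as used by IP and TCP
        if len(data) % 2:
            data += b"\x00"                            # pad odd-length data
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)   # fold in the carry
        return ~total & 0xFFFF

    original  = b"\x00\x01\x00\x02"   # two 16-bit words: 0x0001, 0x0002
    corrupted = b"\x00\x00\x00\x03"   # bit 0 flipped in both words (1->0 and 0->1)

    assert original != corrupted
    assert internet_checksum(original) == internet_checksum(corrupted)

Any pair of flips that sets a bit in one 16-bit word and clears the same bit position in another leaves the sum, and hence the checksum, unchanged; the Ethernet CRC is stronger, but it only protects a frame on one link and is recomputed by every router and switch along the path.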
Research by Artem Dinaburg looked for, and found, occurrences of bit-flips in domain names appearing within HTTP requests, e.g., a page being requested from the domain ikamai.net rather than from akamai.net. A subsequent analysis of DNS queries to Verisign’s name servers found “… that bit-level errors in the network are relatively rare and occur at an expected rate.” (the bit errors were thought to occur inside routers and switches).
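The misspelling is consistent with a single flipped bit: the ASCII codes for ‘a’ and ‘i’ differ in exactly one bit position, as a quick Python check shows:

    print(format(ord('a'), '08b'))              # 01100001
    print(format(ord('i'), '08b'))              # 01101001
    print(bin(ord('a') ^ ord('i')).count('1'))  # 1, i.e., exactly one differing bit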
JavaScript is the web scripting language supported by all the major web browsers, and the source code of JavaScript programs is transmitted, along with the HTML, for requested web pages. The amount of JavaScript source can dwarf the amount of HTML in a web page; measurements from four years ago show users of Facebook, Google Maps and Gmail receiving 2M bytes of JavaScript source when visiting those sites.
If all the checksums involved in TCP/IP transmission are enabled, the theoretical undetected error rate is 1 in 10^17 bits. For 1 billion users visiting Facebook on average once per day and downloading 2M bytes of JavaScript source per visit, that works out to an undetected bit-flip getting through roughly once every five or six days somewhere in the world; not really something to worry about.
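The arithmetic behind that estimate, using the figures quoted above:

    undetected_rate = 1 / 1e17     # undetected bit-error rate with all TCP/IP checksums enabled
    visits_per_day  = 1e9          # 1 billion Facebook visits per day
    bits_per_visit  = 2e6 * 8      # 2M bytes of JavaScript source per visit

    flips_per_day = undetected_rate * visits_per_day * bits_per_visit
    print(1 / flips_per_day)       # about 6 days between expected undetected flips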
There is plenty of evidence that the actual error rate is much higher (because, for instance, some checksums are not always enabled; see the papers linked to above). How much worse does the error rate have to get before developers need to start checking that a single bit-flip to the source of their JavaScript program does not result in something nasty happening?
What we really need is a way of automatically radiation hardening source code.
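What such testing might look like in practice is single-bit mutation of the shipped source: flip every bit in turn and see which mutants still get accepted. The sketch below is Python throughout, with the built-in compile() standing in for a JavaScript parser and a made-up file name; a real check would run the site’s test suite against each surviving mutant:

    def bit_flip_mutants(source):
        # generate every single-bit-flip mutant of the source bytes
        for byte_index in range(len(source)):
            for bit in range(8):
                mutant = bytearray(source)
                mutant[byte_index] ^= 1 << bit
                yield bytes(mutant)

    def still_parses(mutant):
        # stand-in acceptance check: does the mutated source still parse?
        try:
            compile(mutant.decode("utf-8"), "<mutant>", "exec")
            return True                     # a flip the language itself would not catch
        except (UnicodeDecodeError, SyntaxError, ValueError):
            return False                    # flip rejected before the code could run

    source = open("program.py", "rb").read()    # hypothetical file under test
    survivors = sum(still_parses(m) for m in bit_flip_mutants(source))
    print(survivors, "of", len(source) * 8, "single-bit flips still parse")

The interesting mutants are the survivors: source where a flipped bit produces a program that still runs, but does something different.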
Ethernet frames contain a 32-bit CRC “Frame Check Sequence”; on top of this, the IP header contains a 16-bit checksum, and TCP also has a 16-bit checksum. On top of that, communication systems themselves (e.g., ADSL modems) usually employ Forward Error Correction (e.g., Hamming codes, Viterbi, LDPC, Turbo codes, etc.).
You say that this results theoretically in one error for every 1E17 bits; however, that’s not true. If the bit error rate were actually high, we would see a lot of packet loss and failed CRCs (interfaces have counters to keep track of bad packets), but we don’t see that; we see consistently low packet loss. In addition, analysis of the Ethernet CRC indicates that single-bit errors and pairs of bit errors are already reliably detected, so we would need some larger bundle of errors that just happens to match all the CRCs and checksums, without the underlying error rate being high enough to give us a warning in the form of high packet loss.
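For what it is worth, the single-bit part of that claim is easy to confirm by brute force on a sample message; Python’s zlib.crc32 uses the same generator polynomial as the Ethernet Frame Check Sequence:

    import zlib

    frame = b"an example payload standing in for an Ethernet frame"
    fcs = zlib.crc32(frame)

    for byte_index in range(len(frame)):               # flip every bit in turn
        for bit in range(8):
            damaged = bytearray(frame)
            damaged[byte_index] ^= 1 << bit
            assert zlib.crc32(bytes(damaged)) != fcs   # every single-bit flip is detected

    print("all", len(frame) * 8, "single-bit flips change the CRC")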
I think I can confidently say that the probability is rather low, and not worth worrying about. I think that “ikamai.net” is more likely to be someone who just didn’t know how to spell “akamai.net” than any network error.
@Tel
Networks do see ‘lots’ of checksum failures, and the affected data gets resent. The figure is for a bit-flip that gets through undetected.
I think your confidence would be considerably dented if you read the papers linked to above. I found their analysis convincing.