Archive

Author Archive

Build an ISO Standard and the world will beat a path to your door

December 16, 2010 No comments

An email I received today, announcing the release of version 1.0 of the GNU Modula-2 compiler, reminded me of some plans I had to write something about a proposal to add some new definitions to the next version of the ISO C Standard.

In the 80s I was heavily involved in the Pascal community and some of the leading members of this community thought that the successor language designed by Niklaus Wirth, Modula-2, ought to be the next big language. Unfortunately for them this view was not widely shared and after much soul searching it was decided that the lack of an ISO standard for the language was responsible for holding back widespread adoption. A Modula-2 ISO Standard was produced and, as they say, the rest is history.

The C proposal involves dividing the existing definition undefined behavior into two subcategories; bounded undefined behavior and critical undefined behavior. The intent is to provide guidance to people involved with software assurance. My long standing involvement with C means that I find the technical discussions interesting; I have to snap myself out of getting too involved in them with the observation that should the proposals be included in the revised C Standard they will probably have the same impact as the publication of the ISO Standard had on Modula-2 usage.

The only way for changes to an existing language standard to have any impact on developer usage is for them to require changes to existing compiler behavior or to specify additional runtime library functionality (e.g., Extensions to the C Library Part I: Bounds-checking interfaces).

C is top dog in Google AI Challenge

December 4, 2010 No comments

The Google AI Challenge has just finished and the scores for the 4,619 entries have been made available, along with the language used. Does the language choice of the entries tell us anything?

The following is a list of the 11 most common languages used, the mean and standard deviation of the final score for entries written in that language and the number of entries written in the language:

           mean      sd    entries
C          2331     867       18
Haskell    2275     558       51
OCaml      2262     567       12
Ruby       2059     768       55
Java       2060     606     1634
C#         2039     612      485
C++        2027     624     1232
Lisp       1987     686       32
Python     1959     637      948
Perl       1957     693       42
PHP        1944     769       80

C has the highest mean and Lisp one of the lowest; is this a killer blow for Lisp in AI? (My empirical morality prevented me omitting the inconveniently large standard deviations.) C does have the largest standard deviation; is this because of crashes caused by off-by-one/null pointer errors or lower ability on the authors part? Not nearly enough data to tell.

I am guessing that this challenge would be taken up by many people in their 20s, which would explain the large number of entries written in Java and C++ (the most common languages taught in universities over the last few years).

I don’t have an explanation for the relatively large number of entries written in Python.

How to explain the 0.4% of entries written in C and it top placing? Easy, us older folk may be a bit thin on the ground but we know what we are doing (I did not submit an entry).

Categories: Uncategorized Tags:

Flawed analysis of “one child is a boy” problem?

November 6, 2010 5 comments

A mathematical puzzle has reappeared over the last year as the topic of discussion in various blogs and I have not seen any discussion suggesting that the analysis appearing in blogs contains a fundamental flaw.

The problem is as follows: I have two children and at least one of them is a boy; what is the probability that I have two boys? (A variant of this problem specifies whether the boy was born first or last and has a noncontroversial answer).

Most peoples (me included) off-the-top-of-the-head answer is 1/2 (the other child can be a girl or a boy, assuming equal birth probabilities {which is very a good approximation to reality}).

The analysis that I believe to be incorrect goes something like the following: The possible birth orders are gb, bg, bb or gg and based on the information given we can rule out girl/girl, leaving the probability of bb as 1/3. Surprise!

A variant of this puzzle asks for the probability of boy/boy given that we know one of the children is a boy born on a Tuesday. Here the answer is claimed to be 13/27 (brief analysis or using more complicated maths). Even greater surprise!

I think the above analysis is incorrect, which seems to put me in a minority (ok, Wikipedia says the answer could sometimes be 1/2). Perhaps a reader of this blog can tell me where I am going wrong in the following analysis.

Lets call the known boy B, the possible boy b and the possible girls g. The sequence of birth events that can occur are:

Bg gB bB Bb gg

There are four sequences that contain at least one boy, two of which contain two boys. So the probability of two boys is 1/2. No surprise.

All of the blog based analysis I have seen treat the ordering of a known boy+girl birth sequence as being significant but do not to treat the ordering of a known boy+boy sequence as significant. This leads them to calculate the incorrect probability of 1/3.

The same analysis can be applied to the “boy born on a Tuesday” problem to get the probability 14/28 (i.e., 1/2).

Those of you who like to code up a problem might like to consider the use of a language like Prolog which I would have thought would be less susceptible to hidden assumptions than a Python solution.

Categories: Uncategorized Tags:

Ordering of members in Oracle/Google Java copyright lawsuit

November 2, 2010 No comments

Oracle have specified some details in their ‘Java’ copyright claims against Google. One item that caught my eye was the source of a class copyrighted by Oracle and the corresponding class being used by Google. An experiment I ran at the 2005 and 2008 ACCU conferences studied how developers create/organize data structures from a given specification and sheds some light on developer behavior in this area (I am not a lawyer and have never been involved in a copyright case, so I will not say anything about legal issues).

Two of the hypothesis investigated by the ACCU experiment were, 1) within aggregate types (e.g., structs or classes) members having the same type tend to be placed together (e.g., all the chars followed by all the ints followed by all the floats), and 2) the relative ordering of members follows the order of the corresponding items in the specification (e.g., if the specification documents the date and then the location and this information occurs within the same structure the corresponding date member will appear before the location member).

The first hypothesis was investigated by analyzing C source and found that members having the same type are very likely to be grouped together. Looking at the private definitions in the Oracle code we see that members having a HashSet or boolean type are not grouped together with other members of the same type; one possibility was that the class grew over time and new members were appended rather than intermixed with members already present. The corresponding Google code also has the HashSet or boolean intermixed with members having different types. At least one implementation exists where members having the same type are grouped together.

The second hypothesis was investigated by running an experiment which asked developers to create an API from a specification, with everybody seeing the same specification except the ordering of the information was individually randomized for each developer. The results showed that when creating a struct or class (each subject was allowed to create the API in a language of their choice, with C, C++ and Java actually being used) developers tended to list members in the same relative order as the information appeared in the specification.

If the Apache implementors of the code used by Google based their specification on the Oracle code, it is to be expected that the ordering of the members would be very similar, which is it.

There has been a lot of talk of clean room implementation being used for the Java-like code within Dalvik. Did somebody sit down and write a specification of the Oracle PolicyNode class that was then given to somebody else to implement?

The empirical evidence (i.e., members having the same type are not adjacent but intermixed with other types and relative member ordering is the same) suggests that the implementation of at least the private members of the PolicyNode class being used by Google was based directly on the Oracle code.

These members are defined to be private so Google cannot argue that they are part of an API (various people are claiming that being part of an API has some impact on copyright, I have no idea if this is true).

Can Google get around any copyright issue simply by reordering the member definitions? I have no idea.

It would appear that like the patent lawsuit this copyright lawsuit could prove to be very interesting to watch.

Categories: Uncategorized Tags:

Language vulnerabilities TR published

October 28, 2010 No comments

ISO has just published the first Technical Report from the Programming Language Vulnerabilities working group (ISO/IEC TR 24772 Information technology — Programming languages — Guidance to avoiding vulnerabilities in programming languages through language selection and use). Having worked on programming language coding guidelines for many years, I was an eager participant during the first few years of this group. Unfortunately, this committee succumbed to an affliction that affects many groups working in this area, a willingness to uncritically include any reasonable sounding proposal that came its way.

Coding guidelines are primarily driven by human factors with implementation issues coming to the fore in some situations and guideline selection involves a cost/benefit trade-off. Early versions of the TR addressed these topics directly. As work progressed many members of WG23 (our official designation is actually ISO/IEC JTC 1/SC 22/WG 23) seemed to want a concrete approach, i.e., an unstructured list of guidelines, rather than a short list of potential problem areas derived by careful analysis.

A major input came from Larry Wagoner’s analysis of fault reports from various sources; he produced a list of the most common problems, and it was generally agreed that these be used as the base from which to derive the general guidelines.

Work began on writing the text to cover the faults listed by Larry, coalescing similar constructs where appropriate. Many of the faults occurred in programs written in C and some attempt was made to generalize the wording to be applicable to other languages (the core guidelines were intended to be language independent).

Over time suggestions were sent to the group’s convener and secretary for consideration (these came from group members and interested outsiders). Unfortunately, many of these proposals were accepted for inclusion without analysing the likelihood of them being a common source of faults in practice or the possibility that use of alternative constructs would be more fault-prone.

Some members of the group were loath to remove text from the draft document if it was decided that the issue discussed was not applicable. This text tended to migrate to a newly created Section 7 “Applications Issues”, a topic outside our scope of work).

The resulting document is a mishmash of guidelines likely to be relevant to real code and others that are little more than platitudes (e.g., ‘Demarcation of Control Flow’ and ‘Structured Programming’) or examples of very specific problems that occasional occur (i.e., not common enough to be worth mentioning such as ‘Enumeration issues’).

I eventually threw up my hands in despair at trying to stem the flow of ‘good’ ideas that kept being added, and stopped being actively involved.

Would I recommend TR 24772 to people? It does contain some very useful material. The material I think should have been culled is generally not bad, it just has a very low value and for most developers is probably not worth spending time on.

Footnote. ISO makes its money from selling standards and does not like to give them away for free. Those of us who work on standards want them to be used and would like them to be freely available. There is a precedent for TRs being made freely available, and hopefully this will eventually happen to TR 24772. In the meantime, there are documents on the WG23 web site that essentially contain all of the actual technical text.

Categories: Uncategorized Tags:

Predictor vs. control metrics

October 3, 2010 1 comment

For some time I have been looking for a good example to illustrate the difference between predicting some software attribute and controlling that attribute. A recent study comparing two methods of predicting a person’s height looks like it might be what I am after.

The study compared the accuracy of using information derived from a person’s genes to information on their parents’ height.

If a person is not malnourished or seriously ill while growing, then their genes control how tall they will finish up as an adult. Tinker with the appropriate genes while a person is growing and their final height will be different; tinker after they have reached adulthood and height is unlikely to change,

Around 150 years ago Francis Galton developed a statistical technique that enabled a person’s final adult height to be estimated from their parent’s height. Parent height is a predictor, not a controller, of their children’s height.

A few metrics have been found empirically to correlate with faults in code and a larger number, having no empirical evidence of a correlation, have been proposed as fault ‘indicators’. Unfortunately many software developers and most managers are blind to the saying “correlation does not imply causation”.

I’m sure readers would agree that once a baby is born changes to their parents height has no effect on their final adult height. Following this line of argument we would not expect changing source code to reduce the number of faults predicted by some metric to actually reduce the number of faults (ok, fiddling with the source might result in a fault being spotted and fixed).

The interesting part of the study was the finding that the prediction based on existing knowledge about the genetics of height explained 4-6% of subjects’ sex- and age-adjusted height variance, while Galton’s method based on parent height explained 40%. Today, 60 years after the structure of DNA was discovered our knowledge of what actually controls height is nowhere near as encompassing as an attribute that is purely a predictor.

Today, do we have have a good model of what actually happens when code is written? I think not.

How much time and money is being wasted changing software to improve its fit to predictor metrics rather than control metrics (I have little idea what any of these might be)? I dread to think.

Categories: Uncategorized Tags:

So what if it is undecidable or NP!

September 19, 2010 3 comments

An all too frequent refrain I hear when discussing how to solve a problem is “… but it is undecidable” and this is often followed by the statement that it is not worth attempting to solve the problem. Sometimes NP-completeness gets mentioned in the mix of mumbo-jumbo that is talked about this such problems.

Just because a problem is undecidable does not mean that developers should not attempt to write code to solve it. When a problem is undecidable there exists at least one set of inputs for which the algorithm you use cannot be relied on to give a correct answer and however much you rework the algorithm at least one such set of input will always exist (this set of input may change from algorithm to algorithm).

Undecidability theorems do not say anything about what percentage of all possible input sets have this ‘not known if answer is correct’ property. If there are other theorems addressing this question I don’t know about them (and I don’t claim to be an expert or up to date on this topic).

Mathematicians (well at least those working on Zermelo-Fraenkel set theory) spend lots of time trying to find problems that are undecidable and would be overjoyed to find an instance of a problem/dataset where undecidability actually occurred.

Undecidability is an interesting property; however I think educationalists should stop overselling it and give their students some practical advice such as “Don’t worry, these situations are extremely rare and perhaps many orders of magnitude less likely to generate a fault than the traditional fault generation methods.”

NP’ness deals with how execution time grows as the problem input size grows. It is easy to jump to the conclusion that problems known to be in P are ‘good’ and those believed to be in NP are ‘bad’; and that NP problems take so long to solve they are not worth consideration. For small input sets the constant multipliers in the non-polynomial time equation may mean that the actual execution time is acceptable; this is something that requires a bit of experimentation to find out.

Sometimes a NP problem can be solved in an acceptable amount of time.

The reverse situation is also true. While polynomial execution time does not grow as fast as non-polynomial (which could be exponential or worse) execution time, for small input sets the constant multipliers in the polynomial time equation may result in it starting off with a huge overhead, with problem size only becoming a significant factor for very large inputs.

Sometimes P problems cannot be solved in an acceptable amount of time.

Thinking in R: vectors

September 4, 2010 No comments

While I have been using the R language/environment for over five years or so, whenever I needed to do some statistical calculations, until recently I was only a casual user. A few months ago I started to use R in more depth and made a conscious decision to try and do things the ‘R way’.

While it is claimed that R is a functional language with object oriented features these are sufficiently well hidden that the casual user would be forgiven for thinking R was your usual kind of domain specific imperative language cooked up by a bunch of people who had not done this sort of thing before.

The feature I most wanted to start using in an R-like way was vectors. A literal value in R is not a scalar, it is a vector of length 1 and the built-in operators take vector operands and return a vector result, e.g., the result of the multiplication (c is the concatenation operator and in the following case returns a vector of containing two elements)

> c(1, 2) * c(3, 4)
[1] 3  8
>

The language does contain a for-loop and an if-statement, however the R-way is to use the various built-in functions operating on vectors rather than explicit loops.

A problem that has got me scratching my head for a better solution than the one I have is the following. I have a vector of numbers (e.g., 1 7 1 2 9 3 2) and I want to find out how many of each value are contained in the vector (i.e., 1 2, 2 2, 3 1, 7 1, 9 1).

The unique function returns a vector containing one instance of each value contained in the argument passed to it, so that removes duplicates. Now how do I get a count of each of these values?

When two vectors of unequal length are compared the shorter operand is extended (or rather the temporary holding its value is extended) to contain the same number of elements as the longer operand by ‘reusing’ values from the shorter operand (i.e., when indexing reaches the end of a vector the index is reset to the beginning and then moves forward again). This means that in the equality test X == 2 the right operand is extended to be a vector having the number of elements as X, all with the value 2. Some of these comparisons will be true and others false, but importantly the resulting vector can be used to index X, i.e., X[X == 2] returns a vector of 2’s, one for each occurrence of 2 in X. We can use length to obtain the number of elements in a vector.

The following function returns the number of instances of n in the vector X (now you will see why thinking the ‘R way’ needs practice):

num_occurrences = function(n)
   {
   length(X[X == n])  # X is a global variable
   }

My best solution to the counting problem, so far, uses the function lapply, which applies its second argument to every element of its first argument, as follows:

unique_X = unique(X)  # I know, real staticians use <- for assignment
X_counts = lapply(unique_X, num_occurrences)

Using lapply feels like a bit of a cheat to me. Ok, the loop is in compiled C code rather than interpreted R code, but an R function is being called for every unique value.

I have been rereading Matter Computational which has gotten me looking for a solution like those in the first few chapters of this book (and which will probably be equally obscure).

Any R experts out there with suggestions for a non-lapply solution?

Categories: Uncategorized Tags:

Oracle/Google ‘Java’ patents lawsuit

August 15, 2010 4 comments

At the technical level the Oracle ‘Java’ lawsuit against Google is not really about Java at all. The patents cited in the lawsuit (5,966,702, 6,061,520, 6,125,447, 6,192,476, RE38,104, 6,910,205 and 7,426,720) involve a variety of techniques that might be used in the implementation of a virtual machine supporting JIT compilation for software written in any language; Oracle does not get into the specifics of which parts of Android make use of these patented techniques (Google has not released the Dalvik source code). With regard to the claimed copyright infringement, I have no idea what Oracle claim has been copied; as I understand it Android uses Apache’s Harmony code for its Dalvik VM runtime library, but there is a possibility that it might have used some of Oracle’s gpl’ed code.

No doubt Google’s lawyers will be asking for specific details and once these are provided they have three options: 1) find prior art that invalidates the patent, 2) recode the implementation so it does not infringe the patent or 3) negotiate a license with Oracle.

My only experience in searching for prior art was as an adviser to the Monitoring Trustee appointed by the European Commission in the EU/Microsoft competition court case; Microsoft’s existing patents were handled by a patent lawyer and I looked at some of the non-patented innovations being claimed by Microsoft.

The obvious tool to use in a search for prior art is Google and this worked well for me provided the information was created after the mid to late 1990s. The cut-off date for historical searches seemed to be around 1993 (this all happened before Google Books was released). Other sources of information included old books (I had not moved in 15 years and my book collection had not been culled), computer manuals and one advisor found what he was looking for in some old computer magazines bought on e-bay.

Some claimed innovations appeared so blindingly obvious that it was difficult to believe that anybody would bother mentioning it in a print. Oracle’s patent 6,910,205
“Interpreting functions utilizing a hybrid of virtual and native machine instructions” (actually a continuation of 6,513,156, filed five years earlier in 1997) describes an obvious method of handling the interface between interpreted and JIT compiled code.

My only implementation experience in this area does predate the Oracle patent (I wrote a native code generator for the UCSD Pascal p-code machine in the mid-80s) but the translation occurred before program execution and so is not appropriate prior art. Virtual machines fell out of favor in the late 80s and when they were revived by Java 10 years later I was no longer working on compilers

Fans of Lisp are always claiming that everything was first done by the Lisp community and there were certainly Lisp compilers around in the 1980s, but I’m not sure if any of them worked during program execution.

Of course Groklaw have started following this case. Kodak has shown that it is possible to win in court over Java related patents. The SunOracle/Microsoft Java lawsuit in the late 1990s was a contract dispute and not about patents/copyright, but still about money/control.

I’m not aware of any previous language implementation patent lawsuits that have gone to court. It will be interesting to see how much success Google have in finding prior art and whether it is possible to implement a JVM that does not make use of any of the techniques specified in the Oracle patents (I suspect that they have a few more that they have not yet cited).

Categories: Uncategorized Tags: ,

Behavior patterns in Colonel Blotto

August 11, 2010 No comments

People look for patterns and expect others to follow patterns of behavior. Some preliminary data from a game of Colonel Blotto run by a group of Cornell University graduate students provides a vivid example of the existence of these behavior patterns.

In the Cornell implementation of the Colonel Blotto game, there are 10 fields and 100 soldiers. A player specifies how many soldiers are posted to each of the 10 fields. Players don’t know what the opposing players will do. In each field the battle is decided by whoever has the most soldiers with equal numbers resulting in a draw. Whoever wins the most battles wins the war.

One possible soldiers per field sequence is 10 10 10 10 10 10 10 10 10 10 and another is 1 11 11 11 11 11 11 11 11 11. The second sequence will lose in the first field, but win the other 9, and therefore win the war. A third sequence is 9 11 12 12 12 12 8 8 8 8 which beats the second sequence but only draws against the first.

The number of ways of placing s indistinguishable soldiers into f fields is:

(matrix{2}{1}{{s + f -1}  {f - 1}})

( matrix{2}{1}{109 9} )  =  4263421511271  approx  4.3  10^12

The following original analysis assumes that the soldiers are distinguishable, which they are not and so the number calculated is way too large (thanks to Forbes at The Virtuosi for pointing this out): There are 10 ways of placing the first soldier, 10 ways of placing the second and so on, giving 10^100 possible arrangements.

Some arrangements obviously always lose (25 25 25 25 0 0 0 0 0 0) and won’t be played; it seems reasonable to assume that no field will contain zero (I will talk about putting 0 in one field later), so having a field contain 1 is a draw at best. Lets assume that all fields contain at least 2 soldiers, which means we have 80 left to arrange giving 10^80 possibilities; a very big drop in possibilities but still a huge number.

( matrix{2}{1}{89 9} )  =  635627275767  approx  6.4  10^11

The more soldiers appearing in a field the more likely a win will occur there, but after a while the law of diminishing returns kicks in and losses occurring in understaffed fields will be greater. Lets assume that people decide never to put more than 20 soldiers in a field; how many possibilities does this rule out? If one field initially contains 21 soldiers and the others 2 each the remaining 61 soldiers create 10^61 unused possibilities, putting 21 soldiers in the second field generates a bit less than 10^61 new possibilities; lets round up and say there are a total of 10^62 unused possibilities. Compared to 10^80 the unused 10^62 is small enough to be ignored.

10^80 is such a big number that Chimpanzees randomly typing away since the dawn of time are unlikely to create any discernible pattern in the landscape.

If the players in this game had given random answers what would be the mean number of points scores and its variance?

Comparing the soldier sequence given by player A against the sequence given by player B produces 11 possible scores, {0, 10}, {1, 9} … {5, 5} … {10, 0}, five of them resulting in a win for A, five in a win for B and one in a draw. With 2 points for a win, 1 point for a draw and 0 points for a loss the expected number of points for each of the two players is 1, so for N players the expected mean of each player is N. In a game containing 550 players (the number that had played when I extracted the score results today; 347 had played in the preliminary Cornell data set) we would expect the average score to be 550.

The variance can be calculated from:

Var(X) = E[X^2]  -  E[X]^2

E[X^2] = (5/11) 0^2 + (1/11) 1^2 + (5/11) 2^2 = 21/11

Var(X) = 21/11 - 1^2 = 10/11

The standard deviation about our mean is sqrt{variance * players} or 22.4.

Plotting a histogram of points scored, at the time of writing, we get the following:

Histogram for player points.

This has a peak at 550 where we would expect it and its width is two standard deviations 🙂 However those ‘side-lobes’ were not expected and they are asymmetric. I think the reason for these unexpected ‘side-lobes’ is that some players give multiple answers, attempting to improve their ranking (some strategy name are suggestive of retries with modified strategies) and consequently decreasing the total points of other players. A search space of 10^80 sequences is so large that we would not expect players to be able to learn from the results of previously submitted sequences unless many of the previously submitted sequences followed some pattern that they could recognise.

What pattern of behavior might players follow? I can imagine players starting low/high and then shifting to high/low, perhaps following this sequence two or more times.

A preliminary summary of some of the data seen so far by Alex Alemi gave me enough information to figure out a sequence, in three attempts, that ranked 1st (a typo on my part prevented it being on the second attempt); my sequence of player names is averages, aver_p6, low_high_dip, followed by a bit of tuning: low_high_dipish, lohilohi_slide, followed by two experiments: the_well, heat_map, and user fooly started getting close and I tuned the sequence a bit more lohilohi_sliiide.

I used the sequences from the top 25 ranked players for my analysis and did not make any attempt to deduce other player patterns. My first attempt averaged the number of soldiers in each field, ranked 62nd as I write (aver_p6). Then I spotted that several distribution patterns existed, I picked what looked like the most promising and averaged over it (by eye) to get the then current to number 1 (currently languishing in 8th).

Had I put a bit more thought into things I could also have used the heat map near the bottom of the post (thanks to Alex Alemi for letting me post a copy here; “With Y soldiers in slot X, what percentage of the existing strategies do you beat in slot X.”):

Colonel Blotto winning soldiers per field heat map.

This shows one pattern (in light blue) where players are following the pattern low-high-low-high-low with the second low being very precipitous. The patterns visible in green are more complicated, but still contain a recurring low-high theme.

I know next to nothing about autocorrelation and so do not have any bright ideas about how to use of this data to improve my score. The heat_map sequence (7 8 9 12 15 16 9 11 7 6) attempts to follow the light blue pattern and is ranked 296 as I write, just under half way down the rankings and something to be expected from following an average.

Once a major pattern has been deduced ‘extra’ soldiers will be needed to outnumber the other players most of the time. One option is to be willing to always lose in one field (e.g., the one usually requiring the most soldiers) and use the freed up resources to win elsewhere. I have only made one attempt using this strategy (the_well: 5 6 7 11 19 0 14 20 13 5, ranked 72 as I write).

It is possible that common patterns of behavior are created through reinforcement. Early players create an initial configuration of results and slightly later players attempt to find it, perhaps visualizing a pattern where none actually exists. If enough people follow the same trail they will create the pattern that others can will try to follow. Results from other Colonel Blotto games exhibit a pattern of behavior that is different from the one in the Cornell dataset (I have only seen what is publicly available)

Categories: Uncategorized Tags: