Increase your citation count, send me your data!
I regularly email people asking for a copy of the data used in a paper they wrote. In around 32% of cases I don’t get any reply, around 12% promise to send the data when they are less busy (a few say they are just too busy) and every now and again people ask why I want the data.
After 6-12 months, I email again saying that I am still interested in their data; a few have replied with apologies and the data.
I need a new strategy to motivate people to spend some time tracking down their data and sending it to me; there are now over 200 data-sets possibly lost forever!
I think those motivated by the greater good will have already responded. It is time to appeal to baser instincts, e.g., self-interest. The currency of academic life is paper citations, which translate into status in the community, which translate into greater likelihood of grant proposals being accepted (i.e., money to do what they want to do).
Sending data gets researchers one citation in my book (I am ruthless about not citing a paper if I don’t get any data).
My current argument is that once their data is publicly available (and advertised in my book) lots of other researchers will use it and more citation to their work will follow; they also get an exclusive, I only use one data-set for each topic (actually data is hard to get hold of, so the exclusivity offer is spin).
To back up my advertising claims, I point out that influential people are writing about my book and it’s all over social media. If you want me to add you to the list of influential people, send me a link to what you have written (I have no shame).
If you write about my book, please talk about the data and that researchers who make their data public are the only ones who deserve to fund and may citations rain down on them.
That is the carrot approach, how can I apply some stick to motivate people?
I could point out that if they don’t send me their data their work is doomed to obscurity, because I will use somebody else’s (skipping over the minor detail of data being hard to find). Research has found that people are less willing to share their data if the strength of the evidence is weak; calling out somebody like that is do-or-die.
If you write about my book, please talk about the data and point out that researchers who don’t make their data public have something to hide and should not be funded.
Since the start of 2017, researchers in the UK receiving government research grants are required to archive their data and make it available. This is good for future researchers, but not a lot of use for me now.
What do readers think? Ideas and suggestions welcome.
Whole-program optimization: there’s gold in them hills
Information is the life-blood of compiler optimization and compiler writers are always saying “If only this-or-that were known, we could do this optimization.”
Burrowing down the knowing this-or-that rabbit-hole leads to requiring that code generation be delayed until link-time, because it is only at link-time that all the necessary information is available.
Whole program optimization, as it is known, has been a reality in the major desktop compilers for several years, in part because computers having 64G+ of memory have become generally available (compiler optimization has always been limited by the amount of memory available in developer computers). By reality, I mean compilers support whole-program-optimization flags and do some optimizations that require holding a representation of the entire program in memory (even Microsoft, not a company known for leading edge compiler products {although they have leading edge compiler people in their research group} supports an option having this name).
It is still early days for optimizations that are “whole-program”, rather like the early days of code optimization (when things like loop unrolling were leading edge and even constant folding was not common).
An annoying developer characteristic is writing code that calls a function every five statements, on average. Calling a function means that all those potentially reusable values that have been loaded into registers, before the call, cannot be safely used after the call (figuring out whether it is worth saving/restoring around the call is hard; yes developers, it’s all your fault that us compiler writers have such a tough job :-P).
Solutions to the function call problem include inlining and flow-analysis to figure out the consequences of the call. However, the called function calls other functions, which in-turn burrow further down the rabbit hole.
With whole-program optimization, all the code is available for analysis; given enough memory and processor time, lots of useful information can be figured out. Most functions are only called once, so there are lots of savings to be had from using known parameter values (many are numeric literals) to figure out whether an if-statement is necessary (i.e., is dead code) and how many times loops iterate.
More fully applying known techniques is the obvious easy use-case for whole-program optimization, but the savings probably won’t be that big. What about new techniques that previously could not even be attempted?
For instance, walking data structures until some condition is met is a common operation. Loading the field/member being tested and the next
/prev
field, results in one or more cache lines being filled (on the assumption that values adjacent in storage are likely to be accessed in the immediate future). However, data structures often contain many fields, only a few of which need to be accessed during the search process, when the next value needed is in another instance of the struct
/record
/class
it is unlikely to already be available in the loaded cache line. One useful optimization is to split the data structure into two structures, one holding the fields accessed during the iterative search and the other holding everything else. This data-remapping means that cache lines are more likely to contain the next value accessed approaches increases the likelihood that cache lines will hold values needed in the near future; the compiler looks after the details. Performance gains of 27% have been reported
One study of C++ found that on average 12% of members were dead, i.e., never accessed in the code. Removing these saved 4.4% of storage, but again the potential performance gain comes from improve the cache hit ratio.
The row/column storage layout of arrays is not cache friendly, using Morton-order layout can have a big performance impact.
There are really really big savings to be had by providing compilers with a means of controlling the processor’s caches, e.g., instructions to load and flush cache lines. At the moment researchers are limited to simulations show that substantial power savings+some performance gain are possible.
Go forth and think “whole-program”.
How indeterminate is an indeterminate value?
One of the unwritten design aims of the C Standard is that it should be possible to fully implement the C Standard library in conforming C. It turned out that this was not possible in C90; the problem was implementing the memcpy
function when the object being copied was an object having a struct
type containing one or more padding bytes. The memcpy
library function copies the bytes in one object to another object. The padding bytes might be uninitialized (they have an indeterminate value), which means accessing them is undefined behavior (in C90), i.e., use of memcpy
for copying structs containing padding results in a non-conforming program.
struct { char c; // Occupies 1 byte // Possible padding bytes here int i; // A 2/4-byte int sometimes has to be aligned on a 2/4-byte storage boundary }; |
Padding bytes could be set to a known value by, for instance, using memcpy
to zero the storage; requiring this usage was thought to be excessive, and a hefty chunk of new words was added in C99 (some of the issues raised by this problem also cropped up elsewhere, which contributed to the will to do this).
One consequence of the new wording is that objects having type unsigned char
are special in that while their uninitialized value is still indeterminate, the possible set of values excludes a trap representation, they have an unspecified value making accesses unspecified behavior (which conforming programs can contain). The uninitialized value of objects having other types can be a trap representation; it’s the possibility of a value being a trap representation that makes accessing such uninitialized objects undefined behavior.
All well and good, memcpy
can now be implemented in conforming C(99) by copying unsigned chars.
Having made it possible for a conforming program to access an uninitialized object (having type unsigned char
), questions about it actual value can be asked. Its value is indeterminate you say, the clue is in the term indeterminate value. Ok, what does the following value function return?
unsigned char random(void) { unsigned char x; return x ^ x; } |
Exclusiving-oring a value with itself always produces zero. An unsigned char
taking, say, values 0 to 255, pick one and you always get zero; case closed. But where does it say that an indeterminate value is always the same value? There is no wording preventing an indeterminate value being different every time it is accessed. The sound of people not breathing could be heard when this was pointed out to WG14 (the C Standard’s committee), followed by furious argument on one side or the other.
The following illustrates one situation where the value of padding bytes could change with every access. That volatile
qualifier specifies that the value of c
could change between two accesses (e.g., it represents the storage layout of some memory mapped I/O device). Perhaps any padding bytes following it are also effectively volatile-qualified.
struct { volatile char c; // A changeable 1 byte // Possible padding bytes may be volatile int i; // No volatility here }; |
The local object x
, above, is not associated with a volatile-qualified object. But, so what? Another unwritten design aim of the C Standard is to keep the wording simple, so edge cases are not called out and the behavior intended to handle padding bytes gets applied to local unsigned chars.
A compiler could decide that calls to random
always return zero, based on the assumption that while indeterminate values may not be known, they are not time varying.
Experiment, replicate, replicate, replicate,…
Popular science writing often talks about how one experiment proved this-or-that theory or disproved ‘existing theories’. In practice, it takes a lot more than one experiment before people are willing to accept a new theory or drop an existing theory. Many, many experiments are involved, but things need to be simplified for a popular audience and so one experiment emerges to represent the pivotal moment.
The idea of one experiment being enough to validate a theory has seeped into the world view of software engineering (and perhaps other domains as well). This thinking is evident in articles where one experiment is cited as proof for this-or-that and I am regularly asked what recommendations can be extracted from the results discussed in my empirical software book (which contains very few replications, because they rarely exist). This is a very wrong.
A statistically significant experimental result is a positive signal that the measured behavior might be real. The space of possible experiments is vast and any signal that narrows the search space is very welcome. Multiple replication, by others and with variations on the experimental conditions (to gain an understanding of limits/boundaries), are needed first to provide confidence the behavior is repeatable and then to provide data for building practical models.
Psychology is currently going through a replication crisis. The incentive structure for researchers is not to replicate and for journals not to publish replications. The Reproducibility Project is doing some amazing work.
Software engineering has had an experiment problem for decades (the problem is lack of experiments), but this is slowly starting to change. A replication problem is in the future.
Single experiments do have uses other than helping to build a case for a theory. They can be useful in ruling out proposed theories; results that are very different from those predicted can require ideas to be substantially modified or thrown away.
In the short term (i.e., at least the next five years) the benefit of experiments is in ruling out possibilities, as well as providing general pointers to the possible shape of things. Theories backed by substantial replications are many years away.
Unappreciated bubble research
Every now and again an academic journal dedicates a single issue to one topic. I laughed when I saw the topic of an upcoming special issue on “Enhancing Credibility of Empirical Software Engineering”.
If you work in industry, you probably have a completely different interpretation of the intent of this issue, compared to somebody working in academia, i.e., you think the topic is about getting academic researchers to work on stuff of interest to industry. In academia the issue is about getting industry to treat the research work being done in universities as relevant to their needs, i.e., industry just does not appreciate how useful the work being done in universities is to solving real world problems.
Yes fellow industrialists, the credibility problem is all down to us not appreciating the work of those hard-working academics (I was once at a university meeting and the Dean referred to the industrialists at the meeting, which confused me because I did not know any were present; sometime later the penny dropped and I realised he was talking abut me and another guy who was working in industry).
The real problem is that most research academics have little idea what goes on in industry and what research results might be of interest to industry. This is not surprising given that the academic career ladder keeps people within the confines of the university bubble.
I regularly have academics express surprise that somebody in industry, i.e., me, knows about this-that-or-the-other. This baffled me for a while, until I realised that many academics really do regard people working in industry as simpletons; I now reply that its because I paid more for my degree and did not have the usual labotomy before graduating. Now they are baffled.
The solution to the problem of industrial research relevance is for academics to be willing to move outside the university bubble, to go out and interact with people in industry. However, there are powerful incentives pushing academics away from talking to industry:
- academic performance is measured by papers published and the chances of getting a paper published are improved if it involves a fashionable topic (yes fellow industrialists, academics suffer from this problem too). Stuff that industry is interested in is not fashionable, at least not yet. I don’t see many researchers being willing to risk working on very unfashionable topics in the hope that their work might get published,
- contact with industry will open the eyes of many academics to the interesting work being done there and the much higher paying jobs available (at least for those who are any good). Heads’ of department don’t want to lose their good people and have every incentive to discourage researchers having any contact with industry. The senior staff are sufficiently embedded in the system that they can be trusted to talk to industry, rather like senior communist party members being allowed to visit the West during the cold war.
An alternative way for academic research to connect with industry is for the research to be done by people with a lot of industry experience. There are a surprising number of people working in industry who are bored and are contemplating doing a PhD for something interesting to do (e.g., a public proclamation).
Again there are powerful incentives pushing against industry contact. PhD students do the academic grunt work and so compliant people are needed, i.e., recent graduates who will accept that this is how things work, not independent people who know better (such as those with a decent amount of industry experience). Worries about industrialists not being willing to tow-the-line with respect to departmental thinking are probably groundless, plenty of this sort of thing goes on in industry.
I found out at the weekend that only one central London university offers a computing related part-time PhD program (Birkbeck; few people can afford to a significant drop in income); part-time students are not around to do the grunt work.
Recent Comments