Archive for July, 2020

Surveys are fake research

July 26, 2020 No comments

For some time now, my default position has been that software engineering surveys, of the questionnaire kind, are fake research (surveys of a particular research field used to be worth reading, but not so often these days; that issue is for another post). Every now and again a non-fake survey paper pops up, but I don’t consider the cost of scanning all the fake stuff to be worth the benefit of finding the rare non-fake survey.

In theory, surveys could be interesting and worth reading about. Some of the things that often go wrong in practice include:

  • poorly thought out questions. Questions need to be specific and applicable to the target audience. General questions are good for starting a conversation, but analysis of the answers is a nightmare. Perhaps the questions are non-specific because the researcher is looking for direction: well, please don’t inflict your search for direction on the rest of us (a pointless plea in the fling-it-at-the-wall-to-see-if-it-sticks world of academic publishing).

    Questions that demonstrate how little the researcher knows about the topic serve no purpose. The purpose of a survey is to provide information of interest to those in the field, not as a means of educating a researcher about what they should already know,

  • little effort is invested in contacting a representative sample. Questionnaires tend to be sent to the people that the researcher has easy access to, i.e., a convenience sample. The quality of answers depends on the quality and quantity of those who replied. People who run surveys for a living put a lot of effort into targeting as many of the right people as possible,
  • sloppy and unimaginative analysis of the replies. I am so fed up with seeing an extensive analysis of the demographics of those who replied. Tables containing response break-down by age, sex, type of degree (who outside of academia cares about this) create a scientific veneer hiding the lack of any meaningful analysis of the issues that motivated the survey.

Although I have taken part in surveys in the past, these days I recommend that people ignore requests to take part in surveys. Your replies only encourage more fake research.

The aim of this post is to warn readers about the growing use of this form of fake research. I don’t expect anything I say to have any impact on the number of survey papers published.


Effort estimation’s inaccurate past and the way forward

July 19, 2020 3 comments

Almost since people started building software systems, effort estimation has been a hot topic for researchers.

Effort estimation models are necessarily driven by the available data (the Putnam model is one of few whose theory is based on more than arm waving). General information about source code can often be obtained (e.g., size in lines of code), and before package software and open source, software with roughly the same functionality was being implemented in lots of organizations.

Estimation models based on source code characteristics proliferated, e.g., COCOMO. What these models overlooked was human variability in implementing the same functionality (a standard deviation that is 25% of the actual size is going to introduce a lot of uncertainty into any effort estimate); this comes on top of their more obvious assumption that effort is closely tied to source code characteristics.
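
To get a feel for how much uncertainty a 25% standard deviation in size introduces, here is a minimal simulation sketch. The assumptions are mine, not taken from any dataset: implementation size is normally distributed around a nominal 10 KLOC, and effort follows the Basic COCOMO organic-mode formula (2.4 × KLOC^1.05 person-months); the function names are made up for illustration.

```python
import random
import statistics


def cocomo_organic_effort(kloc):
    """Basic COCOMO, organic mode: effort in person-months."""
    return 2.4 * kloc ** 1.05


def effort_spread(nominal_kloc, rel_sd=0.25, trials=10_000, seed=1):
    """Return the (5%, median, 95%) effort when size varies between implementers.

    Assumption: size is normally distributed with a standard deviation
    of rel_sd times the nominal size.
    """
    rng = random.Random(seed)
    efforts = []
    for _ in range(trials):
        # Size of one hypothetical implementation of the same functionality.
        kloc = rng.gauss(nominal_kloc, rel_sd * nominal_kloc)
        if kloc <= 0:  # discard impossible sizes (four SDs below the mean)
            continue
        efforts.append(cocomo_organic_effort(kloc))
    efforts.sort()
    n = len(efforts)
    return efforts[int(0.05 * n)], statistics.median(efforts), efforts[int(0.95 * n)]


low, mid, high = effort_spread(10.0)
print(f"5%: {low:.1f}  median: {mid:.1f}  95%: {high:.1f} person-months")
```

Under these assumptions, the 95th percentile of effort is more than twice the 5th percentile, and that is before any uncertainty in the model itself is taken into account.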

The advent of high-tech, clueless button-pushing machine learning created a resurgence of new effort estimation models; actually they are estimation adjustment models, because they require an initial estimate as one of the input variables. Creating a machine learned model requires a list of estimated/actual values, along with any other available information, to build a mapping function.
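
As an illustration of what an estimation adjustment model amounts to, the sketch below fits about the simplest possible mapping function, actual = a × estimate^b, by least squares on the logs of the values. This is a stand-in for whatever technique a given paper uses, not a reproduction of any published model, and the estimate/actual pairs are invented.

```python
import math


def fit_adjustment(estimates, actuals):
    """Least-squares fit of log(actual) = log(a) + b*log(estimate)."""
    xs = [math.log(e) for e in estimates]
    ys = [math.log(act) for act in actuals]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def adjust(estimate, a, b):
    """Map an initial estimate to an adjusted estimate."""
    return a * estimate ** b


# Invented (estimate, actual) pairs, in hours.
history = [(2, 3), (4, 5), (8, 12), (16, 20), (40, 60)]
a, b = fit_adjustment([est for est, _ in history],
                      [act for _, act in history])
print(f"actual ~= {a:.2f} * estimate^{b:.2f}")
print(f"adjusted value for a 10 hour estimate: {adjust(10, a, b):.1f}")
```

Note that the model cannot be used until somebody has made an initial estimate; all it does is correct for the systematic bias seen in past estimates.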

The sparseness of the data to learn from (at most a few hundred observations of half-a-dozen measured variables, and usually fewer) has not prevented a stream of puffed-up publications making all kinds of unfounded claims.

Until a few years ago the available public estimation data did not include any information about who made the estimate. Once estimation data contained the information needed to distinguish the different people making estimates, the uncertainty introduced by human variability was revealed (some consistently underestimating, others consistently overestimating, with 25% difference between two estimators being common, and a factor of two difference between some pairs of estimators).
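
Once the data identifies the estimator, per-person bias is straightforward to measure. A minimal sketch, assuming each record is an (estimator, estimate, actual) triple; the names and numbers are invented, and the median actual/estimate ratio is just one plausible summary of bias.

```python
from collections import defaultdict
from statistics import median


def estimator_bias(records):
    """Median actual/estimate ratio for each estimator.

    A ratio above 1 means that person tends to underestimate,
    a ratio below 1 means they tend to overestimate.
    """
    ratios = defaultdict(list)
    for who, estimate, actual in records:
        ratios[who].append(actual / estimate)
    return {who: median(values) for who, values in ratios.items()}


# Invented records: (estimator, estimate, actual), in hours.
records = [
    ("alice", 4, 5), ("alice", 8, 10), ("alice", 2, 3),
    ("bob", 4, 3), ("bob", 8, 7), ("bob", 10, 8),
]
bias = estimator_bias(records)
for who, ratio in sorted(bias.items()):
    print(f"{who}: median actual/estimate = {ratio:.2f}")
```

Dividing one estimator's ratio by another's gives the kind of pairwise difference mentioned above.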

How much accuracy is it realistic to expect with effort estimates?

At the moment we don’t have enough information on the software development process to be able to create a realistic model; without a realistic model of the development process, it’s a waste of time complaining about the availability of information to feed into a model.

I think a project simulation model is the only technique capable of creating a good enough model for use in industry; something like Abdel-Hamid’s tour de force PhD thesis (he also ignores my emails).

We are still in the early stages of finding out the components that need to be fitted together to build a model of software development, e.g., round numbers.

Even if all attempts to build such a model fail, there will be payback from a better understanding of the development process.

No replies to 135 research data requests: paper titles+author emails

July 12, 2020 No comments

I regularly email researchers referring to a paper of theirs I have read, and asking for a copy of the data to use as an example in my evidence-based software engineering book; of course their work is cited as the source.

Around a third of emails don’t receive any reply (a small number ask why they should spend time sorting out the data for me, and I wrote a post to answer this question). If there is no reply after roughly six months, I follow up with a reminder, saying that I am still interested in their data (maybe 15% respond). If the data looks really interesting, I might email again after 6-12 months (I have outstanding requests going back to 2013).

I put some effort into checking that a current email address is being used. Sometimes the work was done by somebody who has moved into industry, and if I cannot find what looks like a current address I might email their supervisor.

I have had replies to later emails, apologizing, saying that the first email was caught by their spam filter (the number of links in the email template was reduced to make it look less like spam). Sometimes the original email never percolated to the top of their todo list.

There were originally around 135 unreplied email requests (the data was automatically extracted from my email archive and is not perfect); the list of papers is below (the title is sometimes truncated because of the extraction process).

Given that I have collected around 620 software engineering datasets (there are several ways of counting a dataset), another 135 would make a noticeable difference. I suspect that much of the data is now lost, but even 10 more datasets would be nice to have.

After the following list of titles is a list of the authors’ 254 last known email addresses. If you know any of these people, please ask them to get in touch.

If you are an author of one of these papers: ideally send me the data, otherwise email to tell me the status of the data (I’m summarising responses, so others can get some idea of what to expect).

50 CVEs in 50 Days: Fuzzing Adobe Reader
A Change-Aware Per-File Analysis to Compile Configurable Systems
A Design Structure Matrix Approach for Measuring Co-Change-Modularity
A Foundation for the Accurate Prediction of the Soft Error
AGENT-BASED SIMULATION OF THE SOFTWARE DEVELOPMENT PROCESS: A CASE STUDY
A Large Scale Evaluation of Automated Unit Test Generation Using
A large-scale study of the time required to compromise
A Large-Scale Study On Repetitiveness, Containment, and
Analysing Humanly Generated Random Number Sequences: A Pattern-Based
Analysis of Software Aging in a Web Server
Analyzing and predicting effort associated with finding & fixing
Analyzing CAD competence with univariate and multivariate
Analyzing Differences in Risk Perceptions between Developers
Analyzing the Decision Criteria of Software Developers Based on
An analysis of the effect of environmental and systems complexity on
An Empirical Analysis of Software-as-a-Service Development
An Empirical Comparison of Forgetting Models
An empirical study of the textual similarity between
An error model for pointing based on Fitts' law
An Evolutionary Study of Linux Memory Management for Fun and Profit
An examination of some software development effort and
An Experimental Survey of Energy Management Across the Stack
Anomaly Trends for Missions to Mars: Mars Global Surveyor
A Quantitative Evaluation of the RAPL Power Control System
Are Information Security Professionals Expected Value Maximisers?:
A replicated and refined empirical study of the use of friends in
A Study of Repetitiveness of Code Changes in Software Evolution
A Study on the Interactive Effects among Software Project Duration, Risk
Bias in Proportion Judgments: The Cyclical Power Model
Capitalization of software development costs
Configuration-aware regression testing: an empirical study of sampling
Cost-Benefit Analysis of Technical Software Documentation
Decomposing the problem-size effect: A comparison of response
Determinants of vendor profitability in two contractual regimes:
Diagnosing organizational risks in software projects:
Early estimation of users’ perception of Software Quality
MEASURING USER’S PERCEPTION AND OPINION OF SOFTWARE QUALITY
Empirical Analysis of Factors Affecting Confirmation
Estimating Agile Software Project Effort: An Empirical Study
Estimating computer depreciation using online auction data
Estimation fulfillment in software development projects
Ethical considerations in internet code reuse: A
Evaluating Heuristics for Planning Effective and
Explaining Multisourcing Decisions in Application Outsourcing
Exploring defect correlations in a major Fortran numerical library
Extended Comprehensive Study of Association Measures for
Eye gaze reveals a fast, parallel extraction of the syntax of
Factorial design analysis applied to the performance of
Frequent Value Locality and Its Applications
Historical and Impact Analysis of API Breaking Changes:
How do i know whether to trust a research result?
How do OSS projects change in number and size?
How much is “about” ? Fuzzy interpretation of approximate
Humans have evolved specialized skills of
Identifying and Classifying Ambiguity for Regulatory Requirements
Identifying Technical Competences of IT Professionals. The Case of
Impact of Programming and Application-Specific Knowledge
Individual-Level Loss Aversion in Riskless and Risky Choices
Industry Shakeouts and Technological Change
Inherent Diversity in Replicated Architectures
Initial Coin Offerings and Agile Practices
Interpreting Gradable Adjectives in Context: Domain
Is Branch Coverage a Good Measure of Testing Effectiveness?
JavaScript Developer Survey Results
Knowledge Acquisition Activity in Software Development
Language matters
Learning from Evolution History to Predict Future Requirement Changes
Learning from Experience in Software Development:
Learning from Prior Experience: An Empirical Study of
Links Between the Personalities, Views and Attitudes of Software Engineers
Making root cause analysis feasible for large code bases:
Making-Sense of the Impact and Importance of Outliers in Project
Management Aspects of Software Clone Detection and Analysis
Managing knowledge sharing in distributed innovation from the
Many-Core Compiler Fuzzing
Measuring Agility
Mining for Computing Jobs
Mining the Archive of Formal Proofs.
Modeling Readability to Improve Unit Tests
Modeling the Occurrence of Defects and Change
Modelling and Evaluating Software Project Risks with Quantitative
Moore’s Law and the Semiconductor Industry: A Vintage Model
Motivations for self-assembling into project teams
Networks, social influence and the choice among competing innovations:
Nonliteral understanding of number words
Nonstationarity and the measurement of psychophysical response in
Occupations in Information Technology
On information systems project abandonment
On the Positive Effect of Reactive Programming on Software
ON THE USE OF REPLACEMENT MESSAGES IN API DEPRECATION:
On Vendor Preferences for Contract Types in Offshore Software Projects:
Peer Review on Open Source Software Projects:
Parameter-based refactoring and the relationship with fan-in/fan-out
Participation in Open Knowledge Communities and Job-Hopping:
Pipeline management for the acquisition of industrial projects
Predicting the Reliability of Mass-Market Software in the Marketplace
Prototyping A Process Monitoring Experiment
Quality vs risk: An investigation of their relationship in
Quantitative empirical trends in technical performance
Reported project management effort, project size, and contract type.
Reproducible Research in the Mathematical Sciences
Semantic Versioning versus Breaking Changes
Software Aging Analysis of the Linux Operating System
Software reliability as a function of user execution patterns
Software Start-up failure An exploratory study on the
Spatial estimation: a non-Bayesian alternative
System Life Expectancy and the Maintenance Effort: Exploring
Testing as an Investment
The enigma of evaluation: benefits, costs and risks of IT in
THE IMPACT OF PLANNING AND OTHER ORGANIZATIONAL FACTORS
The impact of size and volatility on IT project performance
The Influence of Size and Coverage on Test Suite
The Marginal Value of Increased Testing: An Empirical Analysis
The nature of the times to flight software failure during space missions
Theoretical and Practical Aspects of Programming Contest Ratings
The Performance of the N-Fold Requirement Inspection Method
The Reaction of Open-Source Projects to New Language Features:
The Role of Contracts on Quality and Returns to Quality in Offshore
The Stagnating Job Market for Young Scientists
Turnover of Information Technology Professionals:
Unconventional applications of compiler analysis
Unifying DVFS and offlining in mobile multicores
Use of Structural Equation Modeling to Empirically Study the Turnover
Use Two-Level Rejuvenation to Combat Software Aging and
Using Function Points in Agile Projects
Using Learning Curves to Mine Student Models
Virtual Integration for Improved System Design
Which reduces IT turnover intention the most: Workplace characteristics
Why Did Your Project Fail?
Within-Die Variation-Aware Dynamic-Voltage-Frequency

Author emails (automatically extracted and manually checked to remove people who have replied on other issues; I hope I have caught them all).

Aaron.Carroll@nicta.com.au   abaker@ucar.edu   abd_elzamly@yahoo.com
actjn@siu.edu   agopal@rhsmith.umd.edu   akbar.namin@ttu.edu
aken@nsuok.edu   akmassey@umbc.edu   alessandro.murgia@uantwerpen.be
alexander.budzier@sbs.ox.ac.uk   alinebrito@dcc.ufmg.br
Allen.P.Nikora@jpl.nasa.gov   Altaf.Ahmad@asu.edu   Ana.Aizcorbe@bea.gov
angel.garcia@uc3m.es   anhnt@iastate.edu   a.pinna@diee.unica.it
arho.suominen@vtt.fi   arie.vandeursen@tudelft.nl   asang@ntu.edu.sg
awfboh@ntu.edu.sg   bent.flyvbjerg@sbs.ox.ac.uk
bf@ul.ie   bjg@empiricalreality.com   bojan.spasic@avl.com
bramesh@gsu.edu    brent.martin@canterbury.ac.nz
briand@simula.no   brian.fitzgerald@lero.ie   bronevetsky1@llnl.gov
burairah@utem.edu.my   calikli@chalmers.se   canton@mnec.gr
cc05@vokac.org   celio.santana@gmail.com   cguo13@hawk.iit.edu
charngda@ccr.buffalo.edu   charngdalu@yahoo.com   chenyy@comp.nus.edu.sg
chris.sauer@sbs.ox.ac.uk   christian.korunka@univie.ac.at   christopher.lidbury10@imperial.ac.uk
clitecky@business.siu.edu   cmagee@mit.edu   corey.phelps@mcgill.ca
cotroneo@unina.it   cthompson@cs.berkeley.edu
daniela.munteanu@univ-provence.fr   daniel.milroy@colorado.edu   dan@silverthreadinc.com
david@merobe.com   david.nembhard@oregonstate.edu   der.herr@hofr.at
dgrtwo@princeton.edu   dhkim@astate.edu   director@scit.edu
discy@nus.edu.sg   djl68@pitt.edu   dlautner@hawk.iit.edu
dport@hawaii.edu   dprtchan@nus.edu.sg   dredman@avsi.aero
drobinson@stackoverflow.com   dskusumo.itt@gmail.com   dwheeler@ida.org
eherrman@eva.mpg.de   Enrique.Dans@ie.edu
ermira.daka@sheffield.ac.uk   etovar@fi.upm.es   fjshull@sei.cmu.edu
foreverheart9@gmail.com   founders@triplebyte.com   fschweitzer@ethz.ch
ghs2@psu.edu   gleison.brito@dcc.ufmg.br   glpkm@hotmail.com
gordon.fraser@uni-passau.de   greg@bronevetsky.com   gul.calikli@gu.se
guschroko@student.gu.se   hankhoffmann@cs.uchicago.edu   hannes.holm@foi.se
hata@is.naist.jp   hbarth@wesleyan.edu
hello@ponyfoo.com   hiroshi.igaki@oit.ac.jp   hirtle@pitt.edu
hoan@iastate.edu   hora@dcc.ufmg.br   hrideshg@iastate.edu
huang@umd.edu   huazhe@cs.uchicago.edu   hwu28@hawk.iit.edu
ichischneider@gmail.com   I.Deary@ed.ac.uk   ilaria.lunesu@diee.unica.it
info@targetprocess.com   james@jpallister.com   jarmo.ahonen@uef.fi
jasmin.blanchette@mpi-inf.mpg.de   jasonweiyi@gmail.com   javier.alonso@duke.edu
jean-luc.autran@univ-provence.fr   jfmendes@ua.pt   jgo@ua.pt
jianh@illinois.edu   jimbo@business.siu.edu   jmunson@uidaho.edu
jo-anne.lefevre@carleton.ca   john.krogstie@ntnu.no   john.zhang@business.uconn.edu
jordan.weissmann@slate.com   jose.campos@sheffield.ac.uk   josephborel@aol.com
jselby@maplesoft.com    June.Verner@gmail.com
junyang@engr.pitt.edu   justinek@alumni.stanford.edu   justin.hollands@drdc-rddc.gc.ca
j.visser@sig.eu   kaisa.still@vtt.fi   kantor@cs.technion.ac.il
kevin.mcdaid@dkit.ie   kewusi@lmu.edu
K.Markantonakis@rhul.ac.uk   konstantinos.chronis@gmail.com   ktrivedi@duke.edu
laertexavier@dcc.ufmg.br   larissanadja@copin.ufcg.edu.br   lcao@odu.edu
leo@susaventures.com   lionel.briand@uni.lu   lsarigia@pme.duth.gr
lucia.2009@smu.edu.sg   magnus@magnusdettmar.com   mail@kaidence.org
ma.khan@uleth.ca   manuel.oriol@ch.abb.com   marc.schulz@rwth-aachen.de
Marek@gryting.biz   marie-jeanne.lesot@lip6.fr   mariusz.musial@ericpol.com
maruyama@atr.jp   matthias.biggeleben@open-xchange.com   matthias.stuermer@iwi.unibe.ch
mcknight@bus.msu.edu   mdettmar@deloitte.com   mdettmar@deloitte.se
melanie@cs.columbia.edu   Michael.english@lero.ie   michael.english@ul.ie
michael.grottke@fau.de   Michael.Grottke@wiso.uni-erlangen.de   Michael@targetprocess.com   mingshu@iscas.ac.cn
mischael.schill@inf.ethz.ch   misof@ksp.sk   mjaber@ryerson.ca
monica.pais@ifgoiano.edu.br   monicaspais@gmail.com
mschermann@scu.edu   mtov@dcc.ufmg.br   mzhu@ets.org
ncerpa@utalca.cl   Neil.Stewart@warwick.ac.uk   Nelson.W.Green@jpl.nasa.gov
nick.wells@jobstats.co.uk   o.alexy@tum.de   oliver.krancher@iwi.unibe.ch
Oliver.Laitenberger@horn-company.de   olivier.gendreau@polymtl.ca   paula.j.savolainen@uef.fi
paulmcb@seas.upenn.edu   paul@strassmann.com   pchatzog@pme.duth.gr
perry@mail.utexas.edu   philippe.roche@st.com   phoonakker@cqpi.engr.wisc.edu
pierre.robillard@polymtl.ca   ploaiza@lsm.in2p3.fr   P.Love@curtin.edu.au
pokech@uonbi.ac.ke   psidhu@cmu.edu
pyzychen@gmail.com   ren@iit.edu   rh13@aub.edu.lb
ricardo.colomo@uc3m.es   rkiyer@illinois.edu   robert.benkoczi@uleth.ca
roberto.natella@unina.it   roberto.pietrantuono@unina.it   salvaneschi@cs.tu-darmstadt.de
saurabh.dighe@intel.com   sdorogov@ua.pt   sebastien.lefort@lip6.fr
sebastien.sauze@l2mp.fr   shaji@scit.edu   shilin@itechs.iscas.ac.cn
show@um.edu.my   siegfrie@adelphi.edu   simona.ibba@diee.unica.it
simon.gaechter@nottingham.ac.uk   simonk@rpi.edu   simvrh@gmail.com
sl@monochromata.de   soenke.albers@the-klu.org   songxue@microsoft.com
s.raemaekers@sig.eu   sriram.vangal@intel.com   ssg@engr.uconn.edu
stavrino@eap.gr   stavrino@gmail.com   stefan@garage-coding.com
sterusso@unina.it   steve.a.shogren@gmail.com   svkbharathi@scit.edu
swilson@tcd.ie   tamada@cse.kyoto-su.ac.jp   tien@iastate.edu
tien.n.nguyen@utdallas.edu   tjleffel@gmail.com   tkabdelh@nps.edu
tsunoda@info.kindai.ac.jp   tung@iastate.edu   victoria@stodden.net
wangyi@us.ibm.com   William.L.Taber@jpl.nasa.gov   wmhan@takming.edu.tw
wobbrock@uw.edu   wq@itechs.iscas.ac.cn   xenos@eap.gr
xhua@hawk.iit.edu   xiao.qu@us.abb.com   yanglusi@comp.nus.edu.sg
ychen200@cba.ua.edu   yi.wang@rit.edu   yiw@ics.uci.edu
yoaval@checkpoint.com   zhangx@nku.edu   zhij@cs.toronto.edu
Zhongju.Zhang@asu.edu   zibran@cs.uno.edu

Update:

Have received a response relating to 6 papers (corresponding paper/author entries in above list deleted).


Algorithms are now commodities

July 5, 2020 10 comments

When I first started writing software, developers had to implement most of the algorithms they used; yes, hardware vendors provided libraries, but the culture was one of self-reliance (except for maths functions, which were technical and complicated).

Developers read Donald Knuth’s The Art of Computer Programming; it was the reliable source for step-by-step algorithms. I vividly remember seeing a library copy of one volume, where somebody had carefully hand-written, in very tiny letters, an update to one algorithm, and glued it to the page over the previous text.

Algorithms were important because computers were not yet fast enough to solve common problems at an acceptable rate; developers knew the time taken to execute common instructions, and instruction timings were a topic of social chit-chat amongst developers (along with the number of registers available on a given cpu). Memory capacity was often measured in kilobytes, and every byte counted.

This was the age of the algorithm.

Open source commoditized algorithms, and computers got a lot faster with memory measured in megabytes and then gigabytes.

When it comes to algorithm implementation, developers are now spoilt for choice; why waste time implementing the ‘low’ level stuff when there are plenty of other problems waiting to be solved?

Algorithms are now like the bolts in a bridge: very important, but nobody talks about them. Today developers talk about story points, features, business logic, etc. Given a well-defined problem, many are now likely to search for an existing package, rather than write code from scratch (I certainly work this way).

New algorithms are still being invented, and researchers continue to look for improvements to existing algorithms. This is a niche activity.

There are companies where algorithms are not commodities. Google operates on a scale where what appear to others as small improvements can save the company millions (purely because a small percentage of a huge amount can be a lot). Some companies’ core competency may include an algorithmic component (whose non-commodity nature gives the company its edge over the competition), with their non-core activities treating algorithms as a commodity.

Knuth’s The Art of Computer Programming played an important role in making viable algorithms generally available; while the volumes are frequently cited, I suspect they are rarely read (I have not taken any of my three volumes off the shelf, to read, for years).

A few years ago, I suddenly realised that I was working on a book about software engineering that not only did not contain an algorithms chapter, but in which the 103 uses of the word algorithm all refer to it as a concept.

Today, we are in the age of the ecosystem.

Algorithms have not yet completed their journey to obscurity, which has to wait until people can tell computers what they want and not be concerned about the implementation details (or genetic algorithm programming gets a lot better).
