Cloning research needs a new mantra
The obvious answer to software engineering researchers who ask why their findings are not applied within industry is that their findings provide no benefits to industry. Anyone who digs into the published research finds that, in fact, there is lots of potentially useful stuff in there; the problem is that researchers often take too narrow a perspective.
A good example of a research area that is generally ignored by industry, but has potential for widespread benefits, is software cloning; that is, chunks of source code that are duplicated within the same application (a chunk may be as little as five lines or much more, and the definition of duplicate ranges from exactly the same character sequence, through semantic equivalence, to a looser match where a certain percentage of lines are the same, with various definitions of ‘same’). (This is not about duplication of code in multiple versions of the same product; we all know how nasty that can be to maintain.)
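To make those definitions concrete, here is a minimal sketch of an exact-match clone finder, written in Python; the five-line window, the function names, and the character-for-character matching are all my own assumptions, and real tools normalise identifiers and whitespace and tolerate partial matches, which this toy does not.

    # Toy exact-match clone finder: report any 5-line window of source
    # that appears in more than one place.
    import sys
    from collections import defaultdict

    WINDOW = 5  # minimum clone size, in lines (an arbitrary choice)

    def windows(path):
        with open(path) as f:
            lines = [ln.strip() for ln in f]
        for i in range(len(lines) - WINDOW + 1):
            # Key on the exact character sequence of the window; looser
            # definitions of 'duplicate' would normalise names here.
            yield i + 1, "\n".join(lines[i:i + WINDOW])

    def find_clones(paths):
        seen = defaultdict(list)  # window text -> [(file, line), ...]
        for path in paths:
            for line_no, text in windows(path):
                seen[text].append((path, line_no))
        return {t: locs for t, locs in seen.items() if len(locs) > 1}

    if __name__ == "__main__":
        for text, locs in find_clones(sys.argv[1:]).items():
            print("clone at:", ", ".join(f"{p}:{n}" for p, n in locs))

Note that a duplicated region longer than five lines shows up as several overlapping windows; production tools merge these into a single maximal clone.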
Researchers regard cloning as bad, while I suspect many developers are neutral on the subject or even in favor of creating and using duplicate code.
Clone research will be ignored by industry while researchers continue to push the mantra “clones are bad”. It just does not gel with industry’s view.
Developers are under pressure to deliver working software; if they can save time by (legally) making use of existing code, then there is an immediate benefit to them and their employer. The researchers’ argument is that clones increase maintenance costs (a fault being fixed in one of the duplicates, but not the other(s), is often cited as the killer case for all clones being bad). What developers know is that most code is never maintained (e.g., it is rewritten, never used again, or works fine and does not need to be changed).
Do companies that own software care about it containing clones? They are generally more interested in meeting deadlines and being first to market. If a product is a success, it will be worth paying its maintenance costs; why risk spending extra time/money on creating a beautifully written product when most products don’t sell well enough to be worth maintaining? If the software is bespoke, for in-house use or for a client, then increased maintenance costs are good for those involved in writing the software (i.e., they get paid to maintain it).
The new clone research mantra should be that clones have benefits and costs, and the research results help increase benefits and decrease costs. How does this increase/decrease work? You’re the researchers, you tell me.
My own experience with clones is that they do sometimes multiply costs (i.e., the same work has to be done more than once), but overall their creation and use is very cost effective; fault fixes ‘missed’ in a duplicate are a small subset of clone usage.
I have heard of projects where there has been rampant copying, plus minor modification, of code within the project. If such projects fail then the issue is one of project management and control, with cloning being one of the consequences.
The number of clones usually found in a large software system is surprisingly high. If you want to check out the clones in your own code, CCFinder is well worth a look. The most common use for such tools is plagiarism detection.
Due to management cluelessness at my work, we have “code reuse” engrained in our system. Here’s a concept the researchers aren’t considering: not all “fixes” actually fix things. So fixing that shared function will end up breaking systems that have been working fine previously.
Actually I’ve been pushing for the concept of “clone and own.” Copy an entire project, start working with it, and don’t give a crap what people do to the original! (Clone and own is popular in the aerospace industry.)
I think there’s a fine line, and only an expert human can judge it, which is why we are not going to run out of jobs for humans any time soon.
If you want to avoid cloning, you need to build libraries. If you have two projects which share one library, you now have three separate maintenance domains to work with… and that can be really good if you also happen to have three developers, and three logically separate concepts that should rightly be separately maintained. However, it can also be really bad if you find the one library doing too many jobs and trying to be all things to all people. So fundamentally we always get back to the question of correct subdivision of a project into components, which in turn comes down to genuine understanding of the problem you are solving.
I would guess that statistical code analysis can probably give you a comparative estimate of how well the design was done, based on how much cloning exists… but this is merely a symptom of bad design, not a cause. As Derek points out, beanhead managers will attempt to game the stats by addressing the symptom, which won’t help at all to fix a bad design.
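As an illustration of the kind of symptom-level statistic being suggested, the sketch below (Python; the window size is an arbitrary assumption on my part) computes the fraction of five-line windows that occur more than once, a number that might be compared across projects but says nothing about whether the duplication is a design problem.

    # Rough 'clone ratio': the fraction of 5-line windows that occur
    # more than once across a project. Comparative use only; it
    # measures a symptom, not the quality of the design.
    import sys
    from collections import Counter

    WINDOW = 5  # arbitrary window size

    def clone_ratio(paths):
        counts = Counter()
        for path in paths:
            with open(path) as f:
                lines = [ln.strip() for ln in f]
            for i in range(len(lines) - WINDOW + 1):
                counts["\n".join(lines[i:i + WINDOW])] += 1
        total = sum(counts.values())
        duplicated = sum(c for c in counts.values() if c > 1)
        return duplicated / total if total else 0.0

    if __name__ == "__main__":
        print(f"clone ratio: {clone_ratio(sys.argv[1:]):.1%}")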
Finally I’ll point out that if you use “git clone” then you can very easily track changes in the original and merge in the ones you like. Even better if you have a bug tracking system, with keywords, etc.
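For what it’s worth, that workflow maps directly onto standard git commands; the repository URLs, project name, and branch name below are placeholders:

    # Take your own copy of the project (clone and own).
    git clone https://example.com/original.git myproject
    cd myproject

    # Point 'origin' at your own repository and keep the original
    # available as 'upstream' for tracking its changes.
    git remote rename origin upstream
    git remote add origin https://example.com/mycopy.git

    # Later: review what has changed upstream, merge only what you like.
    git fetch upstream
    git log --oneline upstream/master
    git cherry-pick <commit>    # take an individual fix
    git merge upstream/master   # or take everything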