GDPR has a huge impact on empirical software engineering research
The EU’s General Data Protection Regulation (GDPR) is going to have a huge impact on empirical software engineering research. After 25 May 2018, analyzing source code will never be the same again.
I am not a lawyer and nothing qualifies me to talk about the GDPR.
People put their names in source code, bug tracking databases and discussion forums; this is personally identifiable information.
Researchers use personal names to obtain information about a wide variety of activities, e.g., how much code individuals wrote, how many bug reports they processed, and what they contributed to discussions of one sort or another.
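As a minimal sketch of the kind of analysis meant here, the following counts commits per person. It assumes the input was produced by something like git log --format='%an|%ae' (author name and author email, one commit per line); the function name and the sample names are hypothetical. The point to notice is that the name and email address are the keys the whole analysis hangs on.

```python
from collections import Counter


def commits_per_author(log_text):
    """Count commits per (name, email) pair.

    Assumes one commit per line, in the form "name|email".
    """
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last "|" so a name containing "|" still parses.
        name, _, email = line.rpartition("|")
        counts[(name.strip(), email.strip())] += 1
    return counts


# Made-up sample data, standing in for real git log output.
sample_log = """\
Ada Example|ada@example.org
Bob Example|bob@example.org
Ada Example|ada@example.org
"""

counts = commits_per_author(sample_log)
for (name, email), n in counts.most_common():
    print(f"{n:3d}  {name} <{email}>")
```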
Open source licenses give others all kinds of rights (e.g., ability to use and modify source code), but they do not contain any provisions for processing personal data.
Adding an “I hereby give permission for anybody to process information about my name in any way they see fit.” clause to licenses is not going to help.
The GDPR requires (article 5: Principles relating to processing of personal data):
“Personal data shall be: … collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes;”
That is, personal data can only be processed for the specific reason it was collected. If you come up with another bright idea for analyzing data that has just been collected, it may be necessary to obtain consent from the people whose personal data it is before trying out the bright idea.
It is not possible to obtain blanket permission (article 6, Lawfulness of processing):
“…the data subject has given consent to the processing of his or her personal data for one or more specific purposes;”, i.e., consent has to be obtained from the data subject for each specific purpose.
GitHub’s Global Privacy Practices shows that GitHub are intent on meeting the GDPR requirements; they include: “GitHub provides clear methods of unambiguous, informed consent at the time of data collection, when we do collect your personal data.” Processing personal information about a person in the EU, contained in source code, appears to be a violation of GitHub’s terms of service.
The GDPR has many other requirements, e.g., the right to find out what information is held and the right to be forgotten. But the upfront killer is not being able to cheaply collect lots of code and then use personal information to help with the analysis.
There are exceptions for processing for archiving, scientific or historical research, or statistical purposes. Can somebody who blogs and is writing a book claim to be doing scientific research? People who know more about these exceptions than I do tell me that there could be a fair amount of paperwork involved in making use of the exception, i.e., being able to show that privacy safeguards are in place.
Then, there is the issue of what constitutes personal information. The hash that identifies a Git commit is computed over the commit object, which includes the names and email addresses of the author and the committer. Is a Git hash personally identifiable information?
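To make the hashing point concrete, here is a rough sketch of how a commit identifier is derived, assuming a repository using Git's SHA-1 object format: the identifier is the SHA-1 of a short header followed by the commit object's text, and that text contains the author and committer lines. The function name, the sample identities and the timestamp are made up for illustration; the tree value is the well-known identifier of Git's empty tree.

```python
import hashlib

# Identifier of Git's empty tree (SHA-1 object format).
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_id(tree, author, committer, message, parents=()):
    """Compute the SHA-1 identifier of a commit object.

    author and committer are "Name <email> unix-timestamp timezone" strings,
    the same layout Git stores inside the commit object.
    """
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    # Git hashes "commit <size>\0" followed by the object's bytes.
    header = f"commit {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


# Two commits identical in every respect except who made them.
# The timestamp is arbitrary.
ada = "Ada Example <ada@example.org> 1527206400 +0000"
bob = "Bob Example <bob@example.org> 1527206400 +0000"

id_a = commit_id(EMPTY_TREE, ada, ada, "Initial commit\n")
id_b = commit_id(EMPTY_TREE, bob, bob, "Initial commit\n")

print(id_a)
print(id_b)
print("same id:", id_a == id_b)
```

Changing only the name and email address changes the identifier, so the hash depends on personal data, although the name cannot be read back out of the hash itself.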
A good introduction to the GDPR for developers, and one for researchers.
This blog is a fine example of common GDPR myths. Had the author continued to read article 6(1) of the GDPR, he would have found that there are other grounds for processing of personal data than consent. No less than five, even, of which the legitimate interest of the data processor applies rather well to the examples given by him. Which kind of takes away the whole point of this blog.
Only a problem if you have children submitting code:
‘processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.’
What *is* a problem is this silly form with name and e-mail, sent over HTTP. See you in a month 😉
@Walter van Holst
Ok, let’s go through the five other grounds:
“b) processing is necessary for the performance of a contract to which the data subject is party …”
Getting someone’s agreement to use their personal data is hard enough; entering into a contract with them is likely to be a lot more work.
“c) processing is necessary for compliance with a legal obligation to which the controller is subject;”
Can somebody please place a legal obligation on me that requires me to research software engineering data?
“d) processing is necessary in order to protect the vital interests of the data subject or of another natural person;”
Can I claim it’s in my vital interests to research software engineering data?
“e) processing is necessary for the performance of a task carried out in the public interest…”
Yes! My work is definitely in the public interest! Sounds like the scientific research exemption, i.e., lots of paperwork, but doable.
“f) processing is necessary for the purposes of the legitimate interests pursued by the controller or by a third party,…”
This does not sound like it applies here.
Oh my. The six legal bases for processing do not all apply to the data subject. Consent is from the data subject. Legitimate interest and contracts are not with the data subject, nor are the others. Please grab the 77p book by Alan Calder.
A Privacy Engineer