Archive

Posts Tagged ‘social network’

GitHub, the logical next purchase for Microsoft after LinkedIn

June 14, 2016 No comments

Active directory sits at the center of the Microsoft corporate product line, providing identity related services. All the Microsoft services use it to figure out what users can and cannot do. Ensuring it was possible for third parties to implement Active Directory was on the must-do task list during the compliance phase of the EU/Microsoft competition case.

With the LinkedIn purchase Microsoft has acquired the base from which to implement Active Directory for companies operating in the cloud; the identity service used by companies to monitor employees looks like it will continue to be owned and operated by Microsoft.

People have been known to be inventive when writing their cvs and LinkedIn profiles are unlikely to be any different. How do Microsoft introduce some quality control to the claims appearing in profiles?

One way of verifying the claims made by software developers, about what projects they have worked on and how much they contributed, is to look at the code they have written. If Microsoft owned GitHub they would be in a position to do just that, starting with its 14 million users.

What you know a developers employment history and the code they wrote, all sorts of services can be offered. Companies are rightly concerned about the intellectual property of the software they produce and use. Microsoft would be in a position to provide a code IP checking service, the code produced by company employees could be compared against the code they wrote at previous companies to produce a similarity value. This checking option can be sold to all companies in a developer’s work history; companies want to know that when an employee leaves they don’t take anything with them.

Technical details. You would be correct to point out that quantity of code written is not a sensible predictor of developer productivity. Not a problem when selling to management, there are enough academic studies associating quantity with productivity to make management believe otherwise.

Microsoft don’t actually need to see the source code to perform a lot of similarity checks. Many clone detection tools work by comparing hashes of small sequences of particular code features; comparing constructs.

Prioritizing project stakeholders using social network metrics

April 21, 2013 No comments

Identifying project stakeholders and their requirements is a very important factor in the success of any project. Existing techniques tend to be very ad-hoc. In her PhD thesis Soo Ling Lim came up with a very interesting solution using social network analysis and what is more made her raw data available for download 🙂

I have analysed some of Soo Ling’s data below as another draft section from my book Empirical software engineering with R. As always comments and pointers to more data welcome. R code and data here.

A more detailed discussion and analysis is available in Soo Ling Lim’s thesis, which is very readable. Thanks to Soo Ling for answering my questions about her work.

Stakeholder roles and individuals

A stakeholder is a person who has an interest in what an application does. In a well organised development project the influential stakeholders are consulted before any contracts or budgets are agreed. Failure to identify the important stakeholders can result in missing or poorly prioritized requirements which can have a significant impact on the successful outcome of a project.

While many people might consider themselves to be stakeholders whose opinions should be considered, in practice the following groups are the most likely to have their opinions taken into account:

  • people having an influence on project funding,
  • customers, i.e., those people who are willing pay to use or obtain a copy of the application,
  • domain experts, i.e., people with experience in the subject area who might suggest better ways to do something or problems to try and avoid,
  • people who have influence over the success or failure over the actual implementation effort, e.g., software developers and business policy makers,
  • end-users of the application (who on large projects are often far removed from those paying for it).

In the case of volunteer open source projects the only people having any influence are those willing to do the work. On open source projects made up of paid contributors and volunteers the situation is likely to be complicated.

Individuals have influence via the roles they have within the domain addressed by an application. For instance, the specification of a security card access system is of interest to the role of ‘being in charge of the library’ because the person holding that role needs to control access to various facilities provided within different parts of the library, while the role of ‘student representative’ might be interested in the privacy aspects of the information held by the application and the role of ‘criminal’ has an interest in how easy it is to circumvent the access control measures.

If an application is used by large numbers of people there are likely to be many stakeholders and roles, identifying all these and prioritizing them has, from experience, been found to be time consuming and difficult. Once stakeholders have been identified they then need to be persuaded to invest time learning about the proposed application and to provide their own views.


The RALIC study

A study by Lim <book Lim_10> was based on a University College London (UCL) project to combine different access control mechanisms into one, such as access to the library and fitness centre. The Replacement Access, Library and ID Card (RALIC) project had more than 60 stakeholders and 30,000 users, and has been deployed at UCL since 2007, two years before the study started. Lim created the StakeNet project with the aim of to identifying and prioritising stakeholders.

Because the RALIC project had been completed Lim had access to complete project documentation from start to finish. This documentation, along with interviews of those involved, were used to create what Lim called the Ground truth of project stakeholder role priority, stakeholder identification (85 people) and their rank within a role, requirements and their relative priorities; to quote Lim ‘The ground truth is the complete and prioritised list of stakeholders and requirements for the project obtained by analysing the stakeholders and requirements from the start of the project until after the system is deployed.’

The term salience is used to denote the level of a stakeholder’s influence.


Data

The data consists of three stakeholder related lists created as follows (all names have been made anonymous):

  • the Ground truth list: derived from existing RALIC documentation. The following is an extract from this list (individual are ranked within each stakeholder role):
Role Rank,      Stakeholder Role,       Stakeholder Rank,       Stakeholder
1,      Security and Access Systems,    1,              Mike Dawson
1,      Security and Access Systems,    2,              Jason Ortiz
1,      Security and Access Systems,    3,              Nick Kyle
1,      Security and Access Systems,    4,              Paul Haywood
2,      Estates and Facilities Division,1,              Richard Fuller
  • the Open list: starting from an initial list of 22 names and 28 stakeholder roles, four iterations of [Snowball sampling] resulted in a total of 61 responses containing 127 stakeholder names+priorities and 70 stakeholder roles,
  • the Closed list: a list of 50 possible stakeholders was created from the RALIC project documentation plus names of other UCL staff added as noise. The people on this list were asked to indicated which of those names on the list they considered to be stakeholders and to assign them a salience between 1 and 10, they were also given the option to suggest names of possible stakeholders. This process generated a list containing 76 stakeholders names+priorities and 39 stakeholder roles.

The following is an extract from the last two stakeholder lists:

stakeholder     stakeholder role salience
David Ainsley   Ian More        1
David Ainsley   Rachna Kaplan   6
David Ainsley   Kathleen Niche  4
David Ainsley   Art Waller      1
David Carne     Mark Wesley     4
David Carne     Lis Hands       4
David Carne     Vincent Matthew 4
Keith Lyon      Michael Wondor  1
Keith Lyon      Marilyn Gallo   1
Kerstin Michel  Greg Beech      1
Kerstin Michel  Mike Dawson     6

Is the data believable?

The data was gathered after the project was completed and as such it is likely to contain some degree of hindsight bias.

The data cleaning process is described in detail by Lim and looks to be thorough.


Predictions made in advance

Lim drew a parallel between the stakeholder prioritisation problem and the various techniques used to rank the nodes in social network analysis, e.g., the Page Rank algorithm. The hypothesis is that there is a strong correlation exists between the rank ordering of stakeholder roles in the Grounded truth list and the rank of stakeholder roles calculated using various social network metrics.


Applicable techniques

How might a list of people and the salience they assign to other people be converted to a single salience for each person? Lim proposed that social network metrics be used. A variety of techniques for calculating social network node centrality metrics have been proposed and some of these, including most used by Lim, are calculated in the following analysis.

Lim compared the Grounded truth ranking of stakeholder roles against the stakeholder role ranking created using the following network metrics:

  • betweenness centrality: for a given node it is a count of the number of shortest paths from all nodes in a graph to all other nodes in that graph that pass through the given node; the value is sometimes normalised,
  • closeness centrality: for a given node closeness is the inverse of farness, which is the sum of that node’s distances to all other nodes in the graph; the value is sometimes normalised,
  • degree centrality: in-degree centrality is a count of the number of edges referring to a node, out-degree centrality is the number of edges that a node refers to; the value is sometimes normalised,
  • load centrality: this is a variant of betweenness centrality based on the fraction of shortest paths through a given node. Support for load centrality is not available in the igraph package and so is not used here, this functionality is available in the statnet package,
  • pagerank: the famous algorithm proposed by Page and Brin <book Page_98> for ranking web pages.

Eigenvector centrality is another commonly used network metric and is included in this analysis.


Results

The igraph package includes functions for computing many of the common social network metrics. Reading data and generating a graph (the mathematical term for a social network) from it is particularly easy, in this case the graph.data.frame function is used to create a representation of its graph from the contents of a file read by read.csv.

The figure below plots Pagerank values for each node in the network created from the Open and Closed stakeholder salience ratings (Pagerank was chosen for this example because it had one of the strongest correlations with the Ground truth ranking). There is an obvious difference in the shape of the curves: the Open saliences (green) is fitted by the equation salience = {0.05}/{x^{0.5}} (black line), while the Closed saliences (blue) is piecewise fitted by salience = 0.05 * e^{-0.05x} and salience = 0.009 * e^{-0.01x} (red lines).

caption=

Figure 1. Plot of Pagerank of the stakeholder nodes in the network created from the Open (green) and Closed (blue) stakeholder responses (values for each have been sorted). See text for details of fitted curves.

To compare the ability of network centrality metrics to produce usable orderings of stakeholder roles a comparison has to be made against the Ground truth. The information in the Ground truth is a ranked list of stakeholder roles, not numeric values. The Stakeholder/centrality metric pairs need to be mapped to a ranked list of stakeholder roles. This mapping is achieved by associating a stakeholder role with each stakeholder name (this association was collected by Lim during the interview process), sorting stakeholder role/names by decreasing centrality metric and then ranking roles based on their first occurrence in the sorted list (see rexample[stakeholder]).

The Ground truth contains stakeholder roles not filled by any of the stakeholders in the Open or Closed data set, and vice versa. Before calculating role ranking correlation by roles not in both lists were removed.

The table below lists the Pearson correlation between the Ground truth ranking of stakeholder roles and for the ranking produced from calculating various network metrics from the Closed and Open stakeholder salience questionnaire data (when applied to ranks the Pearson correlation coefficient is equivalent to the Spearman rank correlation coefficient).

Table 1. Pearson correlation between Ground truth ranking of stakeholder roles and ranking created using various social network metrics (95% confidence intervals were around +/-0.2 of value listed; execute example R code for details).
betweenness closeness degree in degree out eigenvector pagerank
Open
0.63
0.46
0.54
0.52
0.62
0.60
Weighted Open
0.66
0.49
0.62
0.50
0.68
0.67
Closed
0.51
0.53
0.67
0.60
0.69
0.71
Weighted Closed
0.50
0.50
0.63
0.54
0.68
0.72

The Open/Closed correlation calculation is based on a linear ranking. However, plotting Stakeholder salience, as in the plot above, shows a nonlinear distribution, with the some stakeholders having a lot more salience than less others. A correlation coefficient calculated by weighting the rankings may be more realistic. The “weighted” rows in the above are the correlations calculated using a weight based on the equations fitted in the Pagerank plot above; there is not a lot of difference.


Discussion

Network metrics are very new and applications making use of them still do so via a process of trial and error. For instance, the Pagerank algorithm was found to provide a good starting point for ranking web pages and many refinements have subsequently been added to the web ranking algorithms used by search engines.

When attempting to assign a priority to stakeholder roles and the people that fill them various network metric provide different ways of interpreting information about relationships between stakeholders. Lim’s work has shown that some network metrics can be used to produce ranks similar to those actually used (at least for one project).

One major factor not included in the above analysis is the financial contribution that each stakeholder role makes towards the implementation cost. Presumably those roles contributing a large percentage will want to be treated as having a higher priority than those contributing a smaller percentage.

The social network metrics calculated for stakeholder roles were only used to generate a ranking so that a comparison could be made against the ranked list available in the Ground truth. A rank ordering is a linear relationship between stakeholders; in real life differences in priority given to roles and stakeholders may not be linear. Perhaps the actual calculated network metric values are a better (often nonlinear) measure of relative difference, only experience will tell.


Summary of findings

Building a successful application is a very hard problem and being successful at it is something of a black art. There is nothing to say that a different Ground truth stakeholder role ranking would have lead to the RALIC project being just as successful. The relatively good correlation between the Ground truth ranking and the ranking created using some of the network metrics provides some confidence that these metrics might be of practical use.

Given that information on stakeholders’ rating of other stakeholders can be obtained relatively cheaply (Lim built a web site to collect this kind of information <book Lim_11>), for any large project a social network analysis appears to be a cost effect way of gathering and organizing information.