
Automated Parameter Estimation for Network Based Models

5. The Gathering of Parameters for Model Execution


metrics.²

issueSHARK: This tool gathers information from ITSs and thus provides data for the bug creation in the simulation.³

These three plug-ins of SmartSHARK provide the necessary data for our mining tools to estimate simulation parameters. The data model created by the plug-ins is presented in Section C.2.

For the experiments in this thesis, which are based on SmartSHARK data, we use a MongoDB instance hosted at the Institute of Computer Science at the University of Göttingen. This database contains data of several Apache Java projects analyzed with SmartSHARK.

5.2. Automated Parameter Estimation for Network Based Models

The mining framework presented in this section gathers simulation parameters of projects that have previously been analyzed with SmartSHARK. It consists of the following four components, as depicted in Figure 5.2.

Developer Information Provider: Since the simulation parameters mainly depend on the developers' contributions to the project, this tool collects all developers who are authors of at least one commit. Afterwards, two tasks are performed. First, because the same developer sometimes commits under different e-mail addresses, the identities of the developers are merged using an adapted identity merging algorithm. Second, developers, or more precisely the groups of developers belonging to one identity, are classified into different types.

Bug Information Provider: This tool investigates the Issue collection and provides information about the number of bugs that are created and fixed per year. Furthermore, all occurring priorities are mapped to the following three: major, normal, minor.

Commit Analyzer: To provide update probabilities for several commit types, this tool investigates the FileAction information of each Commit. Thus, the number of updated, added, and deleted files per commit as well as the number of commits by type are available. This information is used to calculate the geometric distribution probabilities that describe the developer's contribution behavior. Currently, we use two commit types: a general commit type and a bugfix commit type. The commit classification based on [141] is provided by SmartSHARK.

² https://smartshark.github.io/mecoSHARK/intro.html

³ https://github.com/smartshark/issueSHARK


[Figure 5.2 shows the four components. Developer Information Provider: identity merging and developer classification. Bug Information Provider: information about open bugs and fixed bugs, based on ITS information. Commit Analyzer: computes several commit probabilities. Change Coupling Creator: creates a change coupling graph with additional file information about developer touches and assigned categories. Relation label in the diagram: "uses".]

Figure 5.2.: Components of the overall process of automated parameter estimation.

Change Coupling Creator: This tool creates a change coupling graph for every year or any selected point in time as input for the simulation.

The data computed by the first three mining tools is exported as described in the example in Section A.1. An example of the exported change coupling graph can be found in Section A.2.
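The text does not spell out how the Commit Analyzer derives the geometric distribution probabilities from the per-commit file counts. A minimal sketch, assuming a maximum likelihood fit of a geometric distribution over the number of files k = 0, 1, 2, ... touched per commit, could look as follows. The function name `geometric_mle` and the sample counts are hypothetical and not taken from the mining framework.

```python
from statistics import mean


def geometric_mle(counts):
    """Maximum likelihood estimate of p for P(K = k) = (1 - p)**k * p, k >= 0.

    Assumption: the support starts at 0 (a commit may touch no file of the
    given kind). For a support starting at 1 the estimate would be 1 / mean.
    """
    if not counts:
        raise ValueError("at least one commit is required")
    return 1.0 / (1.0 + mean(counts))


# Hypothetical numbers of updated files in five commits of one commit type.
updated_files_per_commit = [0, 1, 3, 0, 1]
p_update = geometric_mle(updated_files_per_commit)
```

The same estimate would be computed separately for added and deleted files and for each commit type.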

5.2.1. Developer Identity Merging

Our identity merge algorithm follows the algorithm presented in Figure 5.3, which was introduced by Goeminne et al. [69]. It is a default merge algorithm that takes a set of people and returns a set of grouped identities. Whether people are merged depends mainly on whether a person who has not yet been added to an identity group can be merged with the person selected in the current iteration and thus be added to the identity group tmpMerges currently under construction.

The function canMerge(m, i, t) returns a boolean indicating whether an identity i can be merged into the identity group m according to the parameter t. To decide whether an identity can be merged, we implemented an extended version of the simple algorithm presented in [69]. Since the person objects from the SmartSHARK database include the name, the e-mail address, and the username, we


P {Set of persons P = {p1, p2, p3, ..., pn}}
t {Similarity threshold, t > 0}
identities ← emptySet() {The merged identities}
while isNotEmpty(P) do
    p ← getFirstElement(P)
    tmpMerges ← emptySet() {merged ID group for the current person p}
    insert(tmpMerges, p)
    remove(P, p)
    for all x ∈ P do
        if canMerge(tmpMerges, x, t) then
            insert(tmpMerges, x)

Figure 5.3.: General identity merge algorithm adapted from [69].

can use all of this information to compare identities. For each person, we create labels from all available information. One label is created for the username. An additional label is created for the prefix of the e-mail address. Further labels are created from the name, depending on the number of parts into which the name can be split. A dot or a space is used as the separator. The following labels are created.

• Separators = 0: A label for the name is created.

• Separators = 1: Assuming that a name consists of two parts, representing the first name f and the last name l, we create labels for the following combinations: l.f, f.l, lf, fl.

• Separators = 2: Assuming that a name consists of three parts, representing the first name f, the middle name m, and the last name l, we create labels for the following combinations: l.f, f.l, lf, fl, l.m, m.l, lm, ml, fml, fm.l.

To add a person to an identity group, all generated labels of the current person are compared with the labels of the current merge group tmpMerges. If two labels agree in their first t characters, the person is added.

The comparison of the labels is made more robust by normalizing each label when it is generated: all spaces are removed and uppercase letters are converted to lowercase.

To omit identities that are insignificant, we prevent the generation of labels for common e-mail prefixes that probably do not represent a person, such as "mail", "dev null", "dev-null", "noreply", or "github". Furthermore, the name "unknown" is omitted.
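The label rules described above can be sketched as follows. This is an illustrative sketch, not the thesis implementation: the function names are hypothetical, the handling of names with more than three parts is an assumption (no name labels are created), and the normalization is applied after the name has been split.

```python
import re

# Prefixes and names that the text excludes because they rarely identify a person.
IGNORED_EMAIL_PREFIXES = {"mail", "dev null", "dev-null", "noreply", "github"}
IGNORED_NAMES = {"unknown"}


def normalize(label):
    """Remove spaces and convert to lowercase."""
    return label.replace(" ", "").lower()


def name_labels(name):
    """Create labels for a name that splits into one, two, or three parts."""
    parts = [part for part in re.split(r"[. ]", name.strip()) if part]
    if len(parts) == 1:
        return {parts[0]}
    if len(parts) == 2:
        first, last = parts
        return {f"{last}.{first}", f"{first}.{last}", last + first, first + last}
    if len(parts) == 3:
        first, middle, last = parts
        return {
            f"{last}.{first}", f"{first}.{last}", last + first, first + last,
            f"{last}.{middle}", f"{middle}.{last}", last + middle, middle + last,
            first + middle + last, f"{first}{middle}.{last}",
        }
    return set()  # assumption: longer (or empty) names yield no name labels


def create_labels(name, email, username):
    """Collect normalized labels from the username, e-mail prefix, and name."""
    labels = set()
    if username:
        labels.add(username)
    prefix = email.split("@", 1)[0] if email else ""
    if prefix and prefix.lower() not in IGNORED_EMAIL_PREFIXES:
        labels.add(prefix)
    if name and name.strip().lower() not in IGNORED_NAMES:
        labels |= name_labels(name)
    return {normalize(label) for label in labels}


# Hypothetical person record.
example_labels = create_labels("Jane Doe", "jdoe@example.org", "jdoe")
```

For the hypothetical record above, the username and the e-mail prefix collapse into the single label jdoe, and the two-part name contributes four further labels.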

The authors of [69] found that a simple algorithm produces good results with the parameter t set to 3. This parameter choice also works well in our implementation, which we tested against a manually merged set of people.
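The merge loop of Figure 5.3 combined with the prefix comparison can be sketched as follows. The figure as reproduced here ends inside the inner loop, so removing merged persons from P, inserting the finished group into the result, and returning the result are assumptions derived from the statement that the algorithm returns a set of grouped identities. Persons are represented as hypothetical (id, labels) pairs whose labels are already normalized.

```python
def can_merge(group_labels, labels, t):
    """Return True if a group label and a person label share their first t characters."""
    # Assumption: labels shorter than t only match an identical label,
    # which is what plain slicing yields.
    group_prefixes = {label[:t] for label in group_labels}
    return any(label[:t] in group_prefixes for label in labels)


def merge_identities(persons, t=3):
    """Group persons, given as (person_id, set_of_labels), into identities."""
    remaining = list(persons)
    identities = []
    while remaining:
        person_id, labels = remaining.pop(0)
        group_ids = [person_id]        # corresponds to tmpMerges
        group_labels = set(labels)
        not_merged = []
        for other_id, other_labels in remaining:
            if can_merge(group_labels, other_labels, t):
                group_ids.append(other_id)
                group_labels |= other_labels
            else:
                not_merged.append((other_id, other_labels))
        remaining = not_merged         # assumed: merged persons leave P
        identities.append(group_ids)   # assumed: the group is stored once complete
    return identities


# Hypothetical persons with pre-computed labels.
persons = [
    ("a", {"jdoe", "doe.jane"}),
    ("b", {"doe.j", "jd"}),
    ("c", {"alice"}),
]
groups = merge_identities(persons, t=3)
```

In this example, persons a and b end up in one group because doe.jane and doe.j share the prefix doe, while c remains alone. A short prefix such as t = 3 can also merge unrelated people, which is why a check against a manually merged set is useful.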

5.2.2. Developer Classification

The developers are classified into core and peripheral according to the prominent onion model, as well as into key, major, minor, and maintainer, as introduced for our simulation models.

The classification is based on the contribution of the developer or, more precisely, of the group of identities the developer belongs to. For this purpose, the contributions of all identities in a group are added up. This requires that the identities have been merged beforehand.

5.2.3. Change Coupling

To initialize the simulation at any point in time, we need a continuous change coupling graph. This means that changes are not forgotten after a certain time. Therefore, realistic developer information can be included in the graph. Before the change coupling graph is generated, two preparatory tasks have to be performed.

First, the collection of analyzed commits is sorted in ascending order by commit date. Second, all files in a commit that have been renamed are mapped to a representative. Thus, developer information can be assigned to a file even if it has been renamed. Afterwards, the change coupling graph is created as follows.

During the iteration over all commits, the change coupling graph is generated successively. For each commit, the following steps are executed:

1. For each file, the representative is looked up in the previously created map. It differs from the actual file only if the file has been renamed. In the following, the term file always refers to the representative.

2. It is checked whether the file is already contained in the current change coupling graph. If so, its developer information, for example the number of touches and the owner, is updated. Otherwise, a new file node is generated with initial information about the creator, its category, and a unique name.

3. Finally, the edges are created or updated. If an edge already exists between two different files of a commit, its weight is incremented by 1. Otherwise, a new edge of weight 1 is generated.
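The three steps can be sketched as follows. The data layout (commits as dictionaries, the rename map as a dictionary) and the function name are hypothetical. Treating the owner as the developer with the most touches is an assumption, since the text does not define ownership, and the file category is omitted.

```python
from itertools import combinations


def build_change_coupling_graph(commits, representative):
    """Build file nodes and weighted edges from commits.

    commits: list of dicts with the keys 'date', 'author', and 'files'.
    representative: maps a renamed file path to its representative path.
    """
    nodes = {}  # file -> creator, touches per developer, owner
    edges = {}  # (file_a, file_b) with file_a < file_b -> weight
    # Preparatory task: process the commits in ascending date order.
    for commit in sorted(commits, key=lambda c: c["date"]):
        author = commit["author"]
        # Step 1: replace every file by its representative.
        files = {representative.get(path, path) for path in commit["files"]}
        # Step 2: update the developer information or create a new file node.
        for path in files:
            if path in nodes:
                touches = nodes[path]["touches"]
                touches[author] = touches.get(author, 0) + 1
                # Assumption: the owner is the developer with the most touches.
                nodes[path]["owner"] = max(touches, key=touches.get)
            else:
                nodes[path] = {"creator": author, "owner": author,
                               "touches": {author: 1}}
        # Step 3: create an edge of weight 1 or increment an existing edge.
        for pair in combinations(sorted(files), 2):
            edges[pair] = edges.get(pair, 0) + 1
    return nodes, edges


# Hypothetical input: three commits and one renamed file.
commits = [
    {"date": "2016-02-01", "author": "bob", "files": ["B.java", "C.java"]},
    {"date": "2016-01-01", "author": "alice", "files": ["A.java", "Old.java"]},
    {"date": "2016-03-01", "author": "bob", "files": ["A.java", "B.java"]},
]
nodes, edges = build_change_coupling_graph(commits, {"Old.java": "B.java"})
```

Because edges are never removed or decayed, the graph stays continuous in the sense described above: stopping the iteration at a given commit date yields the graph for that point in time.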

5.3. Parameter Estimation for the Modeling of Refactorings