
5.3. Evaluation of RQ 1.2: Test Distribution according to Developer Classification

This RQ is similar to RQ 1.1, but evaluates the trend that more integration than unit tests are developed from a different perspective. Within this RQ, we want to assess whether the test level classifications done by the developers show the trend explained in the introduction. This section summarizes our analysis procedure (Section 5.3.1), as well as the results of the analysis (Section 5.3.2).

5.3.1. Analysis Procedure

For the assessment of this RQ, we acquire the test level classification according to the DEV rule set for each test that we analyze. Afterwards, we perform the same analysis as presented in Section 5.2.1, including the normalization, for the same reasons.

Our analysis process is described in detail in the following:

1. Gather the Data: We gather the tests with their classification results that were created via the DEV rule set, i.e., we query U_DEV and I_DEV from the database. Furthermore, we query the sum of TestLOC for these sets, i.e., tl(U_DEV) and tl(I_DEV).

2. Normalization: Afterwards, we perform the same normalization and create the same metric sets as explained in Section 5.2.1. Hence, we create the NM_C(U_DEV), NM_C(I_DEV), NM_TL(U_DEV), and NM_TL(I_DEV) sets.

3. Check Preconditions for Statistical Testing: We separately check for each of the above mentioned sets whether the values within these sets follow a normal distribution using the Shapiro-Wilk test (Section 2.3.5.1). Moreover, we separately test for homoscedasticity using the Brown-Forsythe test (Section 2.3.5.2) between NM_C(U_DEV) and NM_C(I_DEV), as well as between NM_TL(U_DEV) and NM_TL(I_DEV). These tests are performed in order to choose the correct statistical significance test in the next step.


Figure 5.2.: Box-plots of the nm_C metric (left) and nm_TL metric (right) for unit and integration tests and the DEV rule set. The points in the plot represent the concrete values for each project.

4. Statistical Testing: Based on the results of the previous step, we choose a fitting one-sided significance test (Section 2.3.1) to test for differences between NM_C(U_DEV) and NM_C(I_DEV), as well as between NM_TL(U_DEV) and NM_TL(I_DEV), for the same reasons as explained in Section 5.2.1. A sketch of steps 3 and 4 is shown after this list.
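Steps 3 and 4 can be sketched in Python with SciPy as follows. This is a minimal sketch, not our actual tooling: the function name compare_sets and the per-project input lists are illustrative, and the alpha level for the precondition checks is assumed to match the .005 level used for the significance tests in this chapter. Note that SciPy implements the Brown-Forsythe test as Levene's test with the median as center.

```python
from scipy import stats

ALPHA = 0.005  # critical alpha level used throughout this chapter

def compare_sets(unit_values, integration_values):
    """Check preconditions and run a fitting one-sided significance test.

    unit_values / integration_values: per-project metric values, e.g.,
    the NM_C(U_DEV) and NM_C(I_DEV) sets.
    """
    # Step 3a: Shapiro-Wilk test for normality (Section 2.3.5.1).
    _, p_norm_u = stats.shapiro(unit_values)
    _, p_norm_i = stats.shapiro(integration_values)
    normal = p_norm_u > ALPHA and p_norm_i > ALPHA

    # Step 3b: Brown-Forsythe test for homoscedasticity (Section 2.3.5.2);
    # in SciPy this is Levene's test with center="median".
    _, p_var = stats.levene(unit_values, integration_values, center="median")
    homoscedastic = p_var > ALPHA

    # Step 4: choose a fitting one-sided test. With non-normal data, the
    # Mann-Whitney U test is the usual choice; alternative="greater" tests
    # whether the integration-test values tend to exceed the unit-test ones.
    if normal and homoscedastic:
        stat, p = stats.ttest_ind(integration_values, unit_values,
                                  alternative="greater")
    else:
        stat, p = stats.mannwhitneyu(integration_values, unit_values,
                                     alternative="greater")
    return stat, p
```

The branch on normality and homoscedasticity mirrors the purpose of step 3: the parametric t-test is only appropriate when its preconditions hold, otherwise the non-parametric Mann-Whitney U test is used.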

5.3.2. Results

Table 5.5 depicts the nm_C and nm_TL metrics for each project for the DEV rule set. Overall, there are 14 projects for the nm_C metric where the trend of developing more integration than unit tests holds true. These projects are mostly Java projects (11), while only three Python projects follow the trend. The numbers are similar for the nm_TL metric: overall, there is one more project than for the nm_C metric where the trend of developing more integration than unit tests holds true. These 15 projects include the same 11 Java projects as for the nm_C metric. Furthermore, one more Python project is part of this set. Nevertheless, the mean for both metrics is higher for unit tests (20.92 (nm_C), 608.93 (nm_TL)) than for integration tests (16.19 (nm_C), 516.83 (nm_TL)), while all sets have a high standard deviation. This gives a first hint that there are more unit than integration tests if we rely on the developer classification.

Figure 5.2 depicts box-plots of the nm_C (left) and the nm_TL (right) metrics. If we compare them visually, we can determine that the medians for both sets (i.e., unit and integration tests) are similar in both figures. This indicates that there is no difference in either metric between unit and integration tests.
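A plot in the style of Figure 5.2 can be reproduced from the values in Table 5.5 with a few lines of matplotlib; this is only a sketch, and the lists below contain the first five projects of the table as an example.

```python
import matplotlib.pyplot as plt

# Values for the first five projects of Table 5.5; extend the lists with
# the remaining rows to reproduce the full figure.
nm_c_unit = [0.00, 35.96, 82.67, 14.68, 32.41]         # nm_C(U_DEV)
nm_c_int = [35.16, 6.94, 14.97, 24.36, 19.10]          # nm_C(I_DEV)
nm_tl_unit = [0.00, 412.21, 4000.03, 281.17, 502.65]   # nm_TL(U_DEV)
nm_tl_int = [1659.91, 95.86, 1334.89, 807.20, 262.77]  # nm_TL(I_DEV)

fig, (ax_c, ax_tl) = plt.subplots(1, 2, figsize=(8, 4))
ax_c.boxplot([nm_c_unit, nm_c_int], labels=["unit", "integration"])
ax_c.set_ylabel("nm_C")
ax_tl.boxplot([nm_tl_unit, nm_tl_int], labels=["unit", "integration"])
ax_tl.set_ylabel("nm_TL")
fig.suptitle("DEV rule set")
plt.show()
```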

While we do not see a difference in the data (Table 5.5) or visually (Figure 5.2), we need to assess this statistically. Hence, we performed several statistical tests. The concrete p-values and test statistics for each statistical test are reported in Appendix C.

Project                nm_C(U_DEV)  nm_C(I_DEV)  nm_TL(U_DEV)  nm_TL(I_DEV)
commons-beanutils             0.00        35.16          0.00       1659.91
commons-codec                35.96         6.94        412.21         95.86
commons-collections          82.67        14.97       4000.03       1334.89
commons-io                   14.68        24.36        281.17        807.20
commons-lang                 32.41        19.10        502.65        262.77
commons-math                 26.00         5.05        686.90        137.23
druid                         1.26        16.22         26.52        279.01
fastjson                      1.99        26.40         19.23        276.86
google-gson                  13.51        33.46        128.79        488.47
guice                         5.53        17.41        114.01        343.58
HikariCP                      1.50         8.91         39.17        504.85
jackson-core                  2.14        16.47         44.73        546.01
jfreechart                   15.70         0.51        243.96          8.67
joda-time                    16.12        32.47        312.89        704.04
jsoup                        21.35        10.01        434.25        231.29
mybatis-3                     8.80        12.60        203.29        411.57
zxing                         1.07        11.55         25.05        458.80
ChatterBot                   12.08        25.49         79.26        386.35
csvkit                       61.95         0.00        720.69          0.00
dpark                         4.53         0.08        119.12          0.95
mrjob                        45.45         4.31       4818.83       2702.85
networkx                     23.03        21.59        532.44        540.06
pyramid                      50.62         7.70       1660.72        173.30
python-telegram-bot          26.80         6.28        329.06         44.42
rq                           33.38         3.85        404.73         57.77
schematics                    9.38        31.07        128.09        673.68
scrapy                       16.98        45.27        173.44        823.94
Mean                         20.92        16.19        608.93        516.83
StDev                        20.46        12.18       1152.01        587.75

Table 5.5.: Normalized test count values (nm_C) and normalized tl values (nm_TL) for each project.

The Shapiro-Wilk tests showed that only the nm_C values for the set of integration tests follow a normal distribution, while all other tested sets do not. The Brown-Forsythe test showed that all tested sets are homoscedastic. Consequently, we used the one-sided Mann-Whitney U test. The U-statistic was not significant at the .005 critical alpha level for all tested sets. Therefore, we fail to reject H0 for all of the tested sets.

In addition to the one-sided Mann-Whitney U test, we performed a two-sided test to assess whether there are any differences at all between unit and integration tests for the two assessed metrics. The U-statistic was not significant at the .005 critical alpha level for all tested sets. Hence, in addition to the one-sided test, we also fail to reject H0 for all tested sets for the two-sided test. Therefore, we conclude that there is no statistically significant difference between unit and integration tests according to the developer classification.
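The two-sided variant differs from the one-sided test only in the alternative hypothesis. A minimal SciPy sketch, again using the first five nm_C rows of Table 5.5 as placeholder input:

```python
from scipy import stats

# Per-project nm_C values (Table 5.5, first five projects as an example).
nm_c_unit = [0.00, 35.96, 82.67, 14.68, 32.41]
nm_c_int = [35.16, 6.94, 14.97, 24.36, 19.10]

# Two-sided Mann-Whitney U test: is there *any* difference between the
# unit and integration distributions, regardless of direction?
stat, p = stats.mannwhitneyu(nm_c_int, nm_c_unit, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```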

Answer to RQ 1.2: Table 5.5 and Figure 5.2 show that there are no differences in the median and the mean between the amount of unit and integration tests (measured in nm_C and nm_TL), while a high standard deviation exists for the nm_TL metric. To assess if the table and figures provide a correct view, we performed several statistical tests, all of which failed to reject H0, including the two-sided test. Therefore, we conclude that when relying on the developer classification, the trend of developing more integration than unit tests is not visible in open-source software projects. Furthermore, our two-sided test shows that there is no difference at all between unit and integration tests for the examined metrics according to the developer classification.

5.4. Evaluation of RQ 1.3: Developer Classification according to