
8.3. Threats to Validity and Validation Procedures

8.3.3. Internal Validity

Internal validity threats are concerned with the ability to draw conclusions from the relation between causes and effects [413]. While we tried to create an isolated environment for our study (e.g., by using mutation testing), we might have overlooked influencing variables that we did not account for. We applied different mechanisms to counter the influences that we are aware of, e.g., the normalization of results for calculating the defect detection capabilities (Section 6.2).

Furthermore, with our current approach we are not able to differentiate between integration and system tests. Hence, a system test may be classified as an integration test. To reduce this threat, we only included projects that are libraries or frameworks, because the probability that they contain system tests is lower than for applications.

Another threat to the internal validity of our study is the execution of our statistical tests. We rely on the accurate implementation of the algorithms that we used during our analysis. Hence, we use a well-known public library (i.e., SciPy [416]) to perform our statistical testing.
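
To illustrate this, the following minimal sketch shows how such a statistical test could be performed with SciPy. The choice of the Mann-Whitney U test and the two samples are assumptions for this example only, not the exact procedure used in the thesis.

```python
# Minimal sketch (assumption): compare a metric of unit and integration
# tests with a non-parametric Mann-Whitney U test using SciPy.
from scipy import stats

unit_values = [0.42, 0.37, 0.55, 0.61, 0.48]          # hypothetical sample
integration_values = [0.40, 0.39, 0.58, 0.66, 0.52]   # hypothetical sample

statistic, p_value = stats.mannwhitneyu(
    unit_values, integration_values, alternative="two-sided"
)
print(f"U = {statistic}, p = {p_value:.4f}")
```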

Within our qualitative analysis, we analyzed different resources such as research papers, developer comments, and other internet resources. However, another methodology or a paper that we might have overlooked could produce different results. Nevertheless, to mitigate this threat, our analysis is not only based on research papers to create a “scientific view”, but also includes other resources to create a “practitioners’ view”.

In this chapter, we conclude the thesis. To this end, we provide a summary of our work and give an outlook on potential future work.

9.1. Summary

In this thesis, we presented a qualitative and quantitative analysis of the differences between unit and integration tests. First, we analyzed the distribution of unit and integration tests in open-source software projects according to different definitions. Afterwards, we explored and analyzed six differences between unit and integration tests that are mentioned in the standard literature. Three of these differences were analyzed quantitatively, the other three qualitatively.

We designed and implemented several approaches to collect data from software projects. We developed an approach to classify software tests into unit and integration tests based on the definitions of the IEEE and ISTQB and on the developer classification. Furthermore, we designed approaches to collect the TestLOC and pLOC of software tests. We also collected the defect detection capabilities of tests, for which we applied mutation testing in combination with an approach to classify the mutants into different defect classes. Moreover, we designed an approach to extract the defect-locality of software tests using artificial defects. All of these approaches are combined into two different frameworks, which are open-source and free to use for the research community. This facilitates the replication of our study and enables other researchers to contribute to the body of knowledge of software testing research.
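
For illustration, the sketch below shows the kind of coverage-based classification logic involved. The rule that a test counts as a unit test exactly if it covers production code of a single unit is a simplification for this example, and all names (TestCoverage, covered_units) are hypothetical rather than part of our frameworks.

```python
# Simplified sketch of a definition-based test classifier (assumption:
# a test is a unit test if its coverage touches only one production unit,
# e.g., one class; otherwise it is treated as an integration test).
from dataclasses import dataclass


@dataclass
class TestCoverage:
    test_name: str
    covered_units: set[str]  # e.g., fully qualified class names


def classify(test: TestCoverage) -> str:
    return "unit" if len(test.covered_units) <= 1 else "integration"


coverage = TestCoverage("testParseDate", {"util.DateParser", "util.Calendar"})
print(classify(coverage))  # -> "integration"
```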

The quantitative analysis was conducted as a case study including 27 Java and Python projects with more than 49,000 tests. We classified the tests into unit and integration tests according to the definitions of the IEEE and ISTQB and collected the above-mentioned data for them.

We found that most current open-source projects possess more integration than unit tests if tests are classified according to the above-mentioned definitions. Nevertheless, there are differences in the number of unit and integration tests between the developer classification and the classification using the definitions of the IEEE and ISTQB. Our quantitative analysis of the execution time revealed no differences between unit and integration tests for the execution time per covered line of code. More surprisingly, our results indicate that there is no difference in the overall and defect-type-specific effectiveness between unit and integration tests. The last part of our quantitative analysis was the assessment of the defect-locality of unit and integration tests. Here, we found a statistically significant difference between unit and integration tests for the IEEE and ISTQB definitions only, showing that unit tests are more likely to pinpoint the source of a defect.

For the qualitative analysis, we reviewed related literature from both the research and the industrial perspective to create a holistic view on the analyzed differences. Our qualitative analysis highlighted that research is missing in most of the analyzed fields to draw scientifically grounded conclusions. The results of our analysis of the test execution automation showed that unit as well as integration tests are executed automatically.

There exist different approaches for the execution automation of unit and integration tests, e.g., CI systems. The results of our qualitative analysis of the test objective highlighted that efficiency testing is mostly done on the system level, while approaches exist that can be used on the unit and integration levels. We also identified a need for automated efficiency testing on lower test levels. Maintainability testing must be done on all levels to assess all aspects of maintainability. Nowadays, projects often only make use of the Maintainability Index (MI) to assess the maintainability of software, which is rather limited. More research is needed in this direction.
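
For context, one commonly cited variant of the MI (stated here only for illustration; several variants with additional terms, e.g., for comments, exist) aggregates only size and complexity metrics into a single number, which shows why the MI alone covers a rather narrow slice of maintainability:

\mathit{MI} = 171 - 5.2\,\ln(\overline{V}) - 0.23\,\overline{CC} - 16.2\,\ln(\overline{\mathit{LOC}})

where \overline{V} is the average Halstead Volume per module, \overline{CC} the average cyclomatic complexity, and \overline{\mathit{LOC}} the average lines of code per module.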

The last aspect that we analyzed to assess the differences in the test objective is the use of robustness testing on the unit and integration levels. We found that robustness testing is well researched and performed on the unit, integration, and system levels. The last difference that we analyzed qualitatively is the difference in test costs. We found that the current research on this topic cannot provide a definite answer to the question of whether unit or integration tests are more costly in their development, maintenance, and execution. Nevertheless, the analysis of the experience of developers highlighted that integration tests are more costly. Empirical studies with real software projects are still missing to come to a definite conclusion on this topic.

9.2. Outlook

There are several open problems and possible improvements that provide opportunities for future research. Our quantitative analysis can be improved in several ways. Overall, we could use more projects to assess whether our results are generalizable. This would improve the external validity of our results and could provide further insights. Furthermore, our analysis could be done with other types of projects. Within this thesis, we focused on frameworks and libraries, but extending the focus to also include applications could yield interesting findings, especially as the tests of frameworks and libraries are rather different from tests for an application. We performed our analysis on open-source projects only, because we do not have industrial data available, but the use of industrial project data for the quantitative analysis could provide further insights. A comparison of the results of the quantitative analysis between open-source and industrial projects would be especially interesting, as tests are often developed differently in an industrial context.

Future research could also include a manual analysis of test cases. This, in addition to our automated quantitative analysis, could provide insights into the reasons for our results. For example, we could manually analyze test cases and their (not) detected defects and defect types. This could help us to understand why certain tests do not detect certain defect types. These results could be used to develop tests that detect specific defect types.

Furthermore, we could perform our quantitative analysis on different releases of the same project. This way, we could assess how (and if) the results change during the evolution of the software. For example, it would be interesting to assess whether there are more unit tests at the beginning of the development in contrast to later versions. Moreover, checking when and if there is a transition of tests from a unit to an integration test (or vice versa) could provide us with insights into software testing practices. The results from this analysis could help us to guide the evolution of software tests.

Another possible research direction is to assess whether the development paradigm has an influence on the distribution and quality of tests on the different test levels. For example, we could apply our approach to projects that follow the Test-Driven Development (TDD) paradigm and compare these results with results from other projects. This way, we could assess the influence of TDD on the number of unit and integration tests, as well as its influence on, e.g., the effectiveness of these tests.

There are also several opportunities to improve our frameworks. Currently, our COMFORT framework is not able to differentiate integration from system tests. This could be addressed by developing a methodology to differentiate these test types from each other. One approach could be to analyze whether a test assesses the software through its main interface (e.g., the main method of a Java program). If this is the case, the test is most likely designed as a system test, and as an integration test otherwise.
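
A possible heuristic for this is sketched below, under the assumption that the set of methods called by a test is available; the function and the entry-point name are hypothetical.

```python
# Hypothetical sketch: separate system tests from integration tests based
# on whether a test exercises the program's main entry point.
def is_system_test(called_methods: set[str],
                   main_entry: str = "com.example.Main.main") -> bool:
    # A test that invokes the application's main interface is most likely
    # designed as a system test; otherwise we treat it as an integration test.
    return main_entry in called_methods


calls = {"com.example.Main.main", "com.example.util.Parser.parse"}
print("system" if is_system_test(calls) else "integration")  # -> "system"
```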

A further improvement could be the extension of our defect classification approach. In addition to classifying the defects into several defect classes, a severity could be assigned to each defect. This way, we could include the severity of defects in our analysis to evaluate which defects of which severity are detected by unit and/or integration tests.
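
A minimal sketch of such an extension is shown below; the defect classes and severity levels are hypothetical examples, not the classes used in our approach.

```python
# Hypothetical sketch: attach a severity level to each classified defect.
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class ClassifiedDefect:
    defect_class: str   # e.g., "computation", "interface", "data"
    severity: Severity


defect = ClassifiedDefect(defect_class="interface", severity=Severity.HIGH)
print(f"{defect.defect_class}: {defect.severity.name}")
```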

We were able to provide a qualitative analysis of the differences between unit and integration tests regarding the test execution automation, the test objective, and the test costs. Further work could be a quantitative evaluation of these aspects. This might not be easily possible, as data might not be available for open-source projects. For example, evaluating the costs of tests is only possible using industrial data. Furthermore, we could repeat our qualitative analysis using a different methodology (e.g., conducting a Systematic Literature Review (SLR) for each difference).

In addition to the above-mentioned directions and research opportunities, we plan to perform a developer study on unit and integration testing practices. We hope to receive feedback from developers on how they use unit and integration tests in their work. This feedback could help us to understand why developers classify their tests differently and not according to the definitions. The results of this study could lead to new definitions for unit and integration tests that better reflect the current development reality. In connection with this study, the usage of unit and integration tests in different development models or phases would be interesting to assess. We could compare the usage of both test types in a classical and an agile development environment to gather information about their usage and the differences.
