Articles
Ubiquity
Volume 2022, Number June (2022), Pages 1-11
Ubiquity Symposium: Workings of science: Debunked software theories
Walter Tichy
DOI: 10.1145/3512338
Falsifiability is a cornerstone of science. It states that scientific claims—propositions, hypotheses, theories—must be testable by experiment. A scientific claim is falsified if an empirical test contradicts it; if a claim withstands repeated attempts at falsification, it is accepted as fact. This article discusses three examples of falsified theories about software. They address the reliability of multi-version programs, the prediction of program bugs by means of software metrics, and the advantages of software models (UML). These examples demonstrate how falsifiability can eliminate incorrect theories and help reorient research and practice.
The importance of science in people's lives is growing. Yet trust in science seems to be decreasing. Falsifiability is a fundamental principle of science. It was introduced in 1934 by philosopher Karl Popper in his book The Logic of Scientific Discovery. Popper maintained that scientific claims—propositions, hypotheses, theories, models, assumptions—must be posed in such a way that they are open to being proved false when tested. The reason is that universally quantified claims usually cannot be verified; they can only be falsified. For example, the proposition "All swans are white" is not verifiable, because that would require finding all swans and checking their plumage. Even if we recruited thousands of people all over the world to help, we would not be able to check future and dead swans. However, a single counterexample can discredit the proposition. Indeed, a species of black swans (Cygnus atratus) inhabits Australia. Its existence refutes the proposition. Falsifiability is used to weed out incorrect scientific claims. For a claim to be accepted, it must withstand falsification attempts.
We will look at three software theories whose claims were tested with experiments. The first theory says multi-version programming produces far fewer errors than single-version programs. Its falsification has significant implications for program verification. The second theory claims certain software metrics predict defects in software modules. Its falsification casts doubt on whether metrics from one software project provide insight into the defects of another project. The third theory claims that using UML models (graphical representations of software architectures) leads to fewer software defects. Its falsification casts doubt on the utility of UML models. The three examples illustrate how the scientific process tests theories with experiments and debunks false claims. Besides being a catalyst for the search for better theories, falsification also helps practitioners free themselves from unworkable guidelines and techniques.
MULTI-VERSION PROGRAMMING
Multi-version programming is a method of software development in which the same specification is implemented several times by independent development teams, perhaps even in different languages. The versions are executed simultaneously on the same input. If the outputs are identical, then all is well. If they differ, then a majority vote determines the result. It is claimed that this approach, though expensive, improves reliability dramatically. But by how much? Typically, one uses three versions. Let's assume the failure probability p is the same for all three versions. Then the probability that all three versions fail on the same input is p³. For example, if p = 10⁻⁴, then all three versions will fail simultaneously once in a trillion inputs, which is truly rare. However, that's not the only way a voting ensemble of three can fail. It is also possible that exactly two versions fail, which would also produce the wrong result. That probability is 3p²(1 − p). Taking these two probabilities together for p = 10⁻⁴, we would expect a failure about once in every 30 million inputs. That's not bad at all. A user of a social network programmed with triple redundancy would experience a failure every 30 million posts. Even the most prolific poster would be unlikely to see a posting fail in a lifetime. No wonder a lot of research went into this topic. The software for commercial airliners, nuclear reactors, and other safety-critical applications was programmed that way. A tiny p could produce ultra-high reliability. One might even skimp on testing individual versions (and thereby reduce development costs) and still achieve respectable reliability.
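The arithmetic is easy to check. The following Python sketch (an illustration, not part of any study discussed here) computes the failure probability of a two-out-of-three voting ensemble under the independence assumption, using the example value p = 10⁻⁴ from above.

```python
# Failure probability of a 3-version voting ensemble,
# assuming the versions fail independently with probability p each.
def ensemble_failure_probability(p: float) -> float:
    all_three = p ** 3                      # all three versions wrong
    exactly_two = 3 * p ** 2 * (1 - p)      # majority wrong, one correct
    return all_three + exactly_two

p = 1e-4
q = ensemble_failure_probability(p)
print(f"ensemble failure probability: {q:.3e}")        # about 3.0e-08
print(f"i.e. roughly one failure per {1/q:,.0f} inputs")  # about 33 million
```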
But is this theory about the growth of reliability correct? It rests on a crucial assumption: that the versions fail independently. Is it truly the case that a double failure on the same input happens with a probability of p²? Doubts had accumulated because some program parts are more difficult to write than others. Wasn't it more likely for programmers to introduce bugs in the more difficult parts? Would that cause several versions to fail on the same input? In an exemplary experiment, Nancy Leveson and John Knight tested the assumption of independent failures for multi-version programs [1].
The experiment went as follows. Knight and Leveson chose the launch interceptor problem for testing. This program receives a set of radar reflections as input and produces an output vector of 241 Boolean values. The specification for that problem was available. In a preliminary study, all ambiguities in the specification had been removed. The experimenters recruited 27 students (18 from UC Irvine and nine from the University of Virginia), who implemented the specification independently; there was no interaction among the students during this work. The students received 15 test cases. In addition, acceptance tests were run with 200 randomly chosen test cases. Every time the acceptance test was run, a new test set was randomly generated to avoid selective filtering of bugs. The idea was to achieve high-quality implementations with small probabilities of failure. The implementations were between 300 and 1,000 lines of code.
Now came the crucial step. The goal was to count failures on the same input. After an implementation passed the acceptance test, it was subjected to one million test cases. For each test case, it was recorded whether the program failed or not. Failure meant that one of the 241 Boolean values was wrong. To handle that many tests, the experimenters thoroughly inspected and tested a "gold program" to make sure it was correct. Every departure from the gold program's output was counted as a failure.
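The testing harness behind this step is conceptually simple. The sketch below shows the idea under stated assumptions: gold and the entries of versions are placeholders for functions that map a radar input to the 241-element Boolean output vector, and random_input stands in for the experiment's random test-case generator; none of these names come from the original study.

```python
from collections import Counter

NUM_TESTS = 1_000_000

def run_tests(gold, versions, random_input, num_tests=NUM_TESTS):
    """Compare every version against the gold program on random inputs
    and record how many versions fail on each test case."""
    failure_histogram = Counter()            # k -> test cases on which exactly k versions failed
    per_version_failures = [0] * len(versions)

    for _ in range(num_tests):
        x = random_input()
        expected = gold(x)                   # vector of 241 Boolean values
        failed = 0
        for i, version in enumerate(versions):
            if version(x) != expected:       # any wrong Boolean counts as a failure
                per_version_failures[i] += 1
                failed += 1
        failure_histogram[failed] += 1

    coincident = sum(n for k, n in failure_histogram.items() if k >= 2)
    return per_version_failures, failure_histogram, coincident
```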
It turned out the implementations were highly reliable, with an average reliability of 99.93%. Six versions had no bugs at all. But there were a total of 1,255 coincident failures. This means two or more programs failed on 1,255 test cases. For instance, there were 551 test cases on which two implementations failed. Up to eight versions failed on the same input.
Now the question was whether 1,255 coincident failures were to be expected, given the reliability of the versions. To answer this question, the experimenters developed a probability distribution, built on the assumption that programs fail independently, that allowed them to calculate the probability of two or more versions failing in one million tests. Under this model, 1,255 coincident failures were extremely unlikely; the hypothesis of independent failures could be rejected at the 99% confidence level.
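A back-of-the-envelope version of that calculation (a simplification of the actual statistical model in [1]) is sketched below: given per-version failure probabilities, independence determines the chance that two or more of the 27 versions fail on one input, and multiplying by one million tests gives the expected number of coincident test cases. For lack of the original per-version data, the sketch treats the reported 99.93% average reliability as if it applied uniformly to every version.

```python
import math

def expected_coincident_failures(failure_probs, num_tests=1_000_000):
    """Expected number of test cases on which two or more versions fail,
    assuming every version fails independently."""
    p_none = math.prod(1 - p for p in failure_probs)
    p_exactly_one = sum(
        p * math.prod(1 - q for j, q in enumerate(failure_probs) if j != i)
        for i, p in enumerate(failure_probs)
    )
    p_two_or_more = 1 - p_none - p_exactly_one
    return num_tests * p_two_or_more

# Simplification: all 27 versions fail on 0.07% of inputs (99.93% reliability).
print(expected_coincident_failures([7e-4] * 27))  # about 170 -- far below the observed 1,255
```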
In hindsight, there is a logical explanation for this: Programmers probably make similar mistakes, particularly when the program logic gets complicated. This does not mean multi-version programming is useless. It might increase reliability, but not to the degree originally thought. So far, no satisfactory theory has emerged that permits an accurate assessment of the gain of reliability of multi-version programming.
Strictly speaking, the result by Knight and Leveson holds only for the launch interceptor program. But according to Karl Popper's philosophy of falsificationism, a single counterexample is enough to reject a universally quantified theory. (The theory of independent failures is universally quantified, because it applies to all independently constructed programs for any specification.) The paper by Knight and Leveson is a classic rejection of a theory by a single, well-constructed experiment. It punched a huge hole in the beliefs about multi-version programming.
Sadly, that was not the end of the story. A professor at UCLA, one of the founders of multi-version programming, and several of his students began attacking the paper. In their publications, they misrepresented the results or claimed the experiment was faulty. For instance, they claimed there was inadequate testing and the individual reliability of the program versions was unsatisfactory, even though the Knight and Leveson versions had far better reliability than in other experiments. They also claimed the rules of independent development were not followed properly, even though some of the experimental subjects were separated by 3,000 miles. Knight and Leveson personally contacted the purveyors of multi-version programming, but the misrepresentations did not stop. So they decided to publish an article to set the record straight [2]. This article disproves the criticisms one by one. It makes for fascinating reading and is a rare case of documenting the attempt at "survival" of a falsified theory. Knight and Leveson say it best: "Attacking one experiment … does not make N-version programming a reasonable way to ensure the safety of software."
We can only speculate why scientists attacked an extremely well-done experiment. We know fault tolerant computing was their research area, including multi-version programming. Perhaps they felt their research topic and funding were threatened. Perhaps it was a case of sunk costs: The researchers had already spent so much time and effort on multi-version programming that they were unwilling to give it up. Giving up could also be seen as an embarrassment. What we learn from all this is that research is a human endeavor and therefore fallible; it is not as perfect and objective as it is presented in textbooks. However, the remarkable thing about the scientific process is that gross errors and even fraud are eventually discovered and corrected, because researchers check each other's results.
The result by Knight and Leveson has consequences for another area, and that is program verification. In program verification, one constructs a mathematical proof that a program faithfully implements a given specification. The specification has to be written in a formal language to enable proof techniques. Obviously, there are two versions: a program and its formal specification. There may be flaws in both. The proof itself can be seen as a third version (which may also contain bugs). The proof, if correct, guarantees that the program and specification agree on all inputs. Thus, verification makes sure there are only coincident bugs left. It follows that no voting at runtime is necessary. What does the result by Knight and Leveson mean for program verification?
Research in program verification began in the 1960s and still is an active area today. Proponents claim verification will result in flawless programs. But this is not the case: An error in the specification will be faithfully replicated by the program, leading to a coincident failure. What's more, there may be errors in the proof or the proving program, causing more coincident failures. From the results of Knight and Leveson we now know these coincident failures are more likely than expected. Program verification may improve reliability, but not to the point of zero bugs. It is merely an instance of three-version programming.
SOFTWARE METRICS
A software metric is a standard for measuring software systems or software production processes. The simplest metric is lines of code (LOC), which measures the size of the software by counting its source code lines. If we know the number of defects in the program, we can determine its defect density by dividing the number of defects by LOC. There are dozens of software metrics; the English Wikipedia entry lists 28, but that list is far from complete. Software metrics are used for cost estimation, quality assurance, test prioritization, debugging, performance optimization, and other purposes. They are difficult to define objectively and meaningfully. For instance, what should count as a line of code in the LOC metric? We can perhaps agree that empty lines and pure comment lines should not count (although comment density might be a quality metric). Some programmers pack a lot into one line, while others prefer an "airy" layout style. To compensate for that, we might need to define a layout standard to go with the metric. But what does LOC tell us? The intuitive notion is that a long program takes more time to write and contains more bugs than a short one. But LOC does not consider the complexity of the individual lines (a looping statement is certainly more complex than the declaration of an integer variable), nor does it take into account the overall complexity of an algorithm. So it is entirely possible that a short program takes longer to write and contains more bugs than a longer one. And then there is the possibility of LOC bloat: Programmers might add dummy lines to inflate their code if their productivity is measured in LOC.
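Even this simplest of metrics requires a counting convention. The sketch below shows one possible, minimal convention (blank lines and pure comment lines excluded) together with the defect-density calculation; the comment prefix and the example numbers are merely illustrative.

```python
def count_loc(source: str, comment_prefix: str = "//") -> int:
    """Count non-blank, non-comment lines; one possible LOC convention."""
    loc = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefix):
            loc += 1
    return loc

def defect_density(defects: int, loc: int) -> float:
    """Defects per thousand lines of code (KLOC)."""
    return 1000 * defects / loc

# Example: a module with 4,200 counted lines and 13 known defects.
print(defect_density(13, 4200))   # about 3.1 defects per KLOC
```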
We will now explore a specific question, namely whether software metrics can be used to predict defect-prone software modules. This prediction would be helpful, because developers could then allocate more of their testing time and budget to those modules that are defect-prone and thereby improve the overall quality of the software.
An early attempt at predicting properties of software was based on Halstead's complexity metrics. Maurice Halstead claimed as early as 1977 that by counting a program's operators and operands one could predict the number of delivered bugs [3]. The research area was dubbed "Software Science," but its predictions were mostly wrong, so researchers quietly gave up on the idea. One of the difficulties at the time was that no data on real software projects and their defects was publicly available.
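To give a flavor of what Software Science proposed, the sketch below computes Halstead's program volume and his delivered-bugs estimate from operator and operand counts, using one common formulation of the formulas (B = V/3000); the counts in the example are invented.

```python
import math

def halstead_estimated_bugs(n1, n2, N1, N2):
    """Halstead's delivered-bugs estimate from operator/operand counts.

    n1, n2: number of distinct operators / operands
    N1, N2: total occurrences of operators / operands
    """
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)   # program "volume" in bits
    return volume / 3000                      # one common form of Halstead's B = V / 3000

# Invented counts for a small module:
print(halstead_estimated_bugs(n1=20, n2=35, N1=180, N2=130))
# roughly 0.6 "delivered bugs" -- a prediction that rarely matched reality
```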
In the years following, numerous papers about new and supposedly better metrics were published, because measuring properties of the software remained an important theoretical and practical problem. In 2006, along came a study by Nagappan et al. that tested whether software complexity metrics could predict defect-prone modules [4]. This time, they used massive amounts of data: five Microsoft projects with a total of more than one million LOC, along with the projects' complete development and defect histories. Moreover, the authors explored not just their favorite metric, but a total of 18 metrics from LOC to the fan-in/fan-out of calls to object-oriented metrics such as inheritance depth. At the time, that was one of the largest studies of software repositories. The researchers decided to analyze defects after the software had been released, because failures that show up after release are the only ones that matter to customers.
In principle, the analysis was straightforward, but involved a lot of data. For every project, the authors first collected all the faults reported by customers within the first six months after release. Next, each fault was mapped to the module(s) that contained the fault-generating defect. This information was available in the database for tracking bug fixes. Now the researchers were able to compute the number of post-release defects per module. Next, they computed the 18 software metric values for every module. Finally, they correlated the number of post-release defects with the 18 metric values.
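In code, the core of that last step boils down to one correlation per metric. The sketch below uses Spearman rank correlation from SciPy on a table with one row per module; the column names and numbers are illustrative and do not reflect the study's data.

```python
import pandas as pd
from scipy.stats import spearmanr

# One row per module: metric values plus the post-release defect count.
# Column names and values are illustrative only.
modules = pd.DataFrame({
    "loc":                  [1200, 340, 2210, 90, 1560],
    "fan_in":               [14, 3, 25, 1, 9],
    "inheritance_depth":    [2, 1, 4, 1, 3],
    "post_release_defects": [7, 0, 11, 0, 4],
})

for metric in ["loc", "fan_in", "inheritance_depth"]:
    rho, p_value = spearmanr(modules[metric], modules["post_release_defects"])
    print(f"{metric:>18}: correlation = {rho:+.2f}, p-value = {p_value:.3f}")
```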
When computing a correlation, one also gets a p-value that indicates statistical significance. By convention, a correlation is considered statistically significant if the p-value is 0.05 or less. Not all correlations were statistically significant at the 0.05 level. More importantly, every project required different metrics to reach statistical significance. For one project, only LOC was significant; for another, all metrics but one worked; for a third, the number of classes and related metrics provided significance, but nothing else did. There was no metric that achieved statistical significance for all projects.
The next step was to build an actual predictor. The authors chose linear regression using a combination of metrics. To cope with the strongly correlated metrics, they first reduced them to a few independent factors with a technique called principal component analysis. The predictors were constructed by randomly splitting the data set of each project into ⅔ and ⅓, where the larger part was used to build a regression model and the smaller part to predict. The predictors constructed in that fashion identified the most defect-prone modules fairly well, but only on the project on which they were trained.
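A minimal sketch of that modeling step, written with scikit-learn, might look as follows; the feature matrix X (one row per module, one column per metric) and the defect vector y are assumed to be given, and the library and parameter choices are ours rather than the study's.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def build_defect_predictor(X, y, random_state=0):
    """Train a regression model on 2/3 of a project's modules, evaluate on the rest.

    X: metric values, one row per module, one column per metric.
    y: post-release defect counts per module.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=random_state)

    # PCA turns the correlated metrics into a few uncorrelated components
    # before fitting the linear regression.
    model = make_pipeline(PCA(n_components=5), LinearRegression())
    model.fit(X_train, y_train)

    predicted = model.predict(X_test)
    # Rank modules by predicted defect count; the top of this ranking is
    # where testing effort would be concentrated.
    ranking = np.argsort(-predicted)
    return model, ranking, y_test
```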
Each predictor was trained and evaluated on the same project. In practice, such a predictor is useless, because in order to train the predictor, the project must have been released and defects collected—which means we already have the defects of all modules and won't need a predictor. The story becomes more interesting, however, if we train the predictor on a released project and use it on a second project that is still under development.
The authors indeed tested cross-project defect prediction with the five projects they had, yielding 20 project pairs. The result was that prediction was only successful for projects with similar defect distributions (two pairs), but not for any other pair. So while it was possible to build a predictor after the fact, predictors were accurate only when obtained from the same or similar projects—a discouraging result for software metrics.
A later study with an even larger software base examined cross-project defect prediction more broadly [5]. The authors analyzed eight commercial and four open-source systems, some of them in multiple versions, resulting in 28 data sets (more than 35 million LOC). The metrics used were the familiar LOC, the total number of components, the number of edits, the cyclomatic complexity metric, the number of developers, the number of bugs fixed during development, and the number of defects found after release. They also included code churn measures, which are the number of lines added, deleted, or changed, normalized by total LOC. Both the code base and the set of metrics were much broader than the previous study, though the metrics were not as numerous.
As before, the goal of the study was to predict defect-prone modules by building a predictor from one project and using it on all other projects. There were 622 usable cross-project pairs, but only 21 predictions had recall, precision, and accuracy above 75%. This is an abysmally low success rate of 3.4%. Simply flipping a coin might have fared better.
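The pass/fail criterion for a single cross-project pair can be written down directly. The sketch below assumes a classifier trained on one project has been applied to another project's modules, with label 1 meaning "defect-prone"; a pair counts as successful only if precision, recall, and accuracy all exceed 0.75.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def pair_successful(y_true, y_pred, threshold=0.75):
    """A cross-project prediction counts as successful only if
    precision, recall, and accuracy all exceed the threshold."""
    return (precision_score(y_true, y_pred) > threshold
            and recall_score(y_true, y_pred) > threshold
            and accuracy_score(y_true, y_pred) > threshold)

def success_rate(predictions):
    """predictions: list of (y_true, y_pred) pairs, one per cross-project pair."""
    successes = sum(pair_successful(y_true, y_pred) for y_true, y_pred in predictions)
    return successes / len(predictions)   # the study reports 21/622, about 3.4%
```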
The authors explored whether the success rate could be improved by identifying commonalities, such as domain, producer, or open source versus closed source, but the results were mixed. Using one project to predict defect-prone modules in another rarely worked. Apparently, cross-project defect prediction with current software metrics is not possible. It might be possible to build project-specific predictors using data from the development phase, such as defects detected during tests. Studies have suggested organizational structure plays a role: size and quality of the team, breadth and depth of the organizational structure, assignment of modules (difficult ones assigned to the best developers), even the number of engineers that left the project; all of these might have an effect. A project-specific predictor, however, is unsatisfactory. It won't allow us to derive general insights or provide recommendations to programmers for building better software.
SOFTWARE MODELING: UML
The Unified Modeling Language (UML) provides standardized graphical notations for describing the overall structure (the "architecture") of a software system and its behavior. It is taught in virtually all software engineering courses. UML diagrams serve as blueprints for software to be built. UML's main purpose is to document and communicate the result of design activities. UML can also help maintainers understand the software by providing an overview. The experiment discussed here addresses the second aspect: The cost and benefits of UML during maintenance.
The experiment by Dzidek et al. uses a classic design: Members of an experiment group and a control group are asked to perform the same tasks under two different conditions, and their performance is measured and compared [6]. The tasks were five maintenance problems to be solved on the same software. All participants received the source code and its documentation. The members of the experiment group, however, got additional information: UML diagrams describing the software's architecture and behavior (in terms of class diagrams for six packages and 16 sequence charts). The goal was to test the claim that UML diagrams help with modifying the software. The subjects with UML documentation also received a refresher course on UML and some training for a UML editor. What makes this experiment special is that the participants were all professionals and not, as is usually the case, students. The professionals were hired from consulting companies and were paid an hourly rate for their work. They were selected for being experienced with Java, the programming language of the software, and with UML.
The most interesting results concern the time to complete the maintenance problems and the correctness of the solution. It is important to note that work time and correctness are confounded, i.e., not independent. Obviously, shoddy work goes quickly, while careful work takes time. Thus, it makes no sense to compare completion times if quality differs. Similar to Knight and Leveson, the experimenters got around this problem by using an acceptance test, which all maintenance jobs had to pass. Programmers were allowed to submit to the acceptance test as often as they liked. In this way, the experimenters achieved comparable quality for all submissions.
The surprising result was that the group with access to UML was only 1.4% faster than the control group. This tiny advantage was statistically insignificant, meaning it was probably due to chance. Furthermore, a difference of 1.4% is irrelevant for all practical purposes. In absolute terms, the average work time for the UML group was 33 minutes, only half a minute faster than the control group. What's missing from this measure is the time taken to update the UML diagrams; updating is usually part of the job, because otherwise the diagrams become useless. When the update time is added, the UML group actually took 14% longer (this difference was also not statistically significant).
A second result concerned correctness. We've already seen that the acceptance test made sure correctness was comparable across submissions. The experimenters therefore used a surrogate measure for correctness: the number of acceptance test runs needed for passing. According to this metric, the UML group was 50% better, but it is unclear what the number of acceptance test runs means with respect to correctness. Programmers could write small increments and run the acceptance test frequently, or complete all tasks carefully and then run the acceptance test once and pass it (this actually happened). The results concerning correctness are therefore questionable.
This experiment is not as decisive as the other two examples. There are several threats to validity, including the following. The software to be modified was relatively small (2,900 LOC); perhaps UML is beneficial for larger projects. The program was well documented, so perhaps UML was not even needed. A small effect size combined with a small number of participants (10+10) leads to statistical problems. Also, some of the participants were not using UML properly. The time for acceptance testing was not included. Most important is that the result only applies to maintenance. Perhaps UML is more helpful when starting a new project, when no code exists and developers are drawing boxes and arrows to visualize the new system. UML might also provide a helpful abstraction for multi-platform development, say when programming two different mobile phones and little common code exists. None of these additional applications of UML have been tested in a controlled manner. While it is too early to abandon UML, the current experiment does raise serious doubts. For someone interested in ongoing research about the usefulness of UML, see the work by Scanniello et al. [7].
THREE BLACK SWANS
A black swan is a metaphor for a surprise event. All three experiments were surprises. The first one on multi-version programming is a classic case of falsification. It essentially terminated research on this topic, and thereby allowed work to go into more fruitful areas. It also raised troubling questions regarding program verification. The second study, predicting faults with metrics, used data from more than 30 projects, but failed to provide evidence that current metrics are suitable cross-project defect predictors. Research on questions of software quality and productivity continues, but better measuring devices are needed. The UML experiment, finally, is not a complete falsification, but raises serious doubts. This uncertainty leaves us in a state of suspense. If no beneficial use of UML is found, it will eventually fall by the wayside.
References
[1] Knight, J. C. and Leveson, N. G. An experimental evaluation of the assumption of independence in multiversion programming. IEEE Trans. Softw. Eng. 12, 1 (1986), 96–109. https://doi.org/10.1109/TSE.1986.6312924
[2] Knight, J. C. and Leveson, N. G. A reply to the criticisms of the Knight & Leveson experiment. SIGSOFT Softw. Eng. Notes 15, 1 (1990), 24–35.
[3] Wikipedia contributors. Halstead complexity measures. Wikipedia, The Free Encyclopedia. 2022.
[4] Nagappan, N., Ball, T., and Zeller, A. Mining metrics to predict component failures. In Proceedings of the 28th International Conference on Software Engineering (ICSE '06). ACM, New York, 2006, 452–461.
[5] Zimmermann, T., Nagappan, N., Gall, H., Giger, E., and Murphy, B. Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering (ESEC/FSE '09). ACM, New York, 2009, 91–100; https://doi.org/10.1145/1595696.1595713.
[6] Dzidek, W., Arisholm, E., and Briand, L. A realistic empirical evaluation of the costs and benefits of UML in software maintenance. IEEE Trans. Softw. Eng. 34, 3, (2008), 407–431.
[7] Scanniello, G., et al. Do software models based on the UML aid in source-code comprehensibility? Aggregating evidence from 12 controlled experiments. Empirical Software Engineering 23 (2018), 2695–2733.
Author
Dr. Walter Tichy has been professor of computer science at Karlsruhe Institute of Technology in Karlsruhe, Germany, since 1986. His research interests include software engineering, parallel computing, and artificial intelligence. He is best known for his work in software configuration management and empirical studies of programmers. Before Karlsruhe, he was an assistant professor at Purdue University. He holds a Ph.D. in computer science from Carnegie-Mellon University. In his spare time, he plays the grand piano.
2022 Copyright held by the Owner/Author.