Ubiquity, Volume 2022, Number July (2022), Pages 1-7

Ubiquity Symposium: Workings of science: How software engineering research became empirical
Walter Tichy
DOI: 10.1145/3512339

Software engineering was recognized as its own type of engineering in the 1960s. At first, the tools and guidelines developed for it were mostly based on common sense, intuition, and personal experience, but not empirical evidence. It took until the late 1990s for researchers in the area to embrace empirical methods. This article is a personal story of how I experienced the maturing of software engineering research into an evidence-based science. I will interpret this development using two competing philosophical concepts, rationalism and empiricism, and describe how pragmatism reconciles them.

I started to study computing in 1971. At that time, computer science was a young and exciting new area. The problem with being young was that many traditional scientists were skeptical whether it deserved the label "science." A widely accepted definition by Newell, Perlis, and Simon was "Computer science is the study of the phenomena surrounding computers." I found it somewhat unsatisfactory, because it did not say what phenomena were worth studying. Coming from Europe, I thought informatics was a better term, because it emphasizes information processing rather than the instrument that does the processing. Regardless of terminology, researchers in this new area were busy figuring out what computers could do. Computers certainly could calculate. But could they also communicate with each other? Could they do medical diagnoses? Typeset? Play decent chess? Interpret an image? Behave intelligently? Those capabilities were among the phenomena to be studied.

I decided to pursue my Ph.D. in software engineering at Carnegie Mellon University (CMU). My thesis proposed a language for assembling software systems from multi-versioned components. It was called a "module interconnection language"; the fancier term "software architecture" hadn't been invented yet. But I had a nagging doubt: Would my new language truly help programmers? My advisor at CMU was against empirical studies. His view was that one should use logical arguments to evaluate solutions to problems. I had no choice but to go along with that. Basically, I argued why a language like mine was necessary, what it needed to say, and why my language constructs were the best among the alternatives I could think of. An implementation demonstrated that the language enabled version control and automatic checking of interfaces. My research attracted a fair amount of interest. Even before finishing my thesis, I was hired as an assistant professor at Purdue University (which almost kept me from finishing). All of that was great, but doubt remained. Was I really contributing to knowledge, and if so, was this knowledge reliable? Was my carefully argued language useful in real-world software development?

I wanted to do research whose results could be trusted. Software researchers were inventing new tools, languages, and methods to make the work of programmers easier. Soon it was becoming impossible to say which approach was best for which task. Differences between the many new tools and languages were by no means obvious. The question of whether a new tool actually helped programmers could not be answered by argumentation. But researchers were doing exactly that—presenting new or improved tools and arguing for their superiority. In a 1995 bibliographic study, I found that 50% of software engineering papers totally lacked empirical validation [1]. Few researchers seemed to realize that advocacy and appeal to authority (their own or someone else's) were not scientific methods.

My aspiration for dependable statements about truth and usefulness was nudging me toward experimentation. But there were hardly any examples of how to go about it, and experimenting with teams of developers on real projects seemed prohibitively expensive.

But then a pivotal moment occurred. I had developed a method called "Smart Recompilation" that prevented unnecessary recompilations after changes. The state of the art was a program called "Make" that used dependencies among files to avoid redundant compilations. By refining the dependencies down to individual declarations I was able to reduce compilations to the absolute minimum. Reducing compilation work was important, because recompiling software could take hours and even days, slowing down software development teams. At one point, I confessed to a colleague I was not sure whether my new technique would actually make a difference in real projects. After some thought, my colleague suggested trying it out on a real project that had a version history. Through a consulting job, I just happened to have access to the version history of a sizable industrial project. Every update of every file had been recorded with a versioning tool called RCS. Today, projects routinely use such tools to record project histories, but not then. I was extremely lucky to get that data. It enabled me to replay the compilation history of the entire project and determine the files recompiled by Make, Smart Recompilation, and some other techniques. I finally had numbers, and they showed that half of Make's recompilations were redundant. Now readers could decide which technique to use, and whether a more sophisticated one was worth the trouble. Nobody had to take my word for it—and there was no need to argue.
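
To make the difference in granularity concrete, here is a minimal sketch in Python. It is my own illustration, not the original implementation; the file names, declarations, and data structures are hypothetical. Make-style checking recompiles every file that depends on a changed header at all, while a declaration-level scheme in the spirit of Smart Recompilation recompiles only the files that use a declaration that actually changed.

    # Illustrative sketch only (not the original implementation): contrast
    # Make-style, file-level dependency checking with declaration-level
    # checking in the spirit of Smart Recompilation. File names, declarations,
    # and data structures are hypothetical.

    # Which declarations each source file uses from each header it includes.
    uses = {
        "parser.c": {"ast.h": {"Node", "new_node"}},
        "lexer.c":  {"ast.h": {"Token"}},
    }

    def make_style(changed_headers):
        """Recompile every file that depends on a changed header at all."""
        return {src for src, deps in uses.items()
                if any(h in changed_headers for h in deps)}

    def smart_style(changed_decls):
        """Recompile only files using a declaration that actually changed.
        changed_decls maps header -> set of changed declarations."""
        return {src for src, deps in uses.items()
                if any(deps.get(h, set()) & decls
                       for h, decls in changed_decls.items())}

    # Suppose only the declaration of Token in ast.h changed:
    print(sorted(make_style({"ast.h"})))              # ['lexer.c', 'parser.c']
    print(sorted(smart_style({"ast.h": {"Token"}})))  # ['lexer.c']

In this toy example, a change to the single declaration Token triggers both files under file-level checking but only one under declaration-level checking; the redundant recompilations the study measured are exactly this difference, accumulated over a project's history.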

Although this was only a case study, it was controlled: Make acted as the control because we reproduced exactly what a real control group of programmers using Make would have experienced. However, as a case study, it analyzed only one project in a single language. For generalizability, additional studies would be needed. The advantage of using a repository was that the study didn't require human subjects—the data were sufficient. It turned out the study was the first example of analyzing a software repository. A whole conference series about mining software repositories came into existence later, as the value of project data was recognized.
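
The replay itself can be pictured as a loop over the recorded changes, counting how many files each strategy would have recompiled. The sketch below is again purely illustrative and uses a made-up change log and dependency data; the actual study replayed RCS version histories of a real project.

    # Hedged sketch of replaying a recorded change history to compare
    # recompilation strategies. The change log and dependency data are made up.
    includes = {"parser.c": {"ast.h"}, "lexer.c": {"ast.h"}, "main.c": {"cli.h"}}
    decls_used = {"parser.c": {"Node"}, "lexer.c": {"Token"},
                  "main.c": {"parse_args"}}

    # Each recorded change: (header that changed, declarations changed in it).
    history = [
        ("ast.h", {"Token"}),
        ("ast.h", {"Node", "Token"}),
        ("cli.h", {"parse_args"}),
    ]

    make_total = smart_total = 0
    for header, changed_decls in history:
        # File-level: every file that includes the changed header recompiles.
        make_total += sum(1 for deps in includes.values() if header in deps)
        # Declaration-level: only files using a changed declaration recompile.
        smart_total += sum(1 for src in includes
                           if header in includes[src]
                           and decls_used[src] & changed_decls)

    print(f"file-level: {make_total} recompilations, "
          f"declaration-level: {smart_total}")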

Unfortunately, not all experiments can be done with pre-existing data. For instance, when mining software repositories, it is impossible to vary parameters, such as which testing methods to use, in a systematic and balanced way. Collecting or producing data to analyze is often the major part of empirical work. If one relies on existing repositories, one has to accept the data as is. This is why studies of software repositories are called ex-post-facto studies.

The recompilation study was pivotal, because it showed me empirical studies could yield results that one could rely upon. I went on to do controlled experiments (with human subjects) on type checking, inheritance, design patterns, assertions, pair programming, test-driven development, requirements translation, programming in natural language, and other topics. I studied the empirical methods used in sociology and began to teach a course on empirical software engineering. Empirical evaluation is now ingrained in all my research. Fortunately, I'm not alone. Many other researchers recognized the need for a scientific approach to software research. With few exceptions, conferences and journals in software engineering now require empirical validation. There is even a journal called Empirical Software Engineering.

But the road there was difficult. Often, reviewers did not know how to evaluate empirical work and rejected papers for the wrong reasons. For instance, they thought empirical studies were uninteresting because they provided no new tools. Wasn't a new tool much more important and exciting than trying out old ones? Also, experiments are never perfect because they occur in the real world, but reviewers would ask for entire experiments to be redone because of minor flaws, not understanding how difficult it is to recruit participants and how long it takes to run an experiment. In conference committees, I found myself fighting for empirical papers and often lost. Another sign of immaturity was that replications of experiments were rare (they still are). When my students and I replicated an experiment about inheritance depth, we obtained results that contradicted the earlier ones. My expectation was that the journal that had published the original study would immediately publish ours, to make sure that wrong or uncertain results would be corrected. I was wrong! The paper was rejected. We even had an explanation for the different results, but the editor didn't want to see it. (With much delay, the paper was eventually published elsewhere.) At one time I was even booed at a conference, and a third of the audience left because I was arguing for more experimentation. In hindsight, this tension can be explained by the conflict between rationalist and empiricist viewpoints.

RATIONALISM AND EMPIRICISM IN SOFTWARE ENGINEERING RESEARCH

I will now use the philosophical concepts of rationalism and empiricism to interpret the development outlined above. Rationalism and empiricism are two opposing views about epistemology, or how to accumulate knowledge, going back to the Greek philosopher Plato (a rationalist) and his pupil Aristotle (an empiricist). The rationalist asserts that all knowledge comes from three sources: intuition, innate knowledge, and logical deduction from intuited propositions. Knowledge accumulated this way is independent of sensory experience and superior to knowledge gained from observation.

The empiricist, on the other hand, claims knowledge comes primarily from sensory experience. Empiricism stresses the role of evidence in the discovery of new concepts, rather than innateness or intuition. Theories stated by rationalists are hypotheses to empiricists, to be tested and justified by observation.

The rationalist and empiricist viewpoints seem to be incompatible and were actually seriously debated, beginning back in the times of Plato and Aristotle. During the 19th and early 20th century, a new philosophy, the philosophy of pragmatism, arose. It held that rationalism and empiricism are complementary. The change in viewpoint has to do with the need for theory, i.e., explanation, and its justification. Suppose we have two different ways of producing software, A and B, and we have ascertained by experiment that A is superior to B. Along comes a new method C. Do we need to redo all these experiments, comparing A to C and B to C? An underlying theory that explains the superiority of A, and which was implicitly tested when we compared A and B, might help find the proper place for C, and perhaps other methods in the future. But where does the theory come from? The theory typically comes from thoughtful observation, intuition, or deduction. Without theory, pure empiricism has difficulties constructing a coherent world view; all we get are scattered observations, without overarching explanations. Pragmatism reconciles these two views: theory is welcome regardless of whether it comes from intuition, deduction, or observation, but experimental tests are required to check whether a theory holds.

But how many tests need to be done? This question was clarified in 1934 by the Austrian philosopher Karl Popper with his concept of falsificationism. Basically, falsificationism says that a theory must be falsifiable. Experiments actually try to falsify theories, rather than prove them. A theory is always provisional because it might be falsified by the next experiment. The longer it withstands falsification attempts, the more we can trust it. When a theory is eventually falsified, the search for a better one can begin, and that search may result in significant progress for science.

After this short excursion into epistemology, I can now illustrate what an empiricist was up against in the 1980s and '90s, namely the rationalist establishment. My advisor's attitude was rationalist, because he encouraged me to use logical arguments to evaluate my work. This is not surprising, since he was trained as a mathematician, and mathematicians are rationalists (and properly so). Many of the founders of computer science were mathematicians; obviously, they were more familiar with logical deduction than with experiment. For example, David Parnas, a pioneer in computer science, takes a rationalist viewpoint when he writes that "what can be learned from empirical studies, while important, is very limited" [2].

In my experience, rationalists would often exclaim "I knew this all along" or "I could have told you so" when presented with experimental evidence. However, not everything that was thought to be correct turned out to be so. A number of tenets in software engineering failed when tested: for example, the reliability of n-version programming was greatly overestimated, pair programming is not as advantageous as claimed, and object-oriented models of software are not helpful during maintenance. (These topics deserve more discussion in another essay.) In the seventies I heard engineers say something like: "We built this computer, so we know exactly what it does, and there is no need for experiments to figure out how it works." This argument didn't last, because computers became extremely complex owing to pipelines, caches, and other performance features. Only experiments with benchmarks could yield useful comparisons among computer architectures. Slowly but surely, researchers began to see that it was necessary to back up claims with evidence.

From my experience in program committees, I can say that rationalist attitudes were shared by many reviewers. Some argued that software engineering experiments were too expensive, useless, or even harmful. (Excuses like these have been debunked [3].) It took about 25 years for research in software engineering to become evidence-based. Obviously, it took time to learn and adapt experimental methods to software engineering. There are lots of variables to control, such as the knowledge and experience of subjects, familiarity with different software types, the size of the software, students versus practitioners, and many others. But with time, researchers found ways to handle these difficulties and tackle more and more complex questions. Today, we have a new generation of computer scientists for whom empirical studies are the norm. Research has reached an equilibrium between rationalism and empiricism called pragmatism. This is not to say that everything is perfect. New problems include a lack of replication of experiments, the file drawer effect (negative results not getting published), p-value hacking (massaging the data until statistically significant results are obtained), and making up hypotheses after the results are known (rather than stating hypotheses at the outset and then checking them). But the need for empirical studies is no longer questioned.
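
To see why the last two problems matter, consider a small simulation. It is my own illustration; the crude statistic and threshold below are rough stand-ins for a real significance test. When many unrelated hypotheses are tested on pure noise and only the "significant" ones are reported, spurious findings are almost guaranteed.

    # Illustrative only: test many "hypotheses" on pure noise and count how
    # many clear the significance bar. The statistic is a crude stand-in for
    # a t-test; the threshold is roughly calibrated to p < 0.05.
    import random
    import statistics

    def effect(sample_a, sample_b):
        """Mean difference in units of the average standard deviation."""
        diff = statistics.mean(sample_a) - statistics.mean(sample_b)
        spread = (statistics.pstdev(sample_a) + statistics.pstdev(sample_b)) / 2
        return diff / (spread + 1e-9)

    random.seed(1)
    trials, hits = 20, 0          # twenty unrelated comparisons, all pure noise
    for _ in range(trials):
        a = [random.gauss(0, 1) for _ in range(15)]
        b = [random.gauss(0, 1) for _ in range(15)]
        if abs(effect(a, b)) > 0.75:   # about 2 standard errors for samples of 15
            hits += 1
    print(f"{hits} of {trials} null comparisons look 'significant'")

Reporting only the hits, or inventing a hypothesis to fit them after the fact, turns chance into an apparent result; stating hypotheses up front and replicating experiments guard against exactly this.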

Acknowledgements

Peter Denning's and Robert Akscyn's remarks greatly improved this essay. Thank you both!

For more information about rationalism, empiricism, pragmatism, and falsifiability, consult Wikipedia.

References

[1] Tichy, W. F., Lukowicz, P., Prechelt, L., and Heinz, E. A. Experimental evaluation in computer science: A quantitative study. Journal of Systems and Software 28, 1 (1995), 9–18; doi: https://doi.org/10.1016/0164-1212(94)00111-Y.

[2] Parnas, D. L. The limits of empirical studies of software engineering. In Proceedings of the 2003 International Symposium on Empirical Software Engineering (ISESE'03). IEEE, 2003, 2–5.

[3] Tichy, W. F. Should computer scientists experiment more? Computer 31, 5 (1998), 32–40; doi: 10.1109/2.675631.

Author

Dr. Walter Tichy has been professor of computer science at Karlsruhe Institute of Technology in Karlsruhe, Germany, since 1986. His research interests include software engineering, parallel computing, and artificial intelligence. He is best known for his work in software configuration management and empirical studies of programmers. Before Karlsruhe, he was an assistant professor at Purdue University. He holds a Ph.D. in computer science from Carnegie Mellon University. In his spare time, he plays the grand piano.

2022 Copyright held by the Owner/Author.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.
