In this discussion, we do not consider the issue of eliminating software bugs; rather, ee assume that the code is correct and the faulty behavior is only due to transient faults affecting the system. The software fixes here aims to supplement the intrinsic error detection mechanism of a processor. Fault tolerance is the ability of an application system to perform its function correctly even in the presence of internal faults. Today, fault tolerance (FT) is a much researched subject. Essentially, a system fails when it cannot meet its promises. Transient fault occurs once and then disappears. An intermittent fault occurs, then vanishes of its own accord, and then reappears and so on at irregular intervals. Many of us in our society cannot always afford to buy a costly computing system. A costly system is expected to be a reliable one because of its built-in redundancy in various components. Many commodity systems use off-the-shelf microprocessors or micro-controllers that may lack Error Correcting Codes (ECC). Electrical surges, transients, alpha particles or cosmic rays etc., often cause multiple bit errors in a memory or in a processor register. As a result, an application fails often. The vast majority of hardware failures in modern microprocessors, especially for memory faults, are because of limited hardware detection in them.
Transient faults (whose presence is bounded in time) are random events. Transient bit errors can be tolerated by re-computing an application afresh. Designing low-cost fault tolerant technique is necessary for future microprocessor systems. The simplest way to detect an error in computation is to execute a code twice with similar input and then to compare both the outputs. An inequality in outputs indicates errors in a microprocessor system. Here, time redundancy is of the order of two. There are other approaches that intend to complement the intrinsic Error Detection Mechanisms (EDM) with a set of carefully chosen software error-detection techniques. Such techniques include Control Flow Checking (CFC), and Assertions. In CFC, the application program is partitioned in basic blocks and a deterministic signature is computed for each block and faults can be detected by comparing the run-time (or observed) signature with a pre-computed and stored (or expected) signature for the corresponding block. The main problem with CFC is to tune the test granularity that needs to be used. The basic idea of Assertions is to insert logic statements at different points in the program to reflect invariant relationships between the variables of the program. It may lead to different problems, since assertions are not transparent to the programmer and their effectiveness depends on the nature of the application and on the ability of the programmer.
Both the CFC and Assertions rely on application semantics. Readers may refer to the work that makes available different fault-tolerant configurations and maintains run-time adaptation to changes in the availability requirements of an application. Other works describe various fault tolerance approaches (Recovery Block and N-version Programming Schemes, Triple Modular Redundancy etc.,) that use design diversity to tolerate software design bugs. Most of the conventional methods to detect faults in microprocessors are based on Parity bits, Cyclic Redundancy Checks and Check-sum or such Error Correcting Codes (ECC) in the memory or in a processor. ECC is useful for detecting and correcting a few bit errors only in memory. Current fault tolerant techniques utilized in commercial systems such as IBM S/390 G5 rely on redundancies. For example, duplicating chips and comparing results implement error checking. This work is intended to tolerate electrical short-duration noises by verifying the intermediate state of the program during its run time. Interested readers may refer to related works as further readings. Triplicate modules, as mentioned in Saha 2003, are used to detect faults as well as recovery thereof. The work described here is primarily based on the scheme as described in the works of M. Rebaudengo et al, 1999 and of Saha 2006.
The basic steps involved in Duplicated Instruction approach are as follows:
As a benchmark, a simple C-language program (as shown in Table-1) meant for finding an average of a set of real numbers is used in this paper. We used duplicated input, output and temporary variables along with duplicated program instructions and the code for comparison of intermediate and final results. The work here does not rely on multiple (design) version of an application. It relies on a single version of program for an application. This is a low-cost solution for fault tolerance and can easily be implemented in our daily life application on affording an extra time and memory redundancy. The overhead on memory space is of the order of 2 here, whereas the execution time overhead is of the order of 1.6 only. Such overhead is an affordable only for designing a reliable computing application. This low-cost simpler approach is suitable for designing reliable commodity systems using ordinary available components only.
The interested reader may consult the following references:
1. D. Pham, "Software Fault Tolerance," IEEE Computer Society Press, 1996.
2. H. Hecht and M. Hecht, "Fault- Tolerance in Software, in Fault - Tolerant Computer System Design," D.K. Pradhan, Prentice hall, 1996.
3. M. Rebaudengo, M. Sonza Reorda, M. Torchiano, M. Violante, "Soft-error Detection through Software Fault-Tolerance Techniques," Proceedings of the International Symposium on Defect and Fault Tolerance in VLSI Systems, pp.210-218, 1999.
4. Goutam Kumar Saha, "A Low-Cost Testing for Transient Faults," ACM Ubiquity, Vol. 7(2), January, 2006, ACM Press, USA.
5. Goutam Kumar Saha, "Software Fault Tolerance Through Run - Time Fault Detection," ACM Ubiquity, Vol. 6(46), December, 2005, ACM Press, USA.
6. Goutam Kumar Saha, "Instruction Replicated Scheme of Fault Tolerant Computing," Proceedings of the International Conference on Modelling and Simulation, MS 2006, Argentina.
7. Goutam Kumar Saha, "Transient Fault Tolerance Through Recovery," ACM Ubiquity, Vol. 4(29), September, 2003, ACM Press, USA.
8. Goutam Kumar Saha, "Application Semantic Driven Assertions toward Fault Tolerant Computing," ACM Ubiquity, Vol. 7(22), June 2006, ACM Press, USA.
9. Goutam Kumar Saha, "Transient Fault -Tolerance Through Algorithms," IEEE Potentials, Vol. 25(5), 2006, IEEE Press, USA.
GUTAM KUMAR SAHA
In his last nineteen years' research & development and teaching experience, Goutam Kumar Saha e has worked as a scientist in LRDE, Defence Research & Development Organisation, Bangalore and at the Electronics Research & Development Centre of India, Calcutta. At present, he is with the Centre for Development of Advanced Computing, Kolkata, India, as a Scientist-F. He is a fellow in IETE and senior member in IEEE, Computer Society of India, and ACM etc. He has received various awards, scholarships and grants from national and international organizations. He is a referee of CSI Journal, AMSE Journal (France), IJCPOL (USA), IJCIS (Canada) and of an IEEE Journal / Magazine (USA). He is an associate editor of the ACM Ubiquity (USA), International Journal of the Latin American Center for Informatics Studies (CLEIJ) and of the International Journal of Computing and Information Sciences (Canada). His fields of interest include software based fault tolerance, web technology, EIS, Ontology Engineering and Natural Language Processing. He can be reached via [email protected], [email protected].
Ubiquity welcomes the submissions of articles from everyone interested in the future of information technology. Everything published in Ubiquity is copyrighted �2007 by the ACM and the individual authors.
A Ubiquity symposium is an organized debate around a proposition or point of view. It is a means to explore a complex issue from multiple perspectives. An early example of a symposium on teaching computer science appeared in Communications of the ACM (December 1989).
To organize a symposium, please read our guidelines.
Ubiquity Symposium: Big Data
- Big Data, Digitization, and Social Change (Opening Statement) by Jeffrey Johnson, Peter Denning, David Sousa-Rodrigues, Kemal A. Delic
- Big Data and the Attention Economy by Bernardo A. Huberman
- Big Data for Social Science Research by Mark Birkin
- Technology and Business Challenges of Big Data in the Digital Economy by Dave Penkler
- High Performance Synthetic Information Environments: An integrating architecture in the age of pervasive data and computing By Christopher L. Barrett, Jeffery Johnson, and Madhav Marathe
- Developing an Open Source "Big Data" Cognitive Computing Platform by Michael Kowolenko and Mladen Vouk
- When Good Machine Learning Leads to Bad Cyber Security by Tegjyot Singh Sethi and Mehmed Kantardzic
- Corporate Security is a Big Data Problem by Louisa Saunier and Kemal Delic
- Big Data: Business, technology, education, and science by Jeffrey Johnson, Luca Tesei, Marco Piangerelli, Emanuela Merelli, Riccardo Paci, Nenad Stojanovic, Paulo Leitão, José Barbosa, and Marco Amador
- Big Data or Big Brother? That is the question now (Closing Statement) by Jeffrey Johnson, Peter Denning, David Sousa-Rodrigues, Kemal A. Delic