acm - an acm publication


Transient software fault tolerance through recovery

Ubiquity, Volume 2003 Issue September, September 1- September 30, 2003 | BY Goutam Kumar Saha 


Full citation in the ACM Digital Library

A low-cost and effective solution for developing transient fault tolerant application software based on a single-version program


Fault tolerance means that a system can provide its services even in the presence of faults. A system fails when it cannot meet its promises. An error is a part of a system's state leading to a failure. The cause of an error is called a fault. Electrical short-duration noises or transients often cause multiple, random and independent bit errors in a processor's registers and memory.

There are many techniques available for checking validity of data or program code. Some of these use a parity bit, cyclic redundancy checks, checksums or error-correcting codes. Such techniques can detect multiple errors but cannot correct all of them. These hardware based error-correcting codes are capable of both detecting and correcting certain types of limited number of errors. Traditional hardware fixes cannot detect and correct all such random and multiple bit errors caused by potential transients. The conventional N-Version Programming (NVP) approach to fault-tolerant software systems involves the generation of functionally equivalent, yet independently developed and maintained software components, called NVersion Software (NVS) [1].

All these independent components are executed concurrently under a system, called N-version Executive (NVX). It uses a decision algorithm based on consensus to determine final output values. However, the NVS approach fails to provide service against such random bit errors because of the lack of consensus voting output. The objective of this article is to describe an unconventional software technique to detect and recover all such random bit errors caused by transients. It does not need N-Versions of the applications based on design diversity; rather it uses an enhanced version of the application.

Description of Works

The proposed technique needs three images of the application. Say, A 1 , A 2, A 3 are the three images of the application program. The data codes reside on at three different locations beginning from, say, L 1 , L 2 , L 3 locations onwards. Each image is of, say, size of total B bytes only. Initially three pointers say, P 1 , P 2 , P 3 point to the starting addresses of the images i.e., L 1 , L 2 , L 3 respectively. The symbol _ denotes logical ANDing. The steps involved in designing this software fix namely, Fault_Detection_Recovery for generating a robust and an enhanced version of the application, are stated below.

Step 1. N = 0 /* Initialize the variable N to count the byte number */
Step 2. P 1 = L 1
P 2 = L 2
P 3 = L 3 /* Initialize the pointer variables to the starting addresses of three images of the application */
Step 3. B = Size_of_Application_in_bytes /* Total number of bytes in each image is known */
Step 4. If ( ( [P 1] == [P 2] ) _ ( [P 1] == [P 3] ) , Then:
Goto Step 5 /* Initially [P 1] denotes the content of byte at location L 1 */
/* If corresponding three bytes are not corrupted, then verify for the next byte and so on. */
Else If ( ( [P 1] == [P 2] ) _ ( [P 1] �� [P 3] ) _ ( [P 2] �� [P 3] ), Then:
[P 3] == [P 1] /* Corresponding corrupted byte of A 3 is corrected
by P 1 's content */
Else If ( ( [P 1] == [P 3] ) _ ( [P 1] �� [P 2] ) _ ( [P 2] �� [P 3] ), Then:
[P 2] == [P 1] /* Corresponding corrupted byte of A 2 is corrected */
Else If ( ( [P 2] == [P 3] ) _ ( [P 1] �� [P 2] ) _ ( [P 1] �� [P 3] ), Then:
[P 1] == [P 2] /* Corresponding corrupted byte of A 1 is corrected */
Else If ( ( [P 1] �� [P 2] ) _ ( [P 2] �� [P 3] ) _ ( [P 1] �� [P 3] ), Then:
Goto Error_Recovery /* Corresponding bytes of all the three images
are corrupted to different values and then
recover them by copying from the master file
or backup, due to such total damage */
End If

Step 5. P 1 ++ /* Pointers and byte counter N are incremented by one to point the next byte */ P 2 ++ P 3 ++ N ++ Step 6. If N <= B, Then: /* Verify for the end of application code*/ Goto Step 4. /* Verify for sanity of next byte code */ Else: Return /* Return to main application A 1 for continuing */ End If {End of the Algorithm of Fault_Detection_Recovery} Discussion The proposed technique can detect and recover errors in one byte or eight bit random errors. In other words, if any one byte out of the three corresponding bytes, say, at locations L 1 , L 2 , L 3 at the three images of the application A 1 , A 2 , A 3 is corrupted, then the corrupted byte is recovered (as shown at Step 4). However, if all three corresponding bytes are corrupted to a similar bit-pattern to represent another value, then only this technique's effectiveness is reduced. The probability of one particular value stored at three different locations, getting inadvertently corrupted to the same bitpattern representing an another value is 2 -24 only. The probability of such ambiguity causing failure of the proposed technique due to such independent faults at three images is 1 / 2 8 * 1 / 2 8 * 1 / 2 8 = 1 / 2 24 Interested readers may refer to [2,3,4,5,6] for related works on transient software fault tolerance. The procedure discussed above, namely, Fault_Detection_Recovery, can be easily invoked periodically from an application system. However, the frequency of invoking this sanity procedure depends on how noisy (electrically) the industrial computing environment is. System engineers can easily modify this frequency in order to meet the transients' threats. Conclusion The proposed technique is a low cost and effective solution towards development of a transient fault tolerant application software based on a single-version program, with few additional codes for fault detection and recovery. However, system designers can easily afford such redundancy in both space and time, in this modern computing age of affordable, high speed computing machines, in order to gain higher reliability and high transient software fault tolerance. This technique is easier to implement and more economical in comparison to conventional NVP. References [1] A. Avizienis and L. Chen, "On the Implementation of N-Version Programming for Software Fault Tolerance during Program Execution", in Proceedings COMPSAC 77, pp. 149-155. [2] Goutam Kumar Saha, "A Software Fix towards Fault Tolerant Computing", to be published in ACM Ubiquity, vol. 4 , 2003. [3] Goutam Kumar Saha, "Mobile Computing & Fault Tolerance Issues", in Computer Society of India - CSI Hard Copy, Vol. 37, August, 2003. [4] Goutam Kumar Saha, "Transient Software Fault Tolerance in Mobile Computing", in CSI 2003 - 38th National Con., New Delhi, Computer Society of India, 2003. [5] Goutam Kumar Saha, "Transient Fault Tolerant Processing in a RF Application", International Journal - SAMS, vol. 38 , pp. 81-93, Gordon and Breach, 2000. [6] Goutam Kumar Saha, "Transient Fault Tolerance in Scientific Computing", Paper No. 330, ACM Transactions on Computer Systems, 2003.

About the Author Goutam Kumar Saha is with Centre for Development of Advanced Computing, Kolkata, India. Previously, he worked in LRDE, Defence R&D Organisation, Bangalore, India, as a scientist. His field of interest is transient software fault tolerance.


Leave this field empty

2018 Symposia

Ubiquity symposium is an organized debate around a proposition or point of view. It is a means to explore a complex issue from multiple perspectives. An early example of a symposium on teaching computer science appeared in Communications of the ACM (December 1989).

To organize a symposium, please read our guidelines.


Ubiquity Symposium: Big Data

Table of Contents

  1. Big Data, Digitization, and Social Change (Opening Statement) by Jeffrey Johnson, Peter Denning, David Sousa-Rodrigues, Kemal A. Delic
  2. Big Data and the Attention Economy by Bernardo A. Huberman
  3. Big Data for Social Science Research by Mark Birkin
  4. Technology and Business Challenges of Big Data in the Digital Economy by Dave Penkler
  5. High Performance Synthetic Information Environments: An integrating architecture in the age of pervasive data and computing By Christopher L. Barrett, Jeffery Johnson, and Madhav Marathe
  6. Developing an Open Source "Big Data" Cognitive Computing Platform by Michael Kowolenko and Mladen Vouk
  7. When Good Machine Learning Leads to Bad Cyber Security by Tegjyot Singh Sethi and Mehmed Kantardzic
  8. Corporate Security is a Big Data Problem by Louisa Saunier and Kemal Delic
  9. Big Data: Business, technology, education, and science by Jeffrey Johnson, Luca Tesei, Marco Piangerelli, Emanuela Merelli, Riccardo Paci, Nenad Stojanovic, Paulo Leitão, José Barbosa, and Marco Amador
  10. Big Data or Big Brother? That is the question now (Closing Statement) by Jeffrey Johnson, Peter Denning, David Sousa-Rodrigues, Kemal A. Delic