acm - an acm publication

Articles

Software implemented fault tolerance through data error recovery

Ubiquity, Volume 2005 Issue September | BY Goutam Kumar Saha 

|

Full citation in the ACM Digital Library

This paper examines how a new software-implemented data error recovery scheme can be so effective in comparison to conventional Error Correction Codes (ECC) during the execution time of an application. The proposed algorithm is three times faster than the conventional software-implemented ECC and application program designers can easily implement the proposed scheme because of its simplicity while designing their fault tolerant applications at no extra hardware cost. The proposed software-implemented scheme for execution-time data-error detection and correction relies on threefold replication of application data set as a basis for fault tolerance.


1. Introduction

In this proposed scheme, we have used an error-masking scheme as well as a data recovery scheme that corrects the corrupted data. We have used triplicate data model. The proposed approach is suitable for any application that refers to a set of data during its computation time and that has no memory space problem for accommodating triplicate data, whose correctness is very important for producing correct output. Three images of a lookup data of an application under execution are kept in the system memory. Upon execution of the application, three bytes are patched from the three different images stored in different memory segments and compared to check the validity of the fetched data byte. If more than two fetched bytes have the same contents, the byte is considered as fault-free, otherwise, faulty. The proposed scheme is compared to some other available error checking and correcting codes. ECC codes are commonly used for solid-state memory systems for online error detection and correction. CRC codes are for checking the validity of the data over unreliable communication links. BCH codes are another cyclic coding scheme for error detection and correction. RS codes are block-based error correction codes commonly used for massive storage media such as CD, DVD, etc. The proposed scheme aims to supplement the conventional ECC codes. Design of online low-cost software implemented scheme for data error or byte error detection and recovery thereof is always useful for any low-cost reliable application systems. Simplicity of a scheme is equally important for system engineers while employing such scheme in their application systems. Here, we focus on software implementation for an application's data-byte-errors-detection and correction only through an extended triple modular redundancy (TMR) -type scheme and error masking. This technique is applicable to any table look-up scheme. Short duration noises often cause random data corruption in the primary memory of a computing machine. Electrical Fast Transients (EFT), Electrostatic Discharge (ESD), Electromagnetic Pulses (EMP) are the example of short duration noises.  A scientific application often computes erroneous results if it reads bad data from a table.  Often, we take it granted that our program code and data banks are absolutely correct while designing software for an application.  But, it is always not correct because the high- speed processing units are often victimized by short duration noises [1-4].

 

2. Conventional ECC

We all know about the various conventional error detection and correction schemes like Checksums, Hamming Codes, Cyclic Redundancy Checks (CRC), Weighted sum code (WSC) [14], Bose-Chaudhuri-Hocquengham (BCH) codes and Reed-Solomon (RS) codes etc. They are not free from limitations [5,6,7]. The single parity checks can detect only odd number of single bit errors. Any even number of single bit-errors remains undetected. So it is inadequate for detecting any number of errors. CRC is useful for multiple bit-errors detection. Again in CRC, shift register circuits are normally used. The Shift register circuit is for dividing polynomials and finding the remainder. Modulo 2 adders, multiplier circuits are also used. In CRC, the receiver fails to detect the errors when aliasing occurs; the remainder is zero, erroneously. Again, CRC is having high time redundancy and that is why they are normally hardware based. Software based CRC implementation is impractical in a real time application. An h-bit CRC, generated by G(X)=(X+1)P(X), where P(X) is a primitive polynomial of degree h-1. The CRC has maximum code-word length of 2h-1-1 bits and minimum distance d=4, i.e., all double and odd errors are detected. This code can also detect any error burst of length h bits or less. In other words, its burst-error-detecting capability is b = h.  Although the CRC, WSC and Fletcher Checksum can be extended to have infinite length, but their minimum distances all reduce to 2.  Again, a Hamming code is to provide only the single bit-error correction and double bit-errors detection. In a typical Checksum where n bytes are XORed and the result is stored in (n+1)th byte. Now if this byte itself is corrupted due to transients or in the case of even changes, the errors remain undetected by this typical Checksum.  The Bose, Chaudhuri and Hocquenghem (BCH) codes form a large class of error correcting codes. The Hamming and Reed-Solomon (RS) code are two special cases of this powerful error correcting technique. Hamming, BCH and RS codes have nice mathematical structures. However, there is a limitation when it comes to code lengths. These conventional error correcting codes namely, BCH, RS have limitations and there exists very high time redundancy when they are implemented by software. When the check-bits become erroneous, the stored check bits do not match the computation and as a result, a block code fails. In general, checksum schemes fail when they are corrupted enough to transform to another valid code work (the distance of the code). However, the proposed technique is capable enough to detect byte-errors and corrections thereof with an affordable redundancy in both time and memory space. The proposed scheme is three times faster than CRC and BCH. However its space redundancy is more than these popular codes. Interested reader may refer to [7-12] for related works on fail-safe kind of fault tolerance through error detection.

 

3. The Software-Implemented Scheme

The steps or algorithm for data byte error detection and correction thereof is stated in algorithm “DCDE”.

 

Algorithm- DCDE  Procedure

 

The following basic steps show how the DCDE scheme works.

It verifies the corresponding three data bytes at an offset say, f, of the three images of the application data and if any byte error has occurred, then it repairs the corrupted byte. Starting addresses of the three images are known. */

 

 

 

                       

4. Discussion

It is assumed that the starting addresses of the three images of the set of application data are say I1, I2 and I3 respectively.  When the offset f is say, 0 (initial value), then the address I10 denotes the starting address of first image I1 only, because I10 has the value of  (I1  + 0) i.e., starting address plus offset.  In general, if Im be the starting address of the mth image then, the address of the f th byte (or at offset say, f) is shown by equation (1).

 

Imf = Im + f ... ... (1)

 

Again, if any one byte out of the three corresponding bytes of three images at an offset say, f is corrupted, then this DCDE routine can repair the corrupted byte by XORing.  The affected byte is detected by means of comparison of three bytes at the same offset, as shown at step 3 and step 4 of the Algorithm- DCDE.  If two corresponding bytes content are same then XORing result will be zero i.e., 00000000 or 00H. Otherwise, result is a nonzero value. In general if the two byte-contents of mth and nth images at an offset say, f, are say, Imf,  In and  if these  two values are not corrupted, then the following equation  is true.

 

Imf .XOR. I nf = 0 ....... (2a)

Otherwise, if the two byte-contents are not similar, then the equation (2a) will not be satisfied, i.e.,



Imf .XOR. I nf ≠ 0 ....... (2a)

The possibility of getting inadvertently alteration (by transients) of two bytes at distant locations, into a similar corrupted value in order to dissatisfy equation (2b) is almost zero. In other words the chances of byte-errors remaining undetected is


1 / (28) * 1 / (28) = 2 -16

This method is capable of detecting even all 8- bit errors i.e., even an entirely corrupted byte can also be detected.

Again, it can repair all the 8-bit errors. If say, Inf byte is corrupted to In*f, and say, Imf   and   Iof  byte contents are same at an offset f of   two images Im and   Io, and then by comparing three corresponding bytes of the three images, we can detect that Inf byte is corrupted (as shown at step-3 and step-4 of the Algorithm-DCDE).  The corrupted byte is repaired in the following way. This is applicable even for all 8 bit-errors in a byte.


Now, the bit pattern of  (In*.XOR.    Smn )  will  be  1001 1101    i.e.,  In*f   is  repaired.

If there is no error in program and data code then the following equation will be satisfied.

 

Imf=Inf=Iof ........(4a)

The chance of satisfying the equation (4) by the corrupted three bytes of the three images at the same offset is negligibly small, because the transients’ effects on memory, registers are very random and independent in nature.


1/(28)*1/(28)*1/(28)=2-24 .......(4b)

 

In other words, the chances of three bytes at different locations corresponding to a particular value with similar bit pattern, getting altered (due to random effects of transients) simultaneously to a similar value in order to satisfy equation (4c) is negligibly small.


Im*f=In*f=Io*f .......(4c)

Again, the chances of a particular value (of one byte size) stored at same offset in the three images, getting altered to three different values (of different bit pattern), is


1/(224) or, 2-24

 

The above disastrous effect indicates a possibility of memory hardware or permanent errors and the ERROR routine is invoked for necessary recovery thereof. In other words, the possibility of invoking the error routine namely, ERROR  (as shown at step-4) will be negligibly small. 

This is very effective for on-computing application-data error detection and error recovery of the application-data during the life cycle of the application.  After detecting and repairing the entire application data, program control goes back to the main application. Even a totally corrupted image can be repaired by this proposed technique by repairing one byte after another byte.

Space redundancy of this proposed technique is about three. However, because of the lower economic trend on the hardware prices, this much space redundancy can be easily affordable in many applications where space and time constraints are not so stringent.  Little higher time redundancy can also be afforded because of easily affordable high-speed machine. This proposed technique is capable of detecting and repairing any number of soft errors (not reproducible) as well as permanent errors during the run time of an application. Fault detection is inversely proportional to the upper limit of the variable
No_of_Calls_to_DCDE, i.e.,
FD α 1 / No_of_Calls_to_DCDE ........ (5)

Thus, depending on the threat of transients potential, the upper limit value of No_of_Calls_to_DCDE  or the frequency of calls to the DCDE routine from the various points of an application program can be changed in order to meet the application designers' requirement.

 

5. Effectiveness of the proposed Scheme Compared to ECC

The various block codes like BCH, Hamming, RS codes have nice mathematical structures. However, there is a limitation when it comes to code lengths. A bounded distance algorithm is usually employed in decoding block codes and, it is generally hard to use soft decision decoding for block codes. Again such codes are not always adequate for on-chip DRAM applications. The encoding and decoding circuits of these codes employ a linear feedback shift register (LFSR), which, if used in a DRAM chip, will introduce very high access delay. Again such codes are not suitable for detection and correction of a byte that gets entirely corrupted. Software implementation of these codes suffers from very high time redundancy. The code rate (RC ) of BCH (n=63, k= 45) is (45/63) or 0.7, where k = information bits and, n= code length = (k + r), r is the number of parity check bit, whereas in this proposed approach's RC  is  (8/24) or 0.33.   Code efficiency of BCH (63,45) or E = (Number of correctable bits (t) / Code length (n)) = 3/63= 5%, whereas, the proposed scheme's code efficiency is (8/ 24) or 33.33 %.  In a BCH code, the number of bits which can be corrected in an n-bit code word is t = r/m, where m=log2(n+1).    In other words, number of parity bits (r) required to correct t bits = (t * m) = t * log2(n+1). Thus, in order to correct all 8 bits the BCH code length (n) becomes 56 bits where k is 8. So for the worst case when all bits of a byte get corrupted then the space redundancy of BCH (56,8) becomes completely unaffordable. In such case of entire byte corruption, BCH code has the least code efficiency of  (8/56) or 14%, whereas the proposed scheme has the code efficiency of 33.33 %.  The minimum distance between BCH codes is related to t by the inequality

2t +1 £ d min (odd); 2t + 2 £  d min (even). 

In other words, the Hamming code, where n  = 2r - 1, is seen to be a BCH code where m = r, so that t=1. The RS code can also correct burst errors. With 6 check symbols, up to 3 symbol errors and therefore 3 bit errors can be corrected. Again implementation complexity of the coding and decoding scheme (in block codes) increases with increase in code length and its capability to detect and correct errors. The required computational operations increase as the number of parity symbols increases. It is seen that software operation count per byte of the CRC-16 (that are generated by binary polynomials such as X 16 + X 15 + X 2 + 1) is 46.2 and the operation count per byte of the proposed scheme is only 15. It is seen [13] that software operation count per byte tCRC (s,n) = 5.5ns + 3n - g(s), where a code-word consists of n tuples or polynomials and each tuple has s bits. When s=8 and n=64 then ns = 512 bits and g(8)= 52. Then the ratio of the byte operation counts of the CRC and the proposed scheme (including detection and correction of all eight bits of an information byte) is 46.2 / 15 = 3.08 (for only error detection, this ratio is 46.2 / 5 = 9.2, where 5 is the operation count per byte of the proposed scheme for error detection only through comparing bytes only). In other words, the proposed software scheme is 3 times faster (for error detection and recovery of an entire information byte) than the software implemented CRC code with 16 check bits. Error detection is nine times faster here. Code efficiency is also higher than BCH code.

 

5.1 The Overhead Comparison

We all know that the major drawback of error detection and fault tolerance by software- means comes from the increase in execution time and the memory area overhead. On studying over a simple program of Bubble sort of 60 integer values, the overhead factors are listed below in Table 1. It is observed that this software scheme leads to a better performance.

 

Program Approach Time Overhead Factor Memory Overhead Factor
CRC-Non-Distribute >10 <2
Hammimg >10 <3
Proposed Software Scheme (SF) <2.5 <4

Table 1.  The Overhead factors of various software approaches

 

6. Conclusion

The proposed software-implemented scheme is much faster in comparison to the conventional software-implemented ECC and is also easier for implementation for the application designers. This unconventional technique is a cost-effective and an economical one in comparison to the popular ECC in order to detect and repair transient caused byte errors. During the life cycle of an application where an ultra reliability in the computed result is essential, this proposed technique will be very effective at the cost of affordable redundancy in both time and space, without any additional cost on hardware and software development. However, for mobile or embedded devices those operate under extreme space and energy management constraints, such 3-fold space replication may not be suitable right now. It is also a very useful and a low cost tool for gaining dependable computation. It provides high fault detection and reliability in the application data on bearing an affordable extra execution time and memory space.

 

References:

[1] T.Anderson and P.A. Lee, (1981), “Fault Tolerance Principles and Practice”, PHI.

[2] Dimitri Bertsekas, (1989), “Data Networks,” PHI, pp.50-58

[3] Stephen B. Wicker, (1995), “Error Control Systems for Digital Communication and Storage”, PH, NJ, USA, pp. 72-127.

[4]. G.K. Saha, (1994), “Transient Control by Software for Identification of Friend and Foe (IFF) Application, Int’l Symposium Proc. EMC’94, Sendai, Japan, IEEE Press, pp. 509-512.

[5]ManYoung Rhee, “Error–Correcting Coding Theory,” McGraw Hill Publishing Company, USA.

[6] G.D. Nguyen, (1998), “A General Class of Error-Detection Codes,” Proc. 32nd Conf. Information Sciences and Systems, pp. 451-453.

[7] Charles Wang, Dean Sklar, and Diana Johnson, (2001/2002), “Forward Error-Correction Coding,” Cross Link, The Aerospace Corporation Magazine, vol.3(1).

[8]  Goutam Kumar Saha,  (2005)   “A Software Fix Towards Fault Tolerant Computing,”  ACM Ubiquity, Vol 6(16), May, ACM Press, USA.

[9] A.L. Liestman et.al (1986) “A Fault Tolerant Scheduling Problem,” IEEE Transactions on Software Engineering, Vol. SE-12, pp. 1089-1095.

[10] G.K. Saha, (1995). “Designing an EMI Immune Software for a Microprocessor Based Traffic Control System,” presented paper, Proc. 11th IEEE sponsored Int’l Symposium EMC Zurich'95, Switzerland, pp.401-404.

[11] Goutam Kumar Saha, (2005), "Software Based Fault Tolerant Computing,” in press, ACM Ubiquity, ACM Press, USA.

[12] Goutam Kumar Saha, (2005). “Transient Software Fault Tolerance Using Single-Version Algorithm,” ACM Ubiquity, Vol 6(28), August, ACM Press, USA.

[13] Gam D. Nguyen, (2005), "Error - Detection Codes: Algorithms and Fast Implementation," IEEE Trans on Computers, vol.54(1).

[14]  D.C. Feldmeier, "Fast Software Implementation of Error Detection Codes," IEEE / ACM Trans. Networking, vol.3(6), pp. 640-651, Dec. 1995.

 

About the Author

In his last seventeen years’ research & development and teaching experience, he has worked as a scientist in LRDE, Defense Research & Development Organization, Bangalore, and at the Electronics Research & Development Centre of India, Calcutta.  At present, he is with the Centre for Development of Advanced Computing, Kolkata, India, as a Scientist-F.  He has authored about one hundred research papers in various International and national Journals and Conference proceedings etc.  He is a senior member in IEEE, Computer Society of India, a member in ACM, W3C ITS Working Group, and a Fellow member in IETE. He has received various awards, scholarships and grants from national and international organizations. He is a referee for CSI Journal, AMSE Journal (France), IEEE Magazine and IJCPOL etc. His field of interest is on software based fault tolerance, dependable computing and NLP. He can be reached via sahagk@gmail.com or via gksaha@rediffmail.com.

COMMENTS

POST A COMMENT
Leave this field empty