acm - an acm publication
Articles

Software-Based Fault Tolerant Computing

Ubiquity, Volume 2005 Issue November | BY Goutam Kumar Saha 

|

Full citation in the ACM Digital Library

This paper describes how to design a software-based fault tolerant application using microprocessor (MP), in order to tolerate the burst errors in memory. This approach may be called a single -- version scheme (SVS). The SVS relies on a single version application program which is enhanced with self-checking code redundancy to tolerate memory burst errors that are difficult to correct during the run-time of an application. Conventionally, the other software based approaches can detect a few bit errors (in memory) only towards fail-stop kind of fault tolerance against transient bit errors. Reed Solomon codes are mainly effective for burst errors in coding of audio Compact Disks at offline only. The proposed online technique does not need multiple versions of software and multiple machines. This approach employs only two copies of the application software running on one machine only. Two copies of the enhanced version version of an application are used here for online error detection and tolerance thereof as well. This is an effective low-cost online tool for hardening a microprocessor-based industrial computing system or for on-chip DRAM applications using an affordable code and time redundancy against the burst errors in processor memory. The SVS aims to provide a non-fail-stop kind of fault tolerance against burst errors. This approach supplements the Error Correcting Codes (ECC) in memory system also, against both the transient and permanent bit errors in memory.


This paper describes how to design a software based fault tolerant application using microprocessor (MP), in order to tolerate the burst errors in memory. This approach may be called a single – version scheme (SVS).  The SVS relies on a single version application program which is enhanced with self-checking code redundancy to tolerate memory burst errors that are difficult to correct during the run-time of an application. Conventionally, the other software based approaches can detect a few bit errors (in memory) only towards fail-stop kind of fault tolerance against transient bit errors. Reed Solomon codes are mainly effective for burst errors in coding of audio Compact Disks at offline only. The proposed online technique does not need multiple versions of software and multiple machines. This approach employs only two copies of the application software running on one machine only. Two copies of the enhanced version of an application are used here for online error detection and tolerance thereof as well. This is an effective low-cost online tool for hardening a microprocessor-based industrial computing system or for on-chip DRAM applications using an affordable code and time redundancy against the burst errors in processor memory. The SVS aims to provide a non-fail-stop kind of fault tolerance against burst errors. This approach supplements the Error Correcting Codes (ECC) in memory system also, against both the transient and permanent bit errors in memory.

Keywords: Burst errors, multiple byte-errors, fault tolerance, single-version software.

 

1. Introduction:

 

The objective of this proposed approach is to design a low-cost robust industrial computing system that can detect  and  tolerate burst errors ( 7 bytes onwards where on an average, a MP instruction is of about six bytes long). At the same time, it is also useful against transient bit errors as well as permanent bit faults (whose presence is assumed to be continuous in time) in a memory during the run-time of an application. The vast majority of  hardware  failures in modern microprocessors (MP), especially for memory faults ( for example, bursts of errors or multiple byte errors or random bit-errors), are because of the limited hardware detection in them [1]. Transient faults (whose presence is bounded in time) are random events. Transient bit errors can be tolerated by re-computing an application afresh. Conventionally, with the use of two copies, we can  detect bit errors only. But, the SVS approach uses two copies of an application for detecting and tolerating burst errors in memory. Though, memory has Forward Error Correction (FEC) or Error Correcting Codes (ECCs) (e.g. Parity bits, Hamming Code, BCH, and Cyclic redundancy codes in which bits are interpreted as coefficients in a polynomial etc.) that are capable of detecting and correcting a few bit errors on using both code and  high time redundancy. For example, BCH (63,45) can correct only 3 errors in a  45 information bits. CRC - 32 code detects any single - bit, all double - bit, any odd number of errors, and error bursts of 32 bits errors. In   general, CRC can detect burst errors up to  length < number of redundancy bits. However, CRC (polynomial codes) take high processing time to calculate  some function  y = f(m), where m is the message data, for coding and decoding. Again, in CRC, there is a chance to have false negative test for error. Though CRC is more complicated than parity or checksum (that is,  computing the sum of all words in the application memory space before the application starts and re-compute the sum to validate with the earlier sum), it can be  implemented in hardware. The Reed-Solomon Codes (based on Galois Fields on (2 8 ) with modulo-operation ) are useful for correcting limited number of burst errors, but they are used at offline only because of  its slower decoding and encoding speed. Burst errors can also be controlled through a  memory interleaving scheme. In order to make sure that a burst of errors does not affect consecutive bits, we need here to scramble the bits in interleaving. Here, bit failures are scattered, and  there exists  high probabilities that correction bits have survived. Here, a completely incorrect batch of bytes affects a large number of samples (to be played out), but only in a minor way. However, such scheme bears a lot of housekeeping code redundancy along with  high time redundancy for assembling the scrambled bytes. Again, the Cross Interleave Reed Solomon Code or CIRC (that is, interleaving of data  and a  CRC like error correcting code) can recover a burst error of greater than 4000 consecutive bits - about 2.5 mm on a CD disc. But, CIRC is usable at offline  only.  Again, such conventional error correcting codes (ECCs) such as BCH or Reed Solomon codes are known to be inadequate for on-chip DRAM applications. The encoding and decoding circuits of these codes employ a linear feedback shift register (LFSR), which, if used in a DRAM chip, will introduce very high access delay. Owing to this constraint, Hamming or extended Hamming codes are normally used. The key to the Hamming code is the use of extra parity bits to allow the identification of a few bit errors. The Hamming code with the Hamming distance of 3, can correct only single error. Hamming codes are capable of correcting one error or detecting two errors. The conventional ECCs in memory system, are effective for few bit errors in memory only. ECCs are not effective for online burst error detection in memory. Thus, a software based online approach for tolerating burst errors is much needed. The proposed  technique also aims to be useful for on-chip DRAM application .

 

In the proposed approach, we have partitioned an application code into a  number of basic blocks (that is, branch-free parts of code). We insert an extra  No-Operation (NOP) instruction after each useful instruction in an application program. Each block is preceded by an error checking code. Each block is checked by its error checking code. Before proceeding to execute a block of code, we check for the correctness of a block-code by checking the correctness of  the extra No-Operation (NOP) code as inserted after each useful instruction, and,  by comparing the computed checksum with the previously computed and saved checksum.  If a block is detected as an erroneous one (that is, one or many  NOP code or other adjacent useful words or an  error checking code in the block get corrupted),  we check the other similar block in the extra copy of an application and if it is correct, we proceed further to execute it and thus, we avoid re-computing an application afresh. It is obvious that if one or many NOP codes in a block, get corrupted, then  adjacent useful words ( instructions or data) also get corrupted (with high probability) in the burst errors affected memory.   For a microprocessor with an instruction with more or less N bytes long, this method can detect burst error of (N+1)  bytes onwards. As, we  execute a similar spare correct block of code ( when a block is detected erroneous), we tolerate transient faults also, without re-computing the same copy afresh.  At the same time, if  there is a permanent bit error at the memory space of  an affected block of code (in a copy of an application), we tolerate such errors by executing the spare similar block of code that is,  the spare block of code in the 2nd copy of an application. In other words, we skip the memory space that got affected by burst errors or transients or permanent bit errors, and then we execute the duplicate block of code after it has been detected (by the error checking code) as an error free one.  It is obvious that if a NOP code does not get corrupted but, an adjacent useful word gets affected by a few bit flip errors, then an added checksum mechanism at each block's error checking code,  detects  a few bit errors. Checksum or such Error Correcting Codes (ECC) or Error Detection Mechanism (EDM) in the memory or in a MP, are useful for detecting and correcting a few  bit errors only in memory. Conventional ECCs are not effective for online  detection and correction of memory burst errors, but they are effective for single or few bit flips in memory. That is why, we augment our approach with an additional  checksum mechanism [11] at each block- error checking (EC) code. Thus, bit errors in the useful instructions, data or in a block's error checking code itself, are also detected by this SVS approach, and we tolerate such errors by executing the spare similar block in the spare copy of an application. Thus, we  avoid reloading and re-starting an application for re-computing it afresh. It is well known that  re-loading  and  re-computing are normally useful for tolerating transient bit faults only.  But, our aim is to online detect and tolerate burst errors only.  This SVS approach does not aim to correct errors. Such error detection is also possible by comparing bytes of a copy with its duplicate [14]. But, we cannot tolerate fault here.  We can also gain error tolerance through recovery by using triplication (massive redundancy).   Thus, the proposed (SVS) approach supplements a MP's EDM or a memory-ECC mechanism for tolerating operational faults that may occur during service delivery of the use phase. Though transient errors affecting the memory are tolerated by the ECC of the modern memory, the proposed approach is useful mainly for detecting and tolerating burst errors, as well as for transient, permanent bit errors in memory. SVS is aimed to supplement ECC in tolerating few bit errors in memory. Because, the ECC of the memory can only detect and recover few bit errors in memory.

 

We should accept that, relying on software techniques for obtaining dependability means  accepting some overhead in terms  of increased size of code and  reduced performance. Fault tolerance is the ability of a system to perform its function correctly even in the presence of internal faults. Transient faults occur once  and then disappear. If the operation is repeated, the fault may go away. A permanent  fault is one that continues to exist until the faulty component is repaired. Software Fault   Tolerance  is   the reliance on “Design Redundancy” to mask residual design faults present in  software program. Current fault tolerant techniques utilized in commercial  systems such as IBM S/390 G5 [1,2,3]  rely on  redundancies. For example, error checking is implemented by duplicating chips and  comparing results. These  techniques  need  two times  or more hardware overhead.  In addition, the duplicate  and compare is adequate  for error detection only. Hence, low-cost  fault tolerant technique  is necessary for future microprocessor systems. This paper describes an economically very important  method to harden a microprocessor-based  system  against burst errors,  multiple bit-faults, permanent and transient bit errors  by acting on software only. 

 

1.1  The Proposed SVS  Mechanism

 

The proposed SVS technique adopts the strategy  of  self-checking and self-controlling programming  that  relies  on code and time redundancy. Unlike the triple modular redundancy schemes, this approach does not aim to eliminate software bugs.  We assume that the code is correct, and the  faulty behavior is only due to memory burst errors affecting the system.  In  this technique, we insert a one-byte “NO-Operation (NOP)”  instruction code after each useful instruction in the basic processing logic of an application. We consider the fault model of  at least seven or more contiguous byte-errors or multiple bit flip or  a burst of  modified bits. We know those memory locations or offsets  where the extra NOP –code have been inserted. Execution of  NOP code does not alter the status flag bits in a processor status word (PSW) and it  provides a delay only.  The correctness of these  additional  NOP codes and  checksum are  checked  before the program control goes forward to execute the adjacent  application – instructions in a basic block of code.  As soon as, either  a   NOP code-corruption  or checksum-mismatch error is detected by the proposed error checking code,  the  program control   skips (on noting down the erroneous block’s identity on a memory variable namely, PREV_EC) the  corrupted  block  of an image of  an application and, then it goes forward to verify for errors and to execute a similar block at the other spare image of  an application.  We consider  that at least two   copies or images  of  an application  reside on memory for tolerating  such potential burst or  bit-errors. However, for fault detection, the proposed SVS needs only one copy of an application.  An application image consists one or more basic blocks of code.  We keep one error checking code at each block of  code in an image of  an  application. In other words, each  block of code in an application image is  preceded  by an  error-checking or error-processing code  that detects burst errors by  checking for  the  correctness of  the extra  NOP  codes  that are inserted at  a block (for detecting burst errors),  as well as, it verifies for  the correctness of other code through checksum.    The blocks are identified in a such manner that  an inter-block  branching  does not  arise.  Each block of code can also be implemented by a procedure. For  a procedure- call, we need to keep an error-checking code that precedes the instructions of a procedure. In  procedure also, we need to insert a NOP instruction after each useful instruction in a procedure.  As a result, each block-size may not be same.  Similarly, each error-checking  code, not  necessarily, may cover  the same number  of  useful instructions, because of  the possible existence of different block -sizes that depend on the branching displacement. In other words, if  a jump instruction (say, at  location L)  branches to an instruction (say, at location M), then  we say the branch displacement as  the absolute of  (M-L) and in such case, we keep  the instructions at  locations L through M  in  the same block of code only, in order to prevent execution of  unchecked block-code.  Such SVS scheme does not allow execution of  instructions in a block unless they are checked as error-free by that block’s error checking code. The aim of this proposed scheme is not to eliminate software bugs nor to correct errors; rather, it  aims to mask or skip the erroneous block of code for designing a non-fail-stop kind of fault tolerant system against burst errors. In other words, unlike the other single-version, software implemented fault tolerance schemes that detect few bit-errors and,  that rely on restarting or resetting a system (fail-stop kind of fault tolerance) against a fault model of a few transient bit errors in memory, the SVS approach can tolerate few faults  to go ahead with the execution of  an application further (non-fail-stop kind of fault tolerance) until the application crashes when no spare correct similar block of code exists.

 

2. Previous Works:

 

In order  to achieve  ultra reliability in computing ,  it is  necessary to adopt  the strategy  of defensive  programming based on redundancy ( i.e. fault - tolerant software), e.g.,  Recovery Blocks (RB) [4,9], N Version Programming  (NVP) [5].   In RB, the acceptance test condition  is  expected to be  met by the successful execution of either    the primary  module or the alternate modules. When an acceptance test detects a primary module failure, an alternate module  executes. If all alternate modules  are exhausted, the system crashes.  In NVP,   N number    of    variants    or      alternates    run simultaneously on N different machines  and at the end of program , the results are voted upon to find  an answer in majority and it is  considered  as a correct result. If no consensus  result  is found,  then the NVP system crashes.  However, both  RB and  NVP   need   multiple versions  of  software to be developed independently using different languages, tools etc.  In reality, designing  one version of reliable software is   itself   a very costly and challenging task. Again, designing multiple versions of software  is  found to be very expensive and beyond reach for many  low cost applications.   However, in critical systems with real - time deadlines, voting at program’s end may not be acceptable. This scheme requires synchronization of  the various software   versions     at comparison points.  The RB scheme needs  f+1 number  of  alternates to tolerate  f  sequential faults. The NVP scheme needs  f+2 number of  alternates  to  tolerate  f  sequential faults.  The various single-version software implemented fault tolerance (SIFT) schemes, for example, Algorithm Based Fault Tolerance (ABFT) [6], Assertions [7, 17, 18], and Control Flow Checking [8] are  meant for  supplementing the  intrinsic error detection mechanisms of  a  microprocessor system only for designing fail-stop kind of fault tolerance against the fault model of transient bit errors in memory. ABFT is suited for applications using regular structures. Its applicability is valid for a limited set of problems. Therefore, it lacks of generality. The use of logic statements or assertions at different points in the program that reflect invariant relationships between the variables of the program can lead to different problems. Because, assertions are not transparent to the programmer and their effectiveness largely depends on the nature of an application and on the ability of a programmer.  Again, the success of  Control Flow Checking largely depends on partitioning an application program in basic blocks (branch - free parts of code). For each block, a deterministic signature is computed and errors can be detected by comparing the run-time signature with a pre-computed one. In most of the control flow checking techniques, one of the main problems is to tune the test granularity that should be used. In procedure duplication (PD) [19], a programmer decides to duplicate critical procedures and to compare the obtained results for detection of transient bit - errors. Here, a  programmer has  to define a set of procedures to be duplicated and to introduce the proper checks on the results. So, PD approach is useful to detect a few bit errors only, towards fail-stop fault tolerance through re-starting an application. Such  PD is not effective for burst errors because a procedure which is affected by bursts of errors, may not produce result because of hanging or wrong memory accesses or other exceptions etc., and because of no results, we cannot compare the results, and as a result, such PD cannot detect burst errors. These SIFT techniques that basically rely on a set of carefully chosen software detection techniques, aim towards detection of  few bit - errors in memory towards fail-stop kind of  fault tolerance through system reset and  they lack of generality and  applicability.  Row-checksum based fault detection and tolerance has been discussed in the work [11].  Interested readers should refer to other important works on hardware or software implementations of time-constrained and reliable embedded systems [10, 12, 13] also. The proposed SVS approach is able to detect (by its error checking code) the burst errors (through checking the NOP codes) as well as  bit-errors (through checksum) in a block of code that consists  N number of useful instructions and N number of  inserted NOP code, and  it  tolerates faults through execution of a  similar, spare and correct block of code to go ahead with the application (unlike a fail-stop kind of  fault tolerance in a SIFT) without resetting the system. However, the proposed SVS converges to a fail-stop approach when there is no more spare and correct block.  Conventional SIFT techniques are useful for the fault model of few bit errors only. But, the proposed approach works well against the fault model of burst errors (even all bit errors in the NOP instructions that get detected by the error processing code),   as well as, against a few bit errors in memory that affect a NOP code or useful instructions or error processing code, detected by the inclusion of checksum at an error processing code. Like any conventional SIFT ,or triple modular  redundancy (TMR) based fault tolerance schemes, this SVS  approach  also cannot claim to be free from an overhead   on  code  and  execution time redundancy.  Execution time redundancy as observed in SVS, is  about 2.7 times  the basic application code without  any  software fix for fault tolerance. If an enhanced application program is partitioned into three basic blocks, then we can tolerate three faults using more time (3.3 times the basic application code without fault tolerance) overhead.

 

3. Description of Works 

 

The  proposed SVS  technique  relies on two copies  or images of an enhanced application program (partitioned into two basic blocks), in order to  tolerate  two  faulty non-identical blocks. Here, both the blocks at  either copy of  an application can be faulty, or  one faulty block is at one copy , and another faulty block is at a non-identical block of the duplicate one. The  application is  enhanced  by  inserting  “No Operation” (NOP) instruction  code  after  every basic - application – instruction  code ( say,  Ii ). The selection of NOP-code  for  insertion, is  guided  by  the  fact  that  it  is one byte long  and  normally it incurs a delay of  one machine clock only  without  changing  the flag bits in  the  processor-status-word (PSW).  During  the run-time of the application, the self-checking error-processing-code  verifies for the possible  NOP – code – corruption (at those known  locations where extra NOP codes have been inserted)  in  its block,  and  compares the computed checksum with the pre-computed checksum.  If errors in extra NOP-codes in a block are detected, or if a checksum-mismatch  is detected, then the  program control  branches to the error-processing code of the similar block of code at the other spare image, for executing that spare block after it  is detected as free of error.  Again,   errors  at extra NOP-codes ascertain the possible errors in their adjacent application-instructions also, if we consider the fault model of  seven or  more contiguous byte-errors or burst errors. Checksum is employed in order to detect transient or permanent bit errors at the useful instructions, data and at the error-processing code itself.  The  SVS  technique, relying on two images,  is  explained  in the following  steps.  The notation,    Lij  , in general, denotes  the memory- location of the i-th  NOP – code (i.e. inserted one) in  the  j-th  image of the application.  The notation ECi j denotes the  error processing code for the  i- th block  in the  j-th image of an application. The ECi j error-processing code is  intended to check the  code at the i-th block of the j-th copy or j-th image say, Bi j. The symbol  "[  ]" denotes the byte-content.  The images execute on global input, output variables. Another global variable say,  PREV_EC  is used to note the most recently (earlier to the present block’s error code) executed error-processing code that had detected errors positive in its block, just earlier to the present block’s error-processing code which is being executed. In other words, PREV_EC  denotes the most recently skipped block because of detected errors in that block. For an example, if the execution flow skips the erroneous m-th block  and  enters the error checking code of  the n-th block  that is  being detected as erroneous also, then  PREV_EC holds  the m –value only.   We use the symbols  "/* " and  "*/ "   to enclose general remarks.   

The following  steps  describe  the proposed SVS scheme  that  relies on two similar images of an application.

 

Image1

/* For understanding the SVS  scheme, we have taken two images of an application. For simplicity, we have considered  that each image consists   two  basic blocks  only. Each block  is preceded by  an error-processing code. */

/* EC1, the error-processing code of the 1st block  of  the 1st  image (beginning at location say,  LE11 ), covers the useful application instructions  I1   through  I6   along with  six extra NOP instructions in  the  B11  block.  */

/* For simplicity, we may consider that a block of code contains  six useful application -instructions and  six  extra  NOP instructions. In other words,  each  error-processing code  detects errors  in total  twelve instructions only. */

LE1:

 

 If ( [L11] == [L21] ==[L31]==[L41] == [L51] ==[L61] ¹ NOP-code) OR Checksum ¹  Pre-computed Checksum then:       /* The EC1checks for errors in the 1st block of the 1st image or B1.  If B11 is checked error positive, then note it at PREV_EC. */

                             Set  PREV_EC = 11   /* i.e., the EC of the 1st block of the 1st image has detected errors in B11   */

                              Jump  LE 12    

                  /* Branch  to  the Error-processing code of the 1st block of the the 2nd  image (i.e., EC12 ) that

                   begins at location say, LE 12   in the next spare Image2  for  attempting  a similar block)  */

                      /* Most recently, the B 11 block in Image1  was detected faulty and  skipped, so,  check

 for errors  in the similar and spare block B 12  by executing  its  error processing code  i.e.,  EC12  that begins from location say, LE12 , in an attempt to execute - the spare and similar instructions ( i.e.,  block  B12  )  at the next spare image or, the 2nd  image. */

               End if       

 /*1st block of  instructions begins here; its errors are detected by the above Error processing code (EC11 ) at the 1st  image  */

/*  1st  block of application code  in the 1st  image ( say, B11   )  begins here */

               I1                                    /* 1st instruction-code of  the 1st block  of  the 1st  image of an application */

L11             NOP                             /*  Extra inserted  NOP code */

                    I2                                    /* 2nd instruction- code of  1st  image of an application */

L21             NOP                                /* inserted NOP- code */

               I3

L31             NOP

               I4                                    /* 1st application- instruction- code */

L41             NOP

                    I5

L51             NOP                                /* inserted NOP- code */

               I6

L61             NOP                                    /* 1st block of the 1st    image ends here */

/*   Error-processing code of  the 2nd  block  of  the 1st  image */

LE2:

If ( [L71] == [L81] ==[L91] ==[L101] == [L111] ==[L121¹ NOP-code) OR Checksum ¹  Pre-computed Checksum,  then:

    Set  PREV_EC = 21  

/* i.e.,  the   EC of  the 2nd  block of the 1st  image  or      tests  the  B 21  as         erroneous */

   Jump LE 22

/* 2nd Block of  Image1   is faulty,  so  try the 2nd block of the 2nd Image of an application */

 /*Jump to the error-processing code of  2nd  block  of  the 2nd image.  EC22 starts at location say,  LE 22 in Image2    */

End if

/* 2nd block of instructions (B21 ) of the 1st  image begins here; it is covered  by  the above Error processing code (EC21 )  in the 1st image */

               I7

L71             NOP

               I8

L81             NOP

               I9

L91             NOP

               I10

L101           NOP

               I11

L111           NOP

               I12

L121           NOP                 /* 2nd  block of instructions in the 1st  image (say,  B21   )  ends  here */

/*  An application  can have more than  two basic blocks also. */

             END                                      /* End of  the  Image1   */

       /*Extra (on an average, the number of bytes in a block in an application) NOP instructions in between two   images can be inserted to tolerate a wrong program flow out of the Image1, Image2 */

NOP       /* Extra NOP codes at the end  of  an  image in order to catch some of  the faulty program flow */

               NOP         

               NOP

               Jump LE12

/* Jump  to  the beginning of   Image2  .  Because of faulty program flow,  switch to  next image */

 /* Image2     begins  here */

                    /* EC1, the error-processing code of the 1st block of the 2nd image (beginning at location say,  LE12 ), checks the block of instructions I1   through Ialong with the extra NOP s at  the  1st block of 2nd image or B12  block */

LE1:

 If ( [L12] == [L22] ==[L32]==[L42] == [L52] ==[L62] ¹ NOP-code) OR Checksum ¹  Pre-computed Checksum, then:

               If  3rd  Image exists, then:

    Jump  to  the  Error-processing code of  the 1st  block  of  the 3rd  image.

    /* if next image exists, then branch to the corresponding Error Processing Code of the 3rd image

                          i.e., the EC13  that begins at  location say, LE 13   at the next  Image3 , in order to execute -

           the corresponding similar instructions ( in block  B13  )  at the 3rd image. */

    Else if  PREV_EC == 11,  then:  

 /* If no spare and correct image does not exist and, if the most recently executed error processing code (having tested error positive) was  say, EC11 ( i.e., PREV_EC = 11) which was executed earlier to the present  block’s error code i.e., EC12  ; then it  indicates that  1st  block of code  of  the 1st  image  was found to be erroneous and,  its spare  similar block  (i.e.,   1st block of code of the 2nd  image) has also been found to be faulty, so application crashes. */

                               Restart     /* Application crashes. Re-load and Re-compute the application. */

                         End if

               End if        

 /*1st block of  instructions begins here; it is covered by the above stated  Error processing code (EC12  )  */

/*  1st  block of application code  in the 2nd  image ( say, B12   )  begins here */

               I1                                    /* 1st instruction-code of  the 1st  block of the  2nd  image of an application */

L12             NOP                             /*  Extra inserted  NOP code */

                    I2                                    /* 2nd instruction- code of  the 1st block of  the 2nd image of an application */

L22             NOP                                /* inserted NOP- code */

               I3

L32             NOP

               I4                                   

L42             NOP

                    I5

L52             NOP                                /* inserted NOP- code */

               I6

L62             NOP                                    /* 1st block of the 2nd    image ends here */

                    /*  Error-processing code (i.e., EC22 ) of  the 2nd block of the 2nd  image begins here */

LE2:

If ( [L72] == [L82] ==[L92] ==[L102] == [L112] ==[L122] ¹ NOP-code) OR Checksum ¹ Pre-computed Checksum, then:

   if  3rd image exists, then:                   /* with three images */

                   Jump to the 2nd error-processing code of  3rd  image (i.e., EC23 starting at location say,  LE 23 ) in Image3

              Else  If    PREV_EC ¹ 21,  then:    /* with two images */

                   Jump LE2 1 

   /*Jump to execute the EC2 1  */  

/*A typical execution flow shows how this SVS approach can tolerate two faulty blocks :  Execute EC1(errors in B11 are detected, so skip B11 ) à execute  EC12 (no error is detected in B12 code, so similar block B12 is executed) à  execute EC2(errors in B22 are detected, so   skip B22 ) à execute EC2(no error is detected in similar and spare block B21 code, so B21 is executed). Thus, two faults or two erroneous blocks are detected and tolerated by using  two copies of the SVS based application, and each application copy has  two basic blocks. */

                 Else:

                   Restart   /* Application crashes.  Restart the application for re-computing */

                End if 

       End if

/* 2nd block of instructions (B22 ) of the 2nd  image begins here. */

               I7

L72             NOP

               I8

L82             NOP

               I9

L92             NOP

               I10

L102           NOP

               I11

L112           NOP

               I12

L122           NOP                 /* 2nd  block of instructions in the 2nd  spare image (say,  B22   )  ends  here */

       /*Extra  NOP codes in between two images are to tolerate a wrong program flow out of the Image2, Image3 */

NOP            /* Extra NOP codes at the end  of  an  image in order to tolerate faulty program flow */

               NOP         

               NOP

               :

               Jump LE12

             /* Jump  to  the beginning of   Image2  */

            /* Because of faulty program flow,  switch to  adjacent image */

3.1.        Errors Effected Execution Flow for Tolerating faulty Blocks:

 

 The following  figure-1  shows  the errors effected possible inter-block, intra-image and inter-image execution flow for tolerating various erroneous or faulty blocks in an application that is  based on SVS scheme.  The string “XC” denotes “execute the correct block”, whereas the string “SE” denotes “skip the erroneous block” . An error processing code (not shown in figure-1) at each basic block detects errors in that block.

 

 

 

 

 

 

 

 

 

 

 


Figure 1.  Errors Effected Execution Flow for Tolerating Faults in SVS  Scheme.

3.2. An Example of  SVS Based Application:

 

/*For understanding the approach we take the example of computing the factorial of  N  using  pseudo code of Intel 8086 MASM. */

/* We incorporate an error processing code at the beginning of a basic block of code, and for  a  large application, we incorporate an error processing code at a location prior to the branch instruction or prior to the procedure code also.  The following SVS based application uses two copies.  Each copy has two basic blocks. Each basic block is preceded by an error processing  code (EC) that checks for burst errors (through checking NOP codes’ correctness) in the block or, a few bit errors in the useful instructions or in the error processing code itself (through added checksum in an EC). The number of instructions in a block may vary from block to block. The more the  number of instructions in a block is, the more delay (because of more checking time) is for switching to a spare block (image) for tolerating erroneous blocks. */  

 

Image 1:  

          /* The Error-processing  code for the 1st block of the 1st image is denoted by EC11.  The symbol  Ú   denotes

              logical  OR ing.    The  symbol  "   denotes  logical  XOR ing.   */

      /* EC11   begins here */

            LVL11:  If ( [L11] " [L21] " [L31] " [L41] " [L51] " [L61¹ 0 ) Ú ( run-time-checksum " stored-

                                               checksum ¹ 0 ), then:                       

                              MOV PREV_EC, 0Bh

/* Set  PREV_EC = 11 */

                           /* Most recently executed Error Processing Code is EC11   */

                              JMP  LVL12   /* Jump to  the beginning of   similar (1st) block of code at the spare 2nd  image  i.e.,

                                            Image2  */      /* 1st block of  Imageis faulty,  so  attempt the next similar image */

                          End if     /* EC11   ends here */

/* B11   begins here */

               MOV  AH, 01 h  ;  Read  the number  N from  keyboard

L11             NOP

               INT 21h                ;Request INT 21h service 1

L21             NOP

               SUB  AL, "0"     ;  Input number in  AL

L31             NOP

               MOV DL, AL

L41             NOP

               DEC DL

L51             NOP                     

               MOV CX, DL      ; Set  the counter register  for iteration

L61             NOP

/* B11   ends here */

/* EC21   begins here */

LVL21:        If ([L71] " [L81] " [L91] " [L101] " [L111] " [L121] ¹ 0 ) Ú ( run-time-checksum " stored-

                                              checksum ¹ 0 ), then:

                             If  PREV_EC =  22 ,  then:          /* If 2nd block in 2nd image is erroneous */      

                                            /* Application Crashes:  no spare image of  similar (2nd) block, because  the 2nd blocks

                                               of  both the images are erroneous */

                                             Restart the application          /* Reload and restart the application */         

                              Else      /*  Try  the spare  2nd block of the 2nd image, if  it is not erroneous */      

                                     MOV PREV_EC, 15h   /* Set  PREV_EC = 21, because 2nd block in 1st image is erroneous */

                                     JMP  LVL22     /* Jump  to  the beginning of   similar (2nd) block of code at Image2    */

                           /* Most recently executed (and detected error positive) Error Processing Code is the EC21    that 

detected errors in the 2nd block of the 1st image of an application */

                               /* 2nd Block of the Imageis  faulty,  so  try the similar spare 2nd block of the 2nd  image */

                              Endif

               Endif

/* EC2ends here */

/* Block  B2  begins here */

LVL1:    MUL  DL             ; AX ß AL * DL

L71             NOP

               DEC  DL

L81             NOP

               LOOP  LVL1      ; repeat through the level LVL1, until the counter register CX  is  decreased to zero

L91             NOP

              MOV DL, AL

L101           NOP

              MOV AH, 02h

L111           NOP

              INT 21 h                 ;  Display factorial

L121           NOP

              INT 20 h                 ; Stop

              END                         

  /* Block B12   ends here */

Image2 :

          /*  The Error-processing  code for the 1st block  of the 2nd  image is denoted by  EC12.  The symbol  Ú   denotes

               logical  OR ing. The  symbol  "   denotes  logical  XOR ing.   */   

/* EC12   begins here */

       LVL12:

If ( [L12] " [L22] " [L32] "  [L42] " [L52] " [L62¹ 0 )  Ú ( run-time-checksum " stored-checksum ¹ 0 ), then:

               If  PREV_EC = 11,  then:   /* 1st block of 1st  image  was erroneous and the similar spare block in

                                                          the 2nd  image is also found to be erroneous, i.e., application crashes. */

             Restart the application  /* Reload and restart the application */

  Else

     MOV PREV_EC, 0Ch         /* Set, PREV_EC = 12 */

     JMP    LVL11           /* Try the similar 1st block in 1st image if it is OK */

  Endif

               Endif

/* EC12   ends here */

 /*  Block B12  begins here  */

               MOV  AH, 01 h  ;  Read  the number  N from  keyboard

L12             NOP

               INT 21h                ;Request INT 21h service 1

L22             NOP

               SUB  AL, "0"     ;  Input number in  AL

L32             NOP

               MOV DL, AL

L42             NOP

               DEC DL

L52             NOP                     

               MOV CX, DL      ; Set  the counter register  for iteration

L62             NOP                   

        /*  Block B12  ends here  */

   /*  Error checking code  EC22  begins here  */

LVL22:        If ([L72] " [L82] " [L92] " [L102] " [L112] " [L122] ¹ 0 ) Ú ( run-time-checksum " stored-

                                              checksum ¹ 0 ), then:      /* 2nd block of 2nd  image is found to be erroneous */

                             If  PREV_EC =  21 ,  then:          /* 2nd block of 1st  image is also erroneous */      

                                            /* Application Crashes:  no spare image of  similar (2nd) block, because  the 2nd blocks

                                               in both the images are erroneous */

                                             Restart the application          /* Reload and restart the application */         

                              Else      /*  Try  the 2nd block of the 1st  image, if  it is not erroneous */      

                                     MOV PREV_EC, 16h   /* Set PREV_EC= 22, because 2nd block in 2nd image is erroneous */

                                     JMP  LVL21     /* Jump  to  the beginning of   similar (2nd) block of code at Image1    */

                           /* Most recently executed (with error positive) Error Processing Code is the 2nd   EC in the 2nd 

                              image of a block, EC22   */        /* 2nd block of  Imageis faulty,  so  execute the spare image */

                              Endif

               Endif

    /*  Error checking code  EC22  ends here  */

/*  Block code  B22  begins here  */

LVL1:   MUL  DL              ; AX ß AL * DL

L72             NOP

               DEC  DL

L82             NOP

               LOOP  LVL1

L92             NOP

              MOV DL, AL

L102           NOP

              MOV AH, 02h

L112           NOP

              INT 21 h                 ;  Display factorial

L122           NOP

              INT 20 h                 ; Stop

              END

/*  Block code  B22  ends here  */

4.  A New Approach to Reliability Analysis & Contribution:

 

Based on the  basic  steps  as  stated in the previous section and the reliability analysis through instructions flow (as proposed by the author),  we  get  the following  possible  program flows for deriving the probability of uncertainty of this SVS scheme.  We have taken the SVS scheme with  two images of an application. Each  image has two block of codes. For each block  of code, we have used one error-processing code. In a typical SVS based application, each basic block consists of  six (N) application-instructions and six (N) extra NOP instructions. We have avoided inter- block  branching. The notation EC11 * indicates error processing code that has detected  various erroneous NOP code (burst errors) or bit errors in the block B1.

 

The all possible Execution Flows ( the SVS with two images only) :

 

i )   EC11   à B11   à  EC21   à B2  ,    ( no error or fault has occurred in the 1st image i.e., one successful execution of an application ).

ii)  EC11 *  (fault is detected in B11 , so skip the block B11  )   à EC12   à B1à EC22   à B2, (  fault in the B1block is tolerated ).  

iii)  EC11     à B11   à EC21 *  ( fault in B2)   à EC22   à B22    ,      ( fault in B2block is tolerated ).  

iv)  EC11 *   ( fault in B1)    à EC12   à B12   à EC22 *  ( fault in B2)   à EC2à B21   ,  (total two faults i.e., one fault at B1 and  another one at B22   are tolerated ).

v)   EC11 *   ( fault in B1)    à EC12 *  ( fault in B1) à Restart,  (application crashes because of no correct spare block and  in such case, restart the application for fail-stop kind of fault tolerance).

vi)  EC11   à B11   à  EC21 *  ( fault in B2)  à EC22 *  ( fault in B2) à Restart,  (application crashes because of no correct spare block and  in such case, restart the application for fail-stop kind of fault tolerance).

 


Table 1 :  Performance  Analysis  of   SVS

                                         | ß                 TIME        REDUNDANCY                         à|

    Block size

   No  Fault

Tolerating  1 Fault

Tolerating 2 Faults

Space Redundancy

(Executable Code Size)

Probability    of

Certainty

6 application-instructions + 6 NOP instructions

 

       O(1.9)

 

       O( 2.08) 

 

        O( 2.6)

 

      O(3.2)

 

        0.84

 

12 application-instructions + 12 NOP instructions

 

        O(2.0)

 

 

       O(2.3)

 

       O(2.7)          

 

      O(3.21)

 

       0.82

24 application-instructions + 24 NOP instructions

 

        O(2.0)

 

        O(2.3)

 

       O(2.71)

 

      O(3.23)

 

       0.79

 

5. Experimental Results

 

To evaluate the feasibility and effectiveness of this proposed SVS approach, we have applied this  approach on a set of  simple  C programs (e.g., multiplication of  two matrices of 8 x 8 integer values, an implementation of bubble sort algorithm on a vector of 12 integer elements, used as benchmarks)  by manually modifying their source code according  to the proposed SVS approach.  Motorola 68040 processor and  the  compiler - Single Step 7.4 by SDS, Inc., have been used with disabling all compiler optimizations when compiling the SVS based application with  two  images.  It is observed that on average, the size of the executable code  is increased by 2.6 times and  the source code grows by 2.9 times. An average  slow-down of  about  2.7  times  is observed for tolerating two sequential faults.   The fault injection environment as described in [15,16] ,  is built around an application board hosting  a 25 MHz Motorola 68040 processor, 2 Mbytes of RAM memory, and some  peripheral devices.  Fault injection is performed exploiting  an ad hoc hardware device  which allows monitoring the program execution and triggering a fault injection procedure when a  given  point  is reached. For experiments,  the adopted  fault model is the  multiple-bit flip into  memory locations.  Bit errors are randomly  generated.  On injecting  total  2000  randomly generated faults ( 500 in  the memory area containing  data, and  1500 in  the memory area containing  the code) in an  application (based on SVS with two images) of  Matrix multiplication of two  matrices composed of  8x8 integer values , out of total 2000 errors  the  number  of   hardware detected  (by  Error Detection Mechanisms e.g., microprocessor exceptions) is  382  and  the  Software detected  (by this SVS approach in case of disagreement among the NOP code or through checksum mismatch )   is   764   and the  fail silent ( not producing  any difference  in the program behavior) is   841.   The rest  13 are the   fail silent  violations (not detected  by  any  error detection  methods).

 

Again, on injecting  total 2000  randomly generated faults ( 500 in  the memory area containing  data, and  1500 in  the memory area containing  the code) in an  application (SVS based with two images) of  solving a series of twenty quadratic equations, out of total 2000 errors, the  number  of   hardware detected  (by  Error Detection Mechanisms e.g., microprocessor exceptions) is  423  and  the  Software detected  (by this SVS approach in  cases of disagreement among the extra NOP instructions or through checksum mismatch at  our error processing code EC)   is   772   and the  fail silent ( not producing  any difference  in the program behavior) is   791.   The rest  14  is for  fail silent  violations (not detected  by  any  error detection  methods).

 

On an average, out of total 2000 errors,  the average number  of   hardware detected  (by  Error Detection Mechanisms e.g., microprocessor exceptions) is  402.5  and  the average number of  Software detected  (by this SVS approach in such case of disagreement  among the extra NOP instructions or through checksum mismatch at EC)   is   768   and the average number of  fail silent ( not producing  any difference  in the program behavior) is   816.   The average number of fail silent  violations (not detected  by  any  error detection  methods)  is  13.5. This is observed that the SVS approach has worked well along with other ECC in memory.

 

This  work  aims  not  to  tolerate  software design bugs , rather  it  aims to design a robust and  a low  cost solution for  tolerating  memory burst errors or bit errors of multiple bit flips during  the execution of  an  application  on  adopting  the code redundancy along with self-checking  code.   It  can tolerate two consecutive faults on using only two copies of an application. Wrong inter-block program flow inside an application is detected when a program control enters a block without executing its error processing code, and  such events are caught by examining the PREV_EC variable.

 

Limitations :

 

The proposed SVS scheme’s effectiveness is limited over the memory burst errors or multiple bit flips in memory.  However, a wrong program flow within a block of code in an application cannot be detected in this SVS approach. Processor register bit errors also remain undetected in SVS.  Register bit errors can be detected by comparing the results on  executing both the copies of an application with similar input. This SVS does not aim to detect register bit errors. Again, the detection of a wrong program flow out of the application image but in between the two copies of an application is limited by the number (on average, it  is the number of bytes in a block of a copy of an application) of extra NOP code, inserted in between the two copies only. Again, an error that may occur in a block after the execution of its error-processing -code remains undetected until the  re-usage of a block. In other words, the fault detection capability of  this SVS scheme is limited over  the  errors in a block that might have occurred before the execution  of  its  error-code only and such errors get detected  only  at the subsequent  reference  of  the  erroneous block.  We rely that such error  which occurs just after the execution of an error processing code, will get detected by a microprocessor's exception handling mechanism. Thus, this SVS along with processor's exceptions is an effective solution to attain non-fail-stop kind of fault tolerance through  hardening a processor-based application at the cost of an affordable overhead on both the time and space.

 

6. Conclusion

 

The proposed SVS technique is a  low  cost  and  an effective  solution  towards  designing  a  fault tolerant application system (against memory burst errors), because  it  does not bear  the  cost  of  designing  multiple independent versions of an application software,  and  the cost of multiple  machines, as required in  N-modular system or NVP system. This SVS technique  has no overhead with the synchronization tasks as  demanded  in NVP system.  This  SVS technique needs  only  two images  of the   enhanced  application software running on one machine only,  in order to tolerate two sequential faults for designing  a non-fail-stop kind of  fault tolerance for the burst errors failure model. Though the targeted fault model  is assumed  to be a burst of modified bits over a few consecutive bytes (or multiple bit flip), this  approach aims to be a useful one for tolerating operational faults of other faults model also. Error correction  is not  incorporated  in  this technique  because of  very high  time redundancy with  multiple byte-error recovery codes. However,  this  SVS technique can be  tailored  to  fit  the  exact  requirement of  a  typical  application system through proper study and at the cost of an additional time for designing such single-version fault-tolerant application system. The overhead of  extra execution time  and  space  redundancy  can  easily be afforded  using an affordable and modern higher speed  processing  system,  in order to gain  such reliable  computation. This SVS technique is also useful for preventing erroneous results caused by corrupted codes. This technique  can be  a  useful  tool for  designing  various computer controlled  application systems and  for hardening a processor-based application by acting on the software  also. The SVS supplements the ECC of a memory system and detects and tolerate memory burst errors also.

 

References:

 

[1]  L. Spainhower and T.A. Gregg, "IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A Historical Perspective," IBM  Journal  of  Research & , Vol. 43, No. 5/6, 1999.

[2]  T.  Sato,  "Analyzing Overhead of Reissued Instructions on Data Speculative Processors," Workshop on Performance  Analysis and its  Impact on Design, held in conjunction with 25th International  Symposium on Computer  Architecture, 1998.

[3] Stephen B. Wicker “ Error Control Systems  for  Digital Communication and  Storage”,  Prentice Hall, NJ, USA, pp.72- 127, 1995.

[4] B. Randell, "Design - Fault Tolerance," in The Evolution of Fault-Tolerant Computing,  A. Avizienis, H. Kopertz, and J.-C. Laprie, eds., Springer-Verlag, Vienna, pp. 251-270 1987.

[5]  A. Avizienis, “The N-Version Approach to Fault – Tolerant  Systems,”  IEEE  Transactions  on  Software Engineering , vol. SE -11, No. 12, Dec., 1985,pp.1491-1501.

[6] K.H. Huang, J.A. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Transactions on Computers, vol. 33, pp. 518-528, 1984.

[7] M. Zenha Rela, H. Madeira, J.G. Silva, "Experimental Evaluation of the Fail-Silent Behaviour in Programs with Consistency Checks," Proceedings of the FTCS-26, pp.394-403, 1996.

[8] S. Yau, F. Chen, "An Approach in Concurrent Control Flow Checking," IEEE Transactions on Software Engineering, vol. SE-6, no. 2, pp. 126-137, 1980.

[9] K.H. Kim and H.O. Welch, “Distributed Execution of Recovery Blocks: An Approach for uniform Treatment of Hardware and Software Faults in  Real- Time  Applications,”  IEEE Transactions on  Computers, vol.38, No. 5, May 1989, pp. 626-636.

 [10] R.K. Gupta, C.N. Coelho, G. De. Micheli," Program Implementation Schemes for Hardware - Software Codesign", IEEE Computer, pp. 48-55, June 1994.

[11]  Goutam K. Saha, “Transient  Fault Tolerant Processing in a RF  Application,”  International Journal – System Analysis Modelling Simulation,  vol. 38, pp.81-93, 2000, Gordon and Breach, USA.

[12]  R.K. Gupta, C.N. Coelho Jr., G.De. Micheli, "Sysnthesis and Simulation of Digital Systems Containing Interacting Hardware and Software Components", Proc. Design Automation Conference, June 1992.

[13] Yervant Zorian, Dimitris Gizopoulos, "Design for Yield and Reliability",  IEEE Design & Test, May/June, 2004.

[14] G.K. Saha, "Designing an EMI Immune Software for Microprocessor Based Application," Proceedings 11th  IEEE International Symposium, EMC'95, Switzerland, pp. 401-404, March, 1995.

[15]   A. Benso, P.L. Civera, M. Rebaudengo, M. Sonza Reorda, "An Integrated HW and SW  Fault Injection Environment for Real -Time  Systems", Proc. IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pp. 117-122, 1998.

[16]  M-C Hsueh, T.K. Tsai, and  R.K. Iyer, "Fault Injection Techniques and Tools," IEEE Computer, pp. 75-82, April 1997.

[17] Goutam Kumar Saha, "Fault Tolerant Computation for a Scientific Application,"  CSI Communications, Computer Society of India Press Mumbai, Vol. 20(5), May 1996.

 [18] Goutam Kumar Saha, "EMP - Fault Tolerant Computing: A New Approach,"  International Journal of  Microelectronic Systems Integration, Vol. 5(3), pp. 183-193,      Plenum  Publishing Corp, USA, 1997.

 [19]  Goutam Kumar Saha, " Software Implemented Fault Tolerance Through Data Error Recovery," ACM Ubiquity, Vol 6(35), ACM Press, 2005.

 

Author’s Biography: In his last seventeen years’  research and development experience, he  has worked as a scientist  in  LRDE, Defence Research & Development Organisation, Bangalore,  and  at  Electronics  Research & Development Centre of  India , Calcutta.  At present,  he  is  with  Centre for  Development of  Advanced Computing, Kolkata, India, as a Scientist-F.  He has also taught post-graduate level Computer Science courses for five years. He has authored around hundred research papers in various International Journals, Magazines and Conference Proceedings including SAMS Journal, ACM, C&EE J., IEEE, CSI etc.  He is a senior member in IEEE, Computer Society of India, ACM and a fellow member in IETE, MSPI, IMS etc. He has received various awards, scholarships and grants from national and international organizations. He is a referee for AMSE journals (France), IJCSA, CSI Journals, IJCPOL and IEEE Potentials Magazine etc. His field of interest is on dependable, fault tolerant computing and NLE.  He is an associate editor of the ACM Ubiquity. He can be reached via sahagk@gmail.com, gksaha@rediffmail.com.



COMMENTS

Hi I can get pdf file for this reached Thank you

— Abdrazak, Fri, 13 Jan 2012 20:37:18 UTC

POST A COMMENT
Leave this field empty