Software-Implemented Fault Detection Approaches

Ubiquity, Volume 2008, Issue May | By Goutam Kumar Saha

This article briefly compares various software-implemented fault detection approaches in tabular form.


Software Based Fault Detection Technique

Description

1. Algorithm Based Fault Tolerance (ABFT): ABFT is a self-contained method for detecting, locating, and correcting faults with a software procedure. It exploits the structure of numerical operations. The approach is effective but lacks generality: it is well suited to applications that use regular structures, so it applies only to a limited set of problems.
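As one illustration of how ABFT exploits the structure of a numerical operation, the sketch below applies row and column checksums to a matrix product. It is a minimal example under stated assumptions, not the method of any particular paper: the function names are hypothetical, the matrices hold integers, and at most one data element of the product is assumed to be corrupted.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append a row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of `b` the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_fault(c_full):
    """Return (row, col) of a single corrupted data element, or None.

    `c_full` is a full-checksum product: its last column holds row sums
    and its last row holds column sums.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("fault is not a single corrupted data element")


def correct_fault(c_full, row, col):
    """Rebuild the faulty element from its row checksum."""
    m = len(c_full[0]) - 1
    others = sum(c_full[row][j] for j in range(m) if j != col)
    c_full[row][col] = c_full[row][m] - others
```

Multiplying the column-checksum form of A by the row-checksum form of B yields a product that carries its own checksums, so the check costs one extra row and column rather than a second full multiplication. This only works because matrix multiplication preserves the checksum property, which is the lack of generality noted above.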

2. Assertions: Assertions are logic statements inserted at different points in a program that reflect invariant relationships between the program's variables. They often lead to problems because they are not transparent to the programmer, and their effectiveness depends on the nature of the application and on the programmer's ability.

3. Control Flow Checking (CFC): The basic task of CFC is to partition an application program into basic blocks, that is, branch-free parts of code. A deterministic signature (a number) is assigned to each block, and faults are detected by comparing the run-time signature with a pre-computed one. A major problem in most CFC techniques is tuning the test granularity.
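The signature comparison can be sketched as follows. This is a simplified illustration in the style of XOR-based signature schemes, assuming every block has exactly one legal predecessor; the block names, signature values, and the `run` function are hypothetical.

```python
class ControlFlowError(Exception):
    """Raised when the run-time signature disagrees with the stored one."""


# Hypothetical program: each basic block gets a pre-computed signature.
SIGNATURES = {"entry": 0b0001, "loop": 0b0010, "exit": 0b0100}
# Assumption: every block other than the first has one legal predecessor.
PREDECESSOR = {"loop": "entry", "exit": "loop"}
# Pre-computed at "compile time": XOR difference between a block and
# its legal predecessor.
DIFF = {blk: SIGNATURES[pred] ^ SIGNATURES[blk]
        for blk, pred in PREDECESSOR.items()}


def run(path):
    """Walk `path` (a list of block names) while tracking the signature."""
    g = SIGNATURES[path[0]]          # run-time signature register
    for block in path[1:]:
        g ^= DIFF[block]             # update on block entry
        if g != SIGNATURES[block]:   # compare with the pre-computed value
            raise ControlFlowError(f"illegal jump into block {block!r}")
    return g
```

Checking at every block entry gives the finest granularity and the highest overhead. Checking only at selected blocks is cheaper but lets an error go unnoticed for longer, which is the tuning problem mentioned above.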

4. Procedure Duplication (PD): The programmer duplicates the most critical procedures, executes them on two different processors, and compares the results. This approach requires the programmer to decide which procedures to duplicate and to introduce proper checks on the results. These code modifications are made manually and might introduce errors. Code duplication can also be used for byte-error detection on a single processor.

5. Error Detection by Duplicated Instructions (EDDI): Computation results from master and shadow instructions are compared before being written to memory. On a mismatch, the program jumps to an error handler that restarts the program. EDDI achieves high error coverage at the cost of a performance penalty from the time redundancy it introduces. Because general-purpose registers serve as shadow registers, more register spilling occurs with EDDI, and the additional memory operations increase the performance overhead.
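EDDI works at the instruction level, so a high-level language can only imitate it. The sketch below mimics the idea with a master and a shadow accumulator that are compared before the store. All names are hypothetical, and the `flip_mask` parameter exists only to inject a fault for demonstration.

```python
class SoftErrorDetected(Exception):
    """Raised when master and shadow results disagree before a store."""


def checked_store(memory, address, master, shadow):
    """Compare the two copies, then write to memory only if they match."""
    if master != shadow:
        raise SoftErrorDetected(f"mismatch before store to {address!r}")
    memory[address] = master


def scaled_sum(memory, values, scale, flip_mask=0):
    """Compute sum(v * scale) twice, in 'master' and 'shadow' variables."""
    acc = 0           # master register
    acc_shadow = 0    # shadow register
    for v in values:
        acc += v * scale          # master instruction
        acc_shadow += v * scale   # duplicated (shadow) instruction
    acc ^= flip_mask  # fault injection: simulate bit flips in the master
    checked_store(memory, "result", acc, acc_shadow)
```

A caller would catch `SoftErrorDetected` and restart the computation, which corresponds to the error handler described above. Every arithmetic step appears twice, which is where the time redundancy comes from.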

6. Software Implemented Error Detection and Correction (EDAC): Software-implemented EDAC approaches (e.g., Cyclic Redundancy Checks or CRC, Hamming codes, Bose-Chaudhuri-Hocquenghem or BCH codes) are effective in error detection but suffer from very high time overhead. Hamming, BCH, and Reed-Solomon (RS) codes have nice mathematical structures, but they are limited in the code lengths they support. These conventional error-correcting codes therefore have limitations, and they incur very high time redundancy when implemented in software. When the check bits become erroneous, the stored check bits no longer match the computed ones and, as a result, a block code fails. In general, checksum schemes fail when the data are corrupted enough to transform one valid code word into another (the distance of the code).
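To make the role of code distance concrete, here is a small software Hamming(7,4) encoder and decoder. It is a textbook sketch, not code from the article; the function names are hypothetical and bits are represented as Python lists of 0s and 1s.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit code word (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code_word):
    """Return (data_bits, syndrome); corrects one flipped bit if present.

    The syndrome is 0 for a clean word, otherwise the 1-based position
    of the bit that the decoder flipped back.
    """
    c = list(code_word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome
```

The code has distance 3, so any single flipped bit is corrected. Two flipped bits leave the word closer to a different valid code word, and the decoder then "corrects" it to the wrong data, which is the failure mode described above.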

7. Periodic Memory Scrubbing: This approach relies on periodically reloading code into main memory from an immutable memory. It is effective for protecting the code segments of the operating system and application programs. The performance penalty comes from the repeated memory reads.
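A scrubbing pass can be pictured as below. This is only a sketch: `GOLDEN_IMAGE` stands in for the immutable copy (for example, ROM contents), the byte values are made up, and a real system would run `scrub` from a timer rather than on demand.

```python
# Hypothetical immutable copy of a code segment.
GOLDEN_IMAGE = bytes([0x90, 0x3E, 0x07, 0xC9])


def scrub(working, golden=GOLDEN_IMAGE):
    """Reload `working` (a bytearray) from `golden`.

    Returns how many bytes differed before the reload, which a system
    could log as the number of upsets repaired in this pass.
    """
    repaired = sum(1 for w, g in zip(working, golden) if w != g)
    working[:] = golden   # in-place reload of the whole segment
    return repaired
```

Each pass reads the whole segment whether or not anything was corrupted, which is the source of the performance penalty.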

8. Masking Redundancy: This approach keeps an application running in the presence of faults. A few processors run the same program and vote to identify errors in any single processor. Errors can thus be masked from the application software, and no software rollbacks are required to fix them.

9. Reconfiguration: This means removing failed modules from the system. When a failure occurs in a module, its effects on the rest of the system are isolated. A large number of functional modules are used, which are switched in automatically to replace a failing module.

10. Replication: This ensures reliability but is expensive in terms of hardware or run-time cost. The idea is to take a majority vote on a calculation replicated N times. A software solution requires each processor to run N copies of the surrounding computations and then vote on the result, which slows down the computation by at least a factor of N.
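The majority vote over N replicated calculations can be sketched as follows. The names `vote` and `replicated` are hypothetical, and results are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def vote(results):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def replicated(fn, n=3):
    """Wrap `fn` so that each call runs it `n` times and votes."""
    def wrapper(*args):
        return vote([fn(*args) for _ in range(n)])
    return wrapper
```

Because the wrapper calls the function N times in sequence, the slowdown by a factor of at least N is visible directly in the code.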

11. ReStore Architecture: The ReStore architecture detects transient errors (soft errors) through time redundancy. Its novelty is the use of transient error symptoms, such as memory protection violations and incorrect control flow. The tendency of these symptoms to occur quickly after a transient, coupled with a hardware checkpointing implementation that restores clean architectural state, enables cost-effective soft error detection and recovery.

12. Dual Modular Redundancy (DMR) & Backward-Error Recovery (BER) & Checkpoint: Errors are detected through differences in execution across a dual modular redundant (DMR) processor pair. DMR is a backward-error recovery (BER) technique in which two processors are used to detect errors in execution. BER mechanisms create checkpoints of correct system state and roll back processor execution when an error is detected. A checkpoint of program state consists of a snapshot of architectural registers and memory values, and it logically represents a single point in time.
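The interplay of pairwise comparison, checkpoints, and rollback can be modeled in a few lines. This is a software toy rather than a description of real DMR hardware: the class `DMRPair`, its methods, and the `inject_fault` hook are hypothetical, and the "processors" are two calls to the same function on copies of the state.

```python
import copy


class DMRPair:
    """Two lock-stepped 'processors' with checkpoint and rollback."""

    def __init__(self, state):
        self.state = copy.deepcopy(state)
        self.checkpoint = copy.deepcopy(state)   # last known-good snapshot
        self.rollbacks = 0

    def take_checkpoint(self):
        """Snapshot the current state as a single point in time."""
        self.checkpoint = copy.deepcopy(self.state)

    def step(self, operation, inject_fault=None):
        """Run `operation` on both copies; roll back on any mismatch."""
        result_a = operation(copy.deepcopy(self.state))
        result_b = operation(copy.deepcopy(self.state))
        if inject_fault is not None:
            result_b = inject_fault(result_b)    # demonstration only
        if result_a != result_b:
            self.state = copy.deepcopy(self.checkpoint)
            self.rollbacks += 1
            return False
        self.state = result_a
        return True
```

With only two copies the pair can tell that an error occurred but not which copy is wrong, so recovery has to go backward to the checkpoint and re-execute. Work done since the last checkpoint is lost, which makes checkpoint frequency a trade-off between overhead and re-execution time.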

13. Triple Modular Redundancy (TMR) & Forward-Error Recovery (FER): Three processors execute the same program, and when one processor fails, a majority vote identifies the erroneous processor. TMR is the classic example of FER, in which enough redundancy exists in the system to determine the correct operation should a processor fail.

14. Fingerprinting: This mechanism detects differences in execution across a dual modular redundant (DMR) processor pair. It summarizes a processor's execution history in a hash-based signature. Differences between two mirrored processors are exposed by comparing their fingerprints.
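A hash-based summary of execution history might look like the sketch below. It assumes that the history is a sequence of (address, data) updates and uses CRC-32 from Python's standard `zlib` module as the hash; both choices, and the class name `Fingerprint`, are illustrative assumptions rather than details from the source.

```python
import zlib


class Fingerprint:
    """Running hash over a processor's stream of state updates."""

    def __init__(self):
        self.value = 0

    def record(self, address, data):
        """Fold one (address, data) update into the fingerprint."""
        update = f"{address:x}:{data:x}".encode()
        # zlib.crc32 accepts the previous CRC as its starting value.
        self.value = zlib.crc32(update, self.value)
```

The two processors then exchange and compare a single small value per interval instead of every update, which is what keeps the comparison bandwidth low.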

15. Processor Status Word Tracking (PSWT): If at any point during the execution of a program a processor status word (PSW) is found to be invalid, an error has occurred. A PSW is invalid when it does not match any of the known valid PSWs in the PSW bank maintained for that program.
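The PSW bank lookup reduces to a set membership test, sketched here. How the bank is built is an assumption: the example collects it from a fault-free trace, and the PSW values and function names are hypothetical.

```python
class InvalidPSW(Exception):
    """Raised when an observed PSW is not in the program's PSW bank."""


def build_psw_bank(fault_free_trace):
    """Collect the valid PSW values seen during a fault-free run."""
    return frozenset(fault_free_trace)


def track(psw_trace, bank):
    """Check every observed PSW against the bank; return the step count."""
    for step, psw in enumerate(psw_trace):
        if psw not in bank:
            raise InvalidPSW(f"step {step}: PSW {psw:#04x} is not in the bank")
    return len(psw_trace)
```

A corrupted PSW that happens to equal another valid entry passes this check, so the test detects only errors that leave the set of known values.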

16. Application Semantic Based Assertions (ASBA): Various assertions are applied that are derived from an understanding of the semantics of the application. Any violation of an assertion indicates an erroneous state.
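As an example of assertions drawn from application semantics, consider a hypothetical tank-level controller. The application, the limits, and the names below are all invented for illustration: the level must stay within the tank's capacity and cannot physically change by more than a small step between two readings.

```python
class SemanticViolation(Exception):
    """Raised when a value contradicts what the application allows."""


def check_level(previous, current, capacity=100.0, max_step=5.0):
    """Validate a new tank-level reading against application knowledge."""
    # Assertion 1: a level outside the tank is physically impossible.
    if not 0.0 <= current <= capacity:
        raise SemanticViolation(f"level {current} outside 0..{capacity}")
    # Assertion 2: the level cannot jump faster than the pumps allow.
    if abs(current - previous) > max_step:
        raise SemanticViolation(f"jump from {previous} to {current}")
    return current
```

Neither check could be generated without knowing what the variable means, which is why the effectiveness of such assertions depends on the application and on the programmer.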

17. Checksum & Parity: These are effective for bit-error detection but not suitable for error correction. A single parity check can detect only an odd number of bit errors; any even number of bit errors remains undetected. In a typical checksum, n bytes are XORed and the result is stored in the (n+1)th byte. If this byte itself is corrupted by transients, or if an even number of bits change in the same bit position, the errors remain undetected by this typical checksum.
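The blind spots of parity and XOR checksums are easy to reproduce. The helper names in this short sketch are hypothetical.

```python
from functools import reduce
from operator import xor


def parity(word):
    """Return 1 if `word` has an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1


def xor_checksum(data):
    """XOR all bytes of `data` together (the value kept in byte n+1)."""
    return reduce(xor, data, 0)
```

Flipping one bit of a word changes its parity, while flipping two bits restores it. In the same way, flipping the same bit position in two different bytes leaves the XOR checksum unchanged.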

18. Matrix Checksums: By keeping a typical checksum for each row and each column of a matrix, an erroneous element of the matrix can be detected. This is useful for detecting errors in application data.
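Locating an erroneous element from row and column checksums can be sketched as follows, assuming XOR checksums and at most one corrupted element. The function names are hypothetical.

```python
from functools import reduce
from operator import xor


def checksums(matrix):
    """Return (row_checksums, column_checksums) using XOR."""
    rows = [reduce(xor, row, 0) for row in matrix]
    cols = [reduce(xor, col, 0) for col in zip(*matrix)]
    return rows, cols


def find_error(matrix, stored):
    """Return (row, col) of the corrupted element, or None if clean."""
    rows, cols = checksums(matrix)
    bad_rows = [i for i, (now, old) in enumerate(zip(rows, stored[0]))
                if now != old]
    bad_cols = [j for j, (now, old) in enumerate(zip(cols, stored[1]))
                if now != old]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("more than one element (or a checksum) is corrupted")
```

The failing row and the failing column intersect at the damaged element, so two one-dimensional checks pinpoint a position in two dimensions.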

19. Arithmetic Sum & Difference Checks for a Pair of Elements: This approach is useful for detecting and correcting multiple-bit errors in data words. Even an error affecting all the bits of a data word is corrected by this approach.
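The source does not spell out the scheme, so the following is only one possible reading of it: for a pair of integers (a, b), the sum a + b and the difference a - b are stored as checks, and both elements are rebuilt from the checks when a mismatch is found. The function names are hypothetical, and the sketch assumes that the stored checks themselves are intact.

```python
def protect(a, b):
    """Return the (sum, difference) checks for the pair (a, b)."""
    return a + b, a - b


def check(a, b, s, d):
    """True if the pair still agrees with its stored checks."""
    return a + b == s and a - b == d


def recover(s, d):
    """Rebuild (a, b) from the checks: a = (s + d) / 2, b = (s - d) / 2."""
    # s + d == 2a and s - d == 2b, so both divisions are exact.
    return (s + d) // 2, (s - d) // 2
```

Under this reading, recovery does not use the damaged words at all, so it makes no difference how many of their bits were flipped.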

20. The NOP-PSW Approach: This is useful for detecting transient errors (soft errors) in microprocessor registers, memory, and the stack area that might occur during operation in various industrial environments. This generalized approach needs neither multiple processors nor multiple software designs. It is a single-version, low-cost, yet efficient approach (fast, with low memory-space overhead) for tolerating transient faults. Code size grows by 15% and execution time increases by 20.2% (as discussed in section-3). Such redundancy is negligibly small in comparison with other existing techniques. The approach is also useful for processor hardening and transient susceptibility testing. It cannot detect all control flow errors. It is more efficient than conventional software-implemented EDAC, PD, scrubbing, masking, DMR or TMR, fingerprinting, etc. This novel NOP-PSW approach is intended as an efficient supplement to other prevailing software-based fault tolerance approaches. It is very useful for designing fault-tolerant microprocessor-based systems from COTS components, because components hardened against electromagnetic interference (EMI), transients, or radiation are very costly. The approach is also useful for software-based fault avoidance.


Table 1. Software Implemented Fault Detection Approaches



Source: Ubiquity Volume 9, Issue 18 (May 6, 2008 - May 12, 2008)
