Software-Based Fault Tolerant Computing
Ubiquity, Volume 2005 Issue November | BY Goutam Kumar Saha
This paper describes how to design a software-based fault tolerant application using microprocessor (MP), in order to tolerate the burst errors in memory. This approach may be called a single -- version scheme (SVS). The SVS relies on a single version application program which is enhanced with self-checking code redundancy to tolerate memory burst errors that are difficult to correct during the run-time of an application. Conventionally, the other software based approaches can detect a few bit errors (in memory) only towards fail-stop kind of fault tolerance against transient bit errors. Reed Solomon codes are mainly effective for burst errors in coding of audio Compact Disks at offline only. The proposed online technique does not need multiple versions of software and multiple machines. This approach employs only two copies of the application software running on one machine only. Two copies of the enhanced version version of an application are used here for online error detection and tolerance thereof as well. This is an effective low-cost online tool for hardening a microprocessor-based industrial computing system or for on-chip DRAM applications using an affordable code and time redundancy against the burst errors in processor memory. The SVS aims to provide a non-fail-stop kind of fault tolerance against burst errors. This approach supplements the Error Correcting Codes (ECC) in memory system also, against both the transient and permanent bit errors in memory.