A low-cost testing for transient faults

Ubiquity, Volume 2006 Issue January | BY Goutam Kumar Saha


An unconventional, low-cost, software-implemented testing technique for processor transient faults is briefly discussed here. On-line signatures of the Processor Status Register are used to detect transient faults.

Introduction:
Electrical transients often affect a computer's primary memory, or random access memory (RAM). RAM often experiences transient bit errors while a program is being executed, and these errors occur at random. For this reason, even software without any design error often produces wrong results when it is executed in an electrically noisy industrial environment. Faults in program control flow or in data can cause such wrong results. This communication aims at modeling an online test that is capable of detecting multiple bit errors at various locations in RAM during program execution, in order to achieve reliable and fault-tolerant computing. The proposed unconventional software technique is based on the fail-stop failure model. It also takes the necessary recovery actions immediately after an error is detected, in order to stop error propagation and thus eliminate ambiguous results caused by transients. Conventional error codes, for example parity bits, checksums, and Hamming codes, can detect multiple errors but cannot repair all of them. Moreover, all these codes are implemented in hardware, because their software implementations suffer from high overhead in both time and space redundancy.

The Proposed Work:
In contrast, the proposed software-implemented technique injects the code of the "No Operation" (NOP) instruction at various locations inside the application program and then checks these "No Operation" codes for possible corruption in order to validate code immunity at run time. The more "No Operation" codes are injected, the higher the fault coverage. If a "No Operation" code is corrupted, the adjacent application program codes might also be corrupted. The affected codes are recovered by reloading the application codes or by copying them back from a master copy, and the application is then re-executed. In some cases, where triple memory redundancy (TMR) can be afforded, the corrupted codes can also be recovered by using three images of the application. This technique is a low-cost solution for transient fault tolerance: its redundancy in both time and space is small and affordable. On an affordable high-speed machine, the overhead amounts to only a little extra execution time. The technique is also useful for locating faults, because the locations at which the "No Operation" codes are injected are known. The choice of the NOP code is guided by the knowledge that it is only one byte long and that executing it does not change the processor status word (PSW). It provides a delay of one machine cycle in order to subdue the presence of transients. At run time we can also compare two PSWs (one taken before a NOP code and another taken after it) in order to verify the transient immunity of the processing environment.
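The inject, verify and recover cycle described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the author's implementation: memory is modeled as a Python bytearray rather than real program RAM, the NOP opcode is assumed to be 0x90 (the one-byte NOP on x86; other processors use other values), and the function names inject_nops, find_corrupted_nops, recover_from_master and vote_tmr are hypothetical names chosen for this illustration. The PSW comparison is not modeled, since it needs processor-level access.

```python
NOP = 0x90  # assumed one-byte NOP opcode (x86); processor dependent


def inject_nops(program: bytes, interval: int) -> tuple[bytearray, list[int]]:
    """Copy `program`, appending one NOP byte after every `interval` bytes.

    Returns the new image and the offsets at which NOPs were placed.
    """
    image = bytearray()
    nop_offsets = []
    for start in range(0, len(program), interval):
        image += program[start:start + interval]
        nop_offsets.append(len(image))  # the NOP lands at the current end
        image.append(NOP)
    return image, nop_offsets


def find_corrupted_nops(image: bytearray, nop_offsets: list[int]) -> list[int]:
    """Return the known NOP offsets whose byte no longer equals NOP."""
    return [off for off in nop_offsets if image[off] != NOP]


def recover_from_master(image: bytearray, master: bytes) -> None:
    """Restore the working image in place from the master copy."""
    image[:] = master


def vote_tmr(a: bytes, b: bytes, c: bytes) -> bytearray:
    """Bitwise majority vote over three equally long images."""
    return bytearray((x & y) | (y & z) | (x & z) for x, y, z in zip(a, b, c))


# Stand-in "application code": 12 arbitrary bytes, none equal to NOP.
program = bytes(range(1, 13))
master, nop_offsets = inject_nops(program, interval=4)

# Working image in "RAM"; simulate a transient flipping two bits of one NOP.
working = bytearray(master)
working[nop_offsets[1]] ^= 0b00010100

corrupted = find_corrupted_nops(working, nop_offsets)
if corrupted:
    # Fail-stop: halt normal execution, restore the code, then re-execute.
    recover_from_master(working, master)

# Where three images can be afforded, a single damaged image is outvoted.
damaged = bytearray(master)
damaged[2] ^= 0xFF
repaired = vote_tmr(damaged, master, master)
```

Because the NOP offsets are known in advance, the list returned by find_corrupted_nops also indicates roughly where in the image the fault occurred. A smaller interval means more injected NOPs and therefore higher coverage, at the cost of more space and more checking time.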

Author's Biography

Goutam Kumar Saha has been working as a Computer Scientist for the last seventeen years. He has worked in various renowned research organizations, namely LRDE, Defence Research & Development Organisation (DRDO), Bangalore, and the Electronics Research & Development Centre of India (ER&DCI), Calcutta. At present, he is working at the Centre for Development of Advanced Computing (CDAC), Kolkata, as a Scientist-F. He has authored many research papers on fault-tolerant computing and natural language engineering. He is a senior member of IEEE (USA), ACM, and the Computer Society of India (CSI). He is a Fellow Member of IETE, MSPI (New Delhi) and IMS (Goa). He has received various grants and awards from reputed international and national institutions. He is a referee for the CSI Journal, IJCPOL, the AMSE Journal (France/Spain) and the IEEE Potentials Magazine. He is an associate editor of ACM Ubiquity.
