Software fault tolerance through run-time fault detection

Ubiquity, Volume 2005 Issue December | BY Goutam Kumar Saha 


Electrical transients often disrupt the proper functioning of a program. They cause errors in program flow, data, program code, or processor registers. The aim of this article is to detect transient faults as quickly as possible, so that functions are not performed wrongly and data are not lost during the execution of an application program. Recovery is initiated immediately after an error is detected, in order to achieve high software fault tolerance and dependable computing. Transient errors are detected here by tracing the presence of an odd (unexpected) processor status word (PSW) during the execution of a computing application.



  1. Introduction:
    A failure means that a program, in its functioning, has not met user requirements in some way; a system fails when it cannot meet its promises. An error is a part of a system's state that may lead to a failure, and the cause of an error is called a fault. Faults can be classified as permanent, transient, or intermittent. A permanent fault continues to exist until the faulty component is repaired. A transient fault occurs once and cannot be traced afterwards; if the operation is repeated, the fault goes away. An intermittent fault appears not continuously but at irregular intervals. Fault tolerance means that a system can provide its services even in the presence of faults. Safety means that nothing catastrophic happens when a system temporarily fails to operate correctly. Many process control systems, such as those used in chemical plants, in sending people into space, or in controlling nuclear power plants, are required to provide a high degree of safety. The objective of this article is to describe a new way of keeping track of program flow while the program is being executed: the Processor Status Register (PSR) is examined periodically in order to catch faults.
  2. Work Description:
    We need to identify the most sensitive or critical parts of a microprocessor- or microcomputer-based application program. I/O instructions, conditional branches, and decision-generating instructions are examples of sensitive parts of a program. Let S1, S2, ..., Si be the sensitive points in a program. At each of these points we need to create a status-word bank holding the contents of the PSR at that point, taking all possible inputs into consideration. Let B1, B2, ..., Bi be the processor-status-word banks at the sensitive points S1, S2, ..., Si. We can create Bi by inserting a PUSH PSW instruction after the sensitive point Si.
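    As a rough illustration of how a bank Bi might be filled, the sketch below simulates one sensitive point in Python rather than on real hardware. Everything in it is an assumption made for the example: the flag byte is a simplified model (sign, zero, parity, carry) loosely patterned on an 8085-style processor, which the article does not name, and `flags_after_add`, `build_bank`, and the input domain are hypothetical.

```python
# Simplified, assumed flag-byte model (not taken from the article):
# only sign, zero, parity and carry are tracked.
SIGN, ZERO, PARITY, CARRY = 0x80, 0x40, 0x04, 0x01


def flags_after_add(a, b):
    """Return the simulated flag byte after an 8-bit addition a + b."""
    total = a + b
    result = total & 0xFF
    flags = 0
    if result & 0x80:
        flags |= SIGN
    if result == 0:
        flags |= ZERO
    if bin(result).count("1") % 2 == 0:
        flags |= PARITY          # flag is set when the result has even parity
    if total > 0xFF:
        flags |= CARRY
    return flags


def build_bank(valid_inputs):
    """Collect every flag byte observed at one sensitive point (the bank Bi)."""
    return {flags_after_add(a, b) for a, b in valid_inputs}


# Hypothetical sensitive point S1: a reading in 0..9 is added to an offset of 5.
valid_inputs_s1 = [(reading, 5) for reading in range(10)]
bank_s1 = build_bank(valid_inputs_s1)
print(sorted(hex(f) for f in bank_s1))  # prints ['0x0', '0x4']
```

    On a real target the flag byte would be captured with PUSH PSW after Si instead of being computed, but the bank is still just the set of values observed over all valid inputs.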
  3. The Novel Approach:
    The proposed design technique consists of the following steps.
    1. After a thorough study of the application system, the system design engineers identify the critical points, say {S1, S2, ..., Si}.
    2. Keep a separate processor-status-word bank Bi for each sensitive point Si in the application program.
    3. While the program is being executed in a real-life, electrically hazardous industrial environment, catch PSRi (using the PUSH PSW and POP PSW instructions) just after Si and compare it with the entries of Bi. A PSRi that is not found in Bi indicates a transient error, and recovery is initiated immediately.
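    The three steps above can be sketched as a run-time check. This is a minimal simulation under stated assumptions, not the author's implementation: the bank contents are made-up values, `check_psw`, `run_with_recovery`, and `make_flaky_operation` are hypothetical names, and re-executing the operation is an assumed recovery policy, chosen because a transient fault is expected to go away when the operation is repeated.

```python
# Hypothetical pre-stored banks Bi: sensitive point -> set of valid PSW bytes.
# The values are invented for illustration only.
BANKS = {
    "S1": {0x00, 0x04},
    "S2": {0x00, 0x04, 0x80, 0x84},
}


class TransientFaultDetected(Exception):
    """Raised when a captured PSW is not in the bank for its sensitive point."""


def check_psw(point, psw):
    """Compare the PSW captured just after a sensitive point with its bank."""
    if psw not in BANKS[point]:
        raise TransientFaultDetected(f"odd PSW {psw:#04x} at {point}")


def run_with_recovery(point, operation, max_attempts=3):
    """Run operation() -> (result, psw); repeat it if the PSW looks odd."""
    for attempt in range(1, max_attempts + 1):
        result, psw = operation()
        try:
            check_psw(point, psw)
        except TransientFaultDetected:
            continue  # discard the suspect result and execute again
        return result, attempt
    raise RuntimeError(f"fault at {point} persisted; probably not transient")


def make_flaky_operation():
    """Simulated sensitive operation whose first run suffers a flag bit flip."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        psw = 0x04                 # expected flag byte for this operation
        if state["calls"] == 1:
            psw ^= 0x40            # injected transient: zero flag flipped
        return 12, psw

    return operation


result, attempts = run_with_recovery("S1", make_flaky_operation())
print(result, attempts)  # prints 12 2
```

    The first execution yields the odd value 0x44, which is not in the bank for S1, so the result is discarded and the operation is repeated; the second execution passes the check.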
  4. Discussion:
    It is very important that every Bi contains the valid PSRs corresponding to all possible valid input signals or information. It is not impossible to have a complete set of PSRs in each bank Bi. Once the banks have matured for a typical operational environment, the system designer can achieve a more dependable computing system. However, this demands a thorough study of the application system as well as of the operational environment. Interested readers may refer to related works [1,2,3,4,5].
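    To see why completeness of each bank matters, the sketch below compares a bank built from only part of the valid input range with the set of flag bytes that the full range can produce. It reuses the same assumed, simplified flag model (sign, zero, parity, carry); `missing_entries` and the input ranges are hypothetical. Any flag byte reported as missing would be treated as a fault at run time even though it comes from a valid input.

```python
# Simplified, assumed flag-byte model (not taken from the article).
SIGN, ZERO, PARITY, CARRY = 0x80, 0x40, 0x04, 0x01


def flags_after_add(a, b):
    """Return the simulated flag byte after an 8-bit addition a + b."""
    total = a + b
    result = total & 0xFF
    flags = 0
    if result & 0x80:
        flags |= SIGN
    if result == 0:
        flags |= ZERO
    if bin(result).count("1") % 2 == 0:
        flags |= PARITY
    if total > 0xFF:
        flags |= CARRY
    return flags


def build_bank(inputs):
    """Collect the flag bytes produced by the given inputs."""
    return {flags_after_add(a, b) for a, b in inputs}


def missing_entries(bank, valid_inputs):
    """Flag bytes that valid inputs can produce but the bank does not hold."""
    return build_bank(valid_inputs) - bank


# Hypothetical sensitive point: an 8-bit counter is incremented by 1.
all_valid = [(a, 1) for a in range(256)]
sampled = [(a, 1) for a in range(100)]   # bank trained on part of the range only

partial_bank = build_bank(sampled)
gaps = missing_entries(partial_bank, all_valid)
print(sorted(hex(f) for f in gaps))  # prints ['0x45', '0x80', '0x84']
```

    An empty result from `missing_entries` is what a matured bank should give for its operational environment.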
  5. Conclusion:
    The proposed approach is a very low-cost software solution for achieving higher fault tolerance: it keeps track of the run-time PSR contents and compares them with the pre-stored PSR banks for each critical section of the application system. It does not need multiple versions of the application system. The redundancy it introduces in both time and space can easily be afforded by today's high-speed computing systems.
References:

[1] G.K. Saha, "Software Based EMI-Fault Tolerance in a PC Peripheral," International Journal - Cybernetics and System Analysis, vol. 32(5), Plenum Press, USA, 1996.

[2] Goutam Kumar Saha, "Software Based Fault Detection in Microprocessors," in press, IEEE Potentials, IEEE Press, USA, 2005.

[3] Goutam K Saha, "Software as a Tool to Control EMI/EMC in Designing Computers," IEEE - EUROEM Book of Abstracts, France, 1994.

[4] Goutam Kumar Saha, "Software Based Computing Security & Fault Tolerance," ACM Ubiquity, vol. 5(15), ACM Press, USA, June, 2004.

[5] G.K. Saha, "Noise Reduction in Computer Process Synchronization," Proceedings of the IEEE International Symposium INCEMIC'99, IEEE Catalog 99TH8487, pp. 443-444, New Delhi, 1999.

About the Author

Goutam Kumar Saha has been working as a computer scientist in various premier R&D organizations in India for the last seventeen years. He has worked at LRDE, DRDO, Bangalore, and at ER&DCI, Calcutta, and at present he is with the Centre for Development of Advanced Computing, Kolkata, India, as a Scientist-F. He has authored more than one hundred research papers in various international journals and conferences. He is a reviewer for the CSI Journal, the AMSE-Modeling Journal (France), IJCPOL, and IEEE Potentials magazine. His fields of interest are fault-tolerant computing software, dependable computing, and natural language engineering. He has received many awards and scholarships. He is a Fellow of IETE, MSPI, and IMS, a Senior Member of IEEE, CSI, and ACM, and a member of the W3C Internationalization Tag Set Working Group. He can be reached via [email protected] or [email protected]

COMMENTS

Software fault tolerance through runtime fault detection is a proactive approach to ensuring the reliability and availability of software systems. It involves continuous monitoring, error detection, and recovery mechanisms that help the system withstand faults and continue functioning as intended. This is particularly important in critical systems such as aerospace, healthcare, and industrial automation where system failures can have serious consequences: https://www.jobz.pk/it-employment/

- Mudassar, Thu, 28 Sep 2023 08:48:43 UTC

