Run-time Errors

Introduction

This document describes the currently implemented error handling and recovery of Oberon RTS.

Context

It is assumed that the control program and its logic were carefully designed and thoroughly tested. However, an unsupervised control system often runs in “hostile” environments such as factory floors. Peripheral devices may be connected via cables outside the controller’s housing, for example. Connection issues, or glitches in data transmission, can result in missing or erroneous data, and thus run-time errors not due to the programmed logic.

Handled Run-time Errors

Traps

The Oberon compiler inserts the following run-time checks, resulting in traps if violated:

  1. array index out of bounds
  2. type guard failure
  3. array or string copy overflow
  4. access via NIL pointer
  5. illegal procedure call
  6. integer division by zero
  7. assertion violation

A trap results in the execution of the corresponding handler, which is installed at a known, fixed memory address. Note that the trap handler is invoked by a BL instruction, not as an interrupt handler (unlike on, say, a Cortex-M processor).

Watchdog

The watchdog is a hardware device that detects a process that does not yield control in time, or no longer yields at all. The watchdog timer must be reset by software before it expires; otherwise it triggers a non-returning interrupt.
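
The watchdog behaviour can be illustrated with a small software model. This is only a sketch in Python, not Oberon RTS code: the names Watchdog, reset, tick, and run are hypothetical, time is counted in abstract ticks, and it is assumed here that the watchdog is reset each time a process is activated.

```python
class Watchdog:
    """Toy model of a hardware watchdog timer (hypothetical names)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.expired = False

    def reset(self):
        """Software resets the timer; this has no effect once it has expired."""
        if not self.expired:
            self.remaining = self.timeout_ticks

    def tick(self):
        """Advance time by one tick; return True on the tick the timer expires."""
        if self.expired:
            return False
        self.remaining -= 1
        if self.remaining <= 0:
            self.expired = True
            return True
        return False


def run(ticks_per_activation, timeout_ticks):
    """Simulate process activations that each run for a number of ticks.

    Assumption: the watchdog is reset at the start of each activation.
    Returns the index of the activation during which the watchdog expired,
    or None if every activation yielded in time.
    """
    watchdog = Watchdog(timeout_ticks)
    for index, ticks in enumerate(ticks_per_activation):
        watchdog.reset()
        for _ in range(ticks):
            if watchdog.tick():
                return index  # stands in for the non-returning interrupt
    return None
```

With a timeout of 5 ticks, run([1, 2, 3], 5) returns None, while run([1, 7, 2], 5) returns 1, as the second activation does not yield in time.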

Stack Overflow

The stack overflow monitor is a hardware device that detects when the stack pointer reaches into the so-called hot zone of the stack memory area, or even moves outside this area. If the stack pointer reaches into the hot zone, a non-returning interrupt is triggered.

If the stack pointer reaches beyond (below) the stack memory area, the system is reset.
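
The two cases can be summarised in a small model. This is a sketch in Python under stated assumptions, not the hardware logic: the stack is assumed to grow downward, the hot zone is assumed to be the lowest part of the stack memory area, and the function name, parameters, and result strings are hypothetical.

```python
def check_stack(sp, stack_bottom, stack_size, hot_zone_size):
    """Classify a stack pointer value (hypothetical model).

    Assumptions: the stack area spans the addresses
    [stack_bottom, stack_bottom + stack_size), the stack grows downward,
    and the hot zone is the lowest hot_zone_size bytes of that area.
    """
    if sp < stack_bottom:
        return "reset"       # beyond (below) the stack area: system reset
    if sp < stack_bottom + hot_zone_size:
        return "interrupt"   # inside the hot zone: non-returning interrupt
    return "ok"
```

For a stack area starting at 0x8000 with a hot zone of 0x100 bytes, a stack pointer of 0x8080 yields "interrupt", and 0x7FFC yields "reset".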

Not Alive Process

Using the timeout functionality (see module Threads), a process can detect if it is “not alive”, i.e. not being activated by the scheduler within a defined time period. The process can then attempt to handle the situation. For example, a hardware device may just need to be queried again. As a last resort, a non-returning interrupt can be triggered.

Kill/abort Push Button

The manual kill/abort push button triggers a non-returning interrupt.

Note: the abort button of the stock Embedded Oberon system hardware-resets the system without reloading the system software. It has been replaced by the kill/abort button described above, plus a reset button that hardware-resets the system and always reloads the system software.

Policy and Concepts vs. Mechanisms

Separation of Concerns

The policy, concepts, and implementation for handling all run-time errors are contained in module Errors. Module Threads, which encompasses the definitions and implementation for the coroutine-based processes, provides the required mechanisms. In general, error handling needs to be tailored to the specific control application.

Process Recover, Reset, Remove

Processes can individually be reset or removed.

  • Reset: the coroutine is reset and will start executing its code from the beginning when scheduled. All hardware-based timers for that process are disabled. The process can use its initialisation code to restore a specific state as needed, and restart from there, to implement a recovery in lieu of a full reset.

  • Remove: the process is removed from the list, and thus not scheduled again. The process's finalise error handler is executed, e.g. to free resources, or release a claimed semaphore.
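
The difference between the two operations can be sketched as follows. This is a Python model using generators as stand-ins for coroutines, not the interface of module Threads: all names (Process, Scheduler, install, reset, remove, counter) are hypothetical, and a single flag stands in for the hardware-based timers.

```python
class Process:
    """Toy process: a generator stands in for the coroutine."""

    def __init__(self, name, body, finalise=None):
        self.name = name
        self.body = body              # generator function with the process code
        self.finalise = finalise      # optional finalise handler
        self.timer_enabled = False    # stands in for hardware-based timers
        self.coroutine = body()

    def step(self):
        """Run the process until its next yield."""
        return next(self.coroutine)


class Scheduler:
    def __init__(self):
        self.processes = []

    def install(self, process):
        self.processes.append(process)

    def reset(self, process):
        """Restart the process code from the beginning; disable its timers."""
        process.timer_enabled = False
        process.coroutine = process.body()

    def remove(self, process):
        """Take the process off the list, then run its finalise handler."""
        self.processes.remove(process)
        if process.finalise is not None:
            process.finalise()


def counter():
    """Example process body: yields 1, 2, 3, ... on successive activations."""
    n = 0
    while True:
        n += 1
        yield n
```

After a reset, the next step of a process running counter yields 1 again, since the code starts from the beginning. After a remove, the process is no longer in the list and its finalise handler has run once.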

Error Recovery Approach

Basics

While individual processes could be targeted for recovery upon failure, the current implementation uses a system-level approach for simplicity. More sophisticated error recovery schemes could be implemented if needed, but they become complex quickly.

Panic Mode

It’s important to remember that the system is assumed to run unsupervised by any human, and autonomously tries to get itself into an operational and stable state. The system is in panic mode, and quick and possibly “harsh” measures can be warranted. Nonetheless, some subtlety is possible, short of just resetting and reloading the system in each case.

System OK

The audit process checks whether the system is stable (again). If no error condition has been detected and handled for a specified number of process runs, it resets the system-wide error conditions as explained below.

Error State and Restart Counter

Module Errors maintains an error state condition, which is part of the system control register in the FPGA and thus survives a system reset. This error condition is zero with a stable system, and is then incremented with each subsequent run-time error, thus escalating the severity of the error handling. The audit process will reset this error condition to zero if, or when, the system is stable again.

The top of the escalating error handling chain is a system restart, i.e. reloading the system from disk. Module Errors maintains a counter (also part of the system control register) of the number of restarts due to error handling. This makes it possible to prevent the system from reloading endlessly without ever reaching a stable state again. The audit process will reset this restart counter if, or when, the system is stable again.
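
The interplay between the audit process and the two persistent values can be modelled as follows. This is a sketch in Python, not the code of module Errors: the names ErrorRegister, Audit, error_handled, and run are hypothetical, and the "specified number of process runs" is interpreted here as a number of consecutive error-free activations of the audit process.

```python
class ErrorRegister:
    """Stand-in for the values kept in the system control register."""

    def __init__(self):
        self.error_state = 0
        self.restarts = 0


class Audit:
    """Toy audit process (hypothetical names and interface)."""

    def __init__(self, register, clean_runs_required):
        self.register = register
        self.clean_runs_required = clean_runs_required
        self.clean_runs = 0

    def error_handled(self):
        """Record that a run-time error was detected and handled."""
        self.clean_runs = 0

    def run(self):
        """One activation of the audit process.

        Returns True if the system was deemed stable and the error state
        and restart counter were reset.
        """
        self.clean_runs += 1
        if self.clean_runs >= self.clean_runs_required:
            self.register.error_state = 0
            self.register.restarts = 0
            self.clean_runs = 0
            return True
        return False
```

Any handled error restarts the count of clean runs, so the error state and restart counter are only cleared after an uninterrupted stable period.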

Stages

As mentioned above, the error handling and system recovery must be designed and adjusted for the specific control application. As implemented now, the error handling goes through the following stages of escalating corrective measures, unless the audit process resets the error state to zero.

  1. If the error state is 0, all processes are reset.
  2. If the error state is incremented to 1, all system and essential processes are reset; other processes are removed (killed).
  3. If the error state is incremented to 2, and the system has not been restarted yet, indicated by the restart counter, the system is hardware-reset and reloaded from disk, using the selected start-up table to restart the current control program.
  4. If the error state is 2, and the system has been restarted once, the system is restarted again, now using start-up table 0, to restart the initial control program.

After that sequence, autonomous error recovery has to give up, and the system is halted.
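
The escalation sequence above can be written down as a small decision function. This is a sketch in Python, not the code of module Errors. It assumes one reading of the stages: the handler dispatches on the error state and restart counter it finds when the error occurs, and then stores the incremented values, with the error state saturating at 2. The names Action and handle_error are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RESET_ALL = "reset all processes"
    RESET_ESSENTIAL = "reset system and essential processes, remove others"
    RESTART_SELECTED = "restart using the selected start-up table"
    RESTART_TABLE_0 = "restart using start-up table 0"
    HALT = "halt the system"


def handle_error(error_state, restarts):
    """Return (action, new_error_state, new_restarts) for one run-time error.

    Assumed reading: error_state and restarts are the values found in the
    system control register when the error occurs.
    """
    if error_state == 0:
        return Action.RESET_ALL, 1, restarts
    if error_state == 1:
        return Action.RESET_ESSENTIAL, 2, restarts
    if restarts == 0:
        return Action.RESTART_SELECTED, 2, 1
    if restarts == 1:
        return Action.RESTART_TABLE_0, 2, 2
    return Action.HALT, 2, restarts


def escalate(errors):
    """Apply handle_error to a run of consecutive errors, without any
    intervening reset by the audit process; return the list of actions."""
    state, restarts, actions = 0, 0, []
    for _ in range(errors):
        action, state, restarts = handle_error(state, restarts)
        actions.append(action)
    return actions
```

Under this reading, five consecutive errors without an intervening audit reset walk through all four stages and then halt. If the audit process resets both values in between, the next error starts again at the first stage.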

Note that halting the system might not be appropriate, depending on the overall design of the control system as well as the controlled system. Well-designed controlled systems often go into a defined state without the control system present or active. To achieve this, a mechanism to take the control system completely out of the control loop might be required as part of the hardware design, disconnecting any actuators.

Also, restarting the initial control program may or may not be appropriate, for example if the controlled system is physically in a state where the initial control program is useless.

Manual Kill/Abort Push Button

Pressing the manual kill/abort push button results in resetting the system processes and removing all others.

Logging

Run-time errors and any corrective measures are logged using the FPGA-based logger (described elsewhere). Logging into the FPGA is fast, and the log entries survive a system reset. By default, no error messages are printed to the console, as this would be too time-consuming. The log entries can be displayed using a separate module and command.