Run-time Errors

Context

It is assumed that the control program and its logic were carefully designed and thoroughly tested. Nonetheless, programs can contain defects. Also, an unsupervised control system often runs in “hostile” environments such as factory floors. For example, peripheral devices may be connected via cables outside the controller’s housing. Connection issues, or glitches in data transmission, can result in missing or erroneous data, and thus cause run-time errors that are not due to the programmed logic.

Note that simply aborting a faulty program or process, or part thereof, is not an option. The system must attempt to autonomously get the control program and its processes into a stable and predictable state again.

Overview

Run-time errors are detected and handled by two mechanisms:

  • Traps: traps result from run-time checks inserted by the Oberon compiler, hence they are detected by the running program itself. They are handled by the handler Errors.trap, which is installed at a defined, fixed memory address and is invoked by a BL instruction, not as an interrupt handler (unlike, say, on a Cortex-M processor).

  • Aborts: aborts are invoked by a hardware interrupt, since they handle error conditions that cannot be detected by the running program itself, or only with massive overhead. The interrupt is of the non-returning type. After executing the interrupt-specific handler itself, an abort is handled by the handler Errors.abort, which is installed at a known, fixed memory address. It is invoked by loading that address into the program counter register of the RISC5 hardware, in lieu of the PC value that was saved when entering interrupt handling (see note 1).

Handled Run-time Errors

Traps

The Oberon compiler inserts the following run-time checks, resulting in traps if violated:

  1. array index out of bounds
  2. type guard failure
  3. array or string copy overflow
  4. access via NIL pointer
  5. illegal procedure call
  6. integer division by zero
  7. assertion violation

Aborts

  • Watchdog: the watchdog is a hardware device that detects a process that does not yield control in time, or not at all anymore. The watchdog timer must be reset by software before it expires; otherwise it triggers a non-returning abort interrupt.

    The watchdog timer is reset by the scheduler before each execution of a process. It can be disabled by a process. For example, commands are executed with the watchdog disabled.

  • Stack Overflow: the stack overflow monitor is a hardware device that detects when the stack pointer reaches into the so-called hot zone of the stack memory area, or even outside this area. If the stack pointer reaches into the hot zone, a non-returning abort interrupt is triggered.

    If the stack pointer reaches beyond (below) the stack memory area, the system is reset.

  • Not Alive Process: using the timeout functionality (see module Processes), a process can detect if it is “not alive”, i.e. not being activated by the scheduler within a defined time period. The process can then attempt to handle the situation. For example, a hardware device may just need to be queried again. As an ultimate measure, a non-returning abort interrupt can be triggered.

  • Kill/abort Push Button: the manual kill/abort push button triggers a non-returning abort interrupt.

    Note: the abort button of the stock Embedded Oberon system, which hardware-resets the system, but does not reload the system software, has been replaced by the above kill/abort button, plus a reset button that hardware-resets the system, and always reloads the system software.
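The interplay between scheduler and watchdog described above can be illustrated with a toy model. This is a Python sketch for illustration only, not the Oberon implementation: the class Watchdog, the function run_once, and the notion of counting time in abstract "ticks" are assumptions made for this example.

```python
class Watchdog:
    """Toy model of a hardware watchdog timer counting down in ticks."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.enabled = True  # a process may disable the watchdog

    def reset(self):
        """Restart the countdown, as the scheduler does before each process run."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance time by one tick; return True if the timer has expired."""
        if not self.enabled:
            return False
        self.remaining -= 1
        return self.remaining <= 0


def run_once(run_ticks, watchdog):
    """Run each process for its given number of ticks.

    Returns the index of the first process that lets the watchdog expire
    (which would trigger the abort), or None if all processes yield in time.
    """
    for index, ticks in enumerate(run_ticks):
        watchdog.reset()  # scheduler resets the timer before each process
        for _ in range(ticks):
            if watchdog.tick():
                return index
    return None
```

With a timeout of 5 ticks, run_once([2, 3, 4], Watchdog(5)) finds no offender, while in run_once([2, 7, 1], Watchdog(5)) the second process (index 1) overruns. With the enabled flag cleared, as for command execution, no process triggers an abort.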

Handlers

Trap Handler

When encountering a trap, the trap handler Errors.trap is called. This handler logs the event, and calls the error handler Errors.error. The error handler will not return.

Abort Handler

When a non-returning interrupt is triggered, the abort handler Errors.abort is executed. This handler logs the event, and calls the error handler Errors.error. The error handler will not return.

Error Handler

The error handler Errors.error implements the error recovery attempts (see section Stages below). It does not return: either the system is reset and restarted, or the scheduler is reset and starts to execute the processes that have been reset during recovery.

Error in Error Handling

If a trap or abort occurs during error handling, a system fault is logged, and the system is restarted.

Should the error handler return, a system fault is logged, and the system is restarted.

Policy and Concepts vs. Mechanisms

Separation of Concerns

The policy, concepts, and implementation for the handling of all run-time errors are contained in module Errors. Module Processes, which encompasses the definitions and implementation for the coroutine-based processes, provides the required mechanisms. In general, error handling needs to be tailored to the specific control application.

Process Recover, Reset, Remove

Processes can individually be reset or removed.

  • Reset: the coroutine is reset and will start executing its code from the beginning when scheduled. All hardware-based timers for that process are disabled. The process can use its initialisation code to restore a specific state as needed, and restart from there, to implement a recovery in lieu of a full reset.

  • Remove: the process is removed from the list, and thus not scheduled again. The process's finalise error handler is executed, e.g. to free resources, or to release a claimed semaphore.
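The difference between reset and remove can be sketched with Python generators standing in for coroutines. This is an illustrative model under stated assumptions, not the API of module Processes: the names Process, Scheduler, install, run_round, and counter are hypothetical, and the disabling of hardware timers on reset is left out.

```python
class Process:
    """A process wrapping a coroutine, modelled here as a Python generator."""

    def __init__(self, name, body, finalise=None):
        self.name = name
        self.body = body            # generator function holding the process code
        self.finalise = finalise    # optional handler run when the process is removed
        self.coroutine = body()

    def reset(self):
        """Discard the current coroutine state; the code starts from the beginning."""
        self.coroutine = self.body()


class Scheduler:
    """Minimal round-robin scheduler over a list of processes."""

    def __init__(self):
        self.processes = []

    def install(self, process):
        self.processes.append(process)

    def remove(self, process):
        """Take the process off the list and run its finalise handler, if any."""
        self.processes.remove(process)
        if process.finalise is not None:
            process.finalise()

    def run_round(self):
        """Resume every process once, up to its next yield; collect the yielded values."""
        return [next(p.coroutine) for p in self.processes]


def counter():
    """Example process body: counts how often it has been scheduled."""
    n = 0
    while True:
        n += 1
        yield n
```

A process that has been scheduled twice yields 2; after reset() it yields 1 again, because its code restarts from the beginning. A removed process no longer appears in run_round(), and its finalise handler has run exactly once.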

Error Recovery Approach

Basics

While single processes could be targeted for recovery upon failure, the current implementation uses a system-level approach for simplicity. If needed, very sophisticated error recovery approaches could be implemented, but this gets complex and hairy pretty quickly.

Panic Mode

It’s important to remember that the system is assumed to run unsupervised by any human, and autonomously tries to get itself into an operational and stable state. The system is in panic mode, and quick and possibly “harsh” measures can be warranted. Nonetheless, some subtlety is possible, short of just resetting and reloading the system in each case.

System OK

The audit process checks that the system is stable (again). If no error condition is detected, and handled, for a specified number of process runs, it will reset the system-wide error conditions as explained below.

Error State and Restart Counter

Module Errors maintains an error state condition, which is part of the system control register in the FPGA and thus survives a system reset. This error condition is zero with a stable system, and is then incremented with each subsequent run-time error, thus escalating the severity of the error handling. The audit process will reset this error condition to zero if, or when, the system is stable again.

The top of the escalating error handling chain is a system restart, i.e. reloading the system from disk. Module Errors maintains a counter (also part of the system control register) of the number of restarts due to error handling. This makes it possible to prevent the system from reloading endlessly without ever reaching a stable state again. The audit process will reset this restart counter if, or when, the system is stable again.
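The role of the audit process can be sketched as follows. This is a Python model for illustration, not the actual audit process: the class AuditProcess, the parameter required_clean_runs, and the dictionary holding the two conditions are hypothetical, and the sketch assumes that "process runs" means runs of the audit process itself.

```python
class AuditProcess:
    """Clears the error conditions after enough consecutive error-free runs."""

    def __init__(self, required_clean_runs):
        self.required_clean_runs = required_clean_runs
        self.clean_runs = 0

    def run(self, conditions, error_handled):
        """One audit run.

        conditions: dict with the keys 'error_state' and 'restarts'.
        error_handled: True if a run-time error was handled since the last run.
        Returns True if the system is considered stable and the conditions
        have been cleared.
        """
        if error_handled:
            self.clean_runs = 0  # start counting error-free runs again
            return False
        self.clean_runs += 1
        if self.clean_runs >= self.required_clean_runs:
            conditions["error_state"] = 0
            conditions["restarts"] = 0
            return True
        return False
```

Any handled error restarts the count, so both conditions are cleared only after an uninterrupted series of error-free runs.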

Stages

As mentioned above, the error handling and system recovery must be designed and adjusted for the specific control application. As implemented now, the error handling goes through the following stages of escalating corrective measures, unless the audit process resets the error state to zero.

  1. If the error state is 0, all processes are reset. Child processes are removed, to allow their parent process to set them up properly again.
  2. If the error state is incremented to 1, all Processes.SystemProc and Processes.EssentialProc processes are reset, and child processes are removed. Processes.OtherProc processes are removed.
  3. If the error state is incremented to 2, and the system has not been restarted yet, as indicated by the restart counter, the system is hardware-reset and reloaded from disk, using the currently selected start-up table to restart the corresponding control program.
  4. If the error state is 2, and the system has been restarted once, the system is restarted again, now using start-up table 0, to restart the initial control program.

After that sequence, autonomous error recovery has to give up, and the system is halted.
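The escalation through these stages can be sketched as a small decision function. This is a Python model, not the Oberon code of module Errors: the names ErrorConditions and next_measure are hypothetical, and the sketch assumes that the error state is incremented after the measure for the current state has been selected, which the description above does not spell out.

```python
from dataclasses import dataclass


@dataclass
class ErrorConditions:
    """Hypothetical stand-in for the two fields kept in the system control register."""
    error_state: int = 0
    restarts: int = 0


def next_measure(cond):
    """Select the corrective measure for the current error and escalate the conditions."""
    if cond.error_state == 0:
        measure = "reset all processes, remove child processes"
        cond.error_state = 1
    elif cond.error_state == 1:
        measure = ("reset SystemProc and EssentialProc processes, "
                   "remove child and OtherProc processes")
        cond.error_state = 2
    elif cond.restarts == 0:
        measure = "restart using the current start-up table"
        cond.restarts = 1
    elif cond.restarts == 1:
        measure = "restart using start-up table 0"
        cond.restarts = 2
    else:
        measure = "halt"
    return measure
```

Calling next_measure repeatedly on the same ErrorConditions object, without the audit process clearing it in between, walks through the four stages in order and ends in the halt. Since both fields survive a system reset, the sequence continues across restarts.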

Note that halting the system might not be appropriate, depending on the overall design of the control system as well as the controlled system. Well-designed controlled systems often go into a defined state without the control system present or active. To achieve this, a mechanism to take the control system completely out of the control loop might be required as part of the hardware design, disconnecting any actuators.

Also, restarting the initial control program (start-up table 0) may or may not be appropriate, for example if the controlled system is physically in a state where the initial control program is useless, or even detrimental.

Manual Kill/Abort Push Button

Pressing the manual kill/abort push button results in resetting the Processes.SystemProc processes and removing all others.

Logging

Run-time errors and any corrective measures are logged, using the FPGA-based logger. Logging into the FPGA is fast, and the log entries survive a system reset. By default, no error messages are printed to the console, as this would be too time-consuming. The log entries can be displayed using a separate module and command.


  1. Which is why the interrupt does not return.