Run-time Errors – Oberon RTS

Context

It is assumed that the control program and its logic were carefully designed and thoroughly tested. Nonetheless, programs can contain defects. Also, an unsupervised control system often runs in “hostile” environments such as factory floors. Peripheral devices may be connected via cables outside the controller’s housing, for example. Connection issues, or glitches in data transmission, can result in missing or erroneous data, and thus induce run-time errors that are not due to the programmed logic.

Note that simply aborting a faulty program or process, or part thereof, is not an option. The system must attempt to autonomously get the control program and its processes into a stable and predictable state again.

Overview

Run-time errors are detected and handled by two mechanisms:

Traps: traps are run-time checks inserted by the Oberon compiler, and are hence detected by the running program itself. They are handled by the handler Errors.trap, which is installed at a defined, fixed memory address and is invoked by a BL instruction, not as an interrupt handler (unlike on, say, a Cortex-M processor).
Aborts: aborts are detected and registered by hardware, since they cover error conditions that the running program cannot detect, or only with massive overhead. They are handled by Errors.reset.

Handled Run-time Errors

Traps

The Oberon compiler inserts the following run-time checks, resulting in traps if violated:

array index out of bounds
type guard failure
array or string copy overflow
access via NIL pointer
illegal procedure call
integer division by zero
assertion violation
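
To illustrate what such a compiler-inserted check amounts to, here is a minimal Python model. It is only a conceptual sketch: the names Trap, TrapError, checked_index, and checked_div are hypothetical, the trap numbers simply follow the order of the list above (an assumption, not taken from the compiler), and an exception stands in for the BL to the trap handler.

```python
from enum import IntEnum


class Trap(IntEnum):
    """Trap kinds from the list above; the numeric values are an assumption."""
    INDEX_OUT_OF_BOUNDS = 1
    TYPE_GUARD_FAILURE = 2
    COPY_OVERFLOW = 3
    NIL_ACCESS = 4
    ILLEGAL_PROCEDURE_CALL = 5
    DIVISION_BY_ZERO = 6
    ASSERTION_VIOLATION = 7


class TrapError(Exception):
    """Stand-in for the branch to the trap handler."""

    def __init__(self, trap):
        super().__init__(trap.name)
        self.trap = trap


def checked_index(array, i):
    # The check runs before the access, as a compiler-inserted check would.
    if not 0 <= i < len(array):
        raise TrapError(Trap.INDEX_OUT_OF_BOUNDS)
    return array[i]


def checked_div(x, y):
    # Integer division with the divisor checked first.
    if y == 0:
        raise TrapError(Trap.DIVISION_BY_ZERO)
    return x // y
```

The point of the model is that the check is part of the program's own instruction stream, so a violation is detected by the running program itself and not by separate hardware.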

Aborts

Watchdog: the watchdog is a hardware device that detects a process that does not yield control in time, or not at all anymore. The watchdog timer must be reset by software before it expires.

The watchdog timer is reset by the scheduler before each execution of a process. It can be disabled by a process. For example, commands are executed with the watchdog disabled.
Stack Overflow: the stack overflow monitor is a hardware device that detects when the stack pointer reaches into the so-called hot zone of the stack memory area, or even outside this area.
Not Alive Process: using the timeout functionality (see module Processes), a process can detect if it is “not alive”, i.e. not being activated by the scheduler within a defined time period. The process can then attempt to handle the situation. For example, a hardware device may just need to be queried again. As an ultimate measure, an abort can be triggered.
Kill/abort Push Button: the manual kill/abort push button triggers an abort.

Note: the abort button of the stock Embedded Project Oberon system, which hardware-resets the system, but does not reload the system software, has been replaced by the above kill/abort button, plus a reset button that hardware-resets the system, and always reloads the system software.
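
The watchdog behaviour described above can be modelled in a few lines of Python. This is a simulation under stated assumptions, not the hardware or the scheduler of the actual system: the class Watchdog, the function run_processes, and the tick-based timing are all hypothetical.

```python
class Watchdog:
    """Simulated watchdog timer, counting down in abstract ticks."""

    def __init__(self, timeout_ticks):
        self.timeout = timeout_ticks
        self.remaining = timeout_ticks
        self.enabled = True

    def reset(self):
        # Software restarts the countdown.
        self.remaining = self.timeout

    def tick(self, n=1):
        """Advance time by n ticks; return True if the watchdog expired."""
        if not self.enabled:
            return False
        self.remaining -= n
        return self.remaining <= 0


def run_processes(processes, watchdog):
    """Run (name, duration) pairs; return the name of the process that
    triggered an abort, or None if all of them yielded in time."""
    for name, duration in processes:
        watchdog.reset()              # the scheduler resets before each execution
        if watchdog.tick(duration):   # the process ran too long without yielding
            return name
    return None
```

With a timeout of 5 ticks, run_processes([("a", 2), ("b", 3)], Watchdog(5)) returns None, while a process that runs for 7 ticks is reported as the offender. Setting enabled to False models a process, or a command, running with the watchdog disabled.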

Handlers

Trap Handler

When encountering a trap, the trap handler Errors.trap is called. This handler registers the error number and error code address in the system control and status registers (SCS), and resets the system. The bootloader then does not load and restart the system, but immediately branches to Errors.reset, similar to the abort handler in Embedded Project Oberon.

Abort Handler

When an abort error is detected by the hardware, it registers the error number and error code address in the system control and status registers (SCS), and resets the system. Thereafter, the procedure is the same as for traps.

Error Handler

The error handler Errors.error implements the error recovery. It does not return: either the system is reset or restarted, or the scheduling is reset and the processes start executing again, depending on the recovery conditions and procedure.

Error in Error Handling

If a trap or abort occurs during error handling, a system fault is logged, and the system is restarted.

Should any of the error handlers return, a system fault is logged, and the system is restarted.
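
The control flow of the handlers can be summarised as a small Python model. It is a sketch of the flow as described in this section, not the code of module Errors: the dictionary scs stands in for the SCS registers, and the names register_error, after_reset, TrapDuringRecovery, and the action strings are hypothetical. Since a Python function has to return, the "does not return" property is modelled by having the recovery function return the action it would take.

```python
class TrapDuringRecovery(Exception):
    """Stand-in for a trap or abort raised while recovery is running."""


def register_error(scs, error_no, addr):
    # Trap handler or abort hardware: record the error in the simulated SCS.
    scs["error_no"] = error_no
    scs["addr"] = addr


def after_reset(scs, log, recover):
    """Decide what happens after a reset; return the resulting action."""
    if scs.get("error_no", 0) == 0:
        return "load"  # no error registered: load and start the system
    try:
        # An error is registered: skip loading and go straight to recovery.
        action = recover(scs["error_no"], scs["addr"])
    except TrapDuringRecovery:
        log.append("system fault: error during error handling")
        return "restart"
    if action not in ("restart", "reschedule"):
        # The handler came back without a decision, i.e. it "returned".
        log.append("system fault: error handler returned")
        return "restart"
    log.append(f"error {scs['error_no']} at {scs['addr']:#x}: {action}")
    return action
```

Both failure paths end in a logged system fault and a restart, which mirrors the rule that an error in error handling is never handled recursively.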

Policy and Concepts vs. Mechanisms

Separation of Concerns

The policy, concepts, and implementation for handling all run-time errors are contained in module Errors. Module Processes, which encompasses the definitions and implementation for the coroutine-based processes, provides the required mechanisms.

Process Recover, Reset, Remove

Processes can individually be reset or removed.

Reset: the coroutine is reset and will start executing its code from the beginning when scheduled. All hardware-based timers for that process are disabled. The process can use its initialisation code to restore a specific state as needed, and restart from there, to implement a recovery in lieu of a full reset.
Remove: the process is removed from the list, and thus not scheduled again.
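
The difference between the two operations can be shown with Python generators standing in for coroutines. This is an illustrative model only: the classes Process and Scheduler, the flag timer_enabled, and the example process counter are hypothetical and do not reflect the interface of module Processes.

```python
class Process:
    """A process whose code is a generator function (a simple coroutine)."""

    def __init__(self, name, code):
        self.name = name
        self.code = code
        self.coroutine = code()
        self.timer_enabled = True  # stand-in for the hardware-based timers

    def reset(self):
        # A fresh coroutine starts executing from the beginning when scheduled.
        self.coroutine = self.code()
        self.timer_enabled = False


class Scheduler:
    def __init__(self):
        self.processes = []

    def install(self, process):
        self.processes.append(process)

    def remove(self, process):
        # Off the list, hence never scheduled again.
        self.processes.remove(process)

    def run_once(self):
        """Resume each process until its next yield; collect the results."""
        return [(p.name, next(p.coroutine)) for p in list(self.processes)]


def counter():
    # Example process: initialisation code, then an endless loop that yields.
    n = 0
    while True:
        n += 1
        yield n
```

After two scheduler rounds with processes a and b, resetting a makes it yield 1 again while b continues with 3; removing b leaves only a in the following rounds. The initialisation code before the loop is where a process would restore a specific state to implement a recovery.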

Error Recovery Approach

TBD. See the GitHub repo for the currently implemented approach, mainly module Errors.

Logging

Run-time errors and any corrective measures are logged, using the FPGA-based logger. Logging into the FPGA is fast, and the log entries survive a system reset. By default, no error messages are printed to the console, as this would be too time-consuming. The log entries can be displayed using a separate module and command.