Context
It is assumed that the control program and its logic were carefully designed and thoroughly tested. Nonetheless, programs can contain defects. Also, an unsupervised control system often runs in “hostile” environments such as factory floors. Peripheral devices may be connected via cables outside the controller’s housing, for example. Connection issues, or glitches in data transmission, can result in missing or erroneous data, and thus induce run-time errors that are not due to the programmed logic.
Note that simply aborting a faulty program or process, or part thereof, is not an option. The system must attempt to autonomously get the control program and its processes into a stable and predictable state again.
Overview
Run-time errors are detected and handled by two mechanisms:
-
Traps: traps are run-time checks inserted by the Oberon compiler, hence detected by the running program itself. They are handled by the handler Errors.trap, which is installed at a defined, fixed memory address, and is invoked by a BL instruction, not as an interrupt handler (unlike, say, a Cortex-M processor).
-
Aborts: aborts are detected and registered by hardware, since they handle error conditions that cannot (or only with a massive overhead) be detected by the running program. They are handled by Errors.reset.
Handled Run-time Errors
Traps
The Oberon compiler inserts the following run-time checks, resulting in traps if violated:
- array index out of bounds
- type guard failure
- array or string copy overflow
- access via NIL pointer
- illegal procedure call
- integer division by zero
- assertion violation
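To make the idea of compiler-inserted checks concrete, the following is a minimal sketch in Python, not the Oberon compiler's actual output. The names `TrapKind`, `Trap`, and the `checked_*` helpers are hypothetical, and the numeric trap codes are an assumption that simply follows the order of the list above. Raising an exception stands in for the branch to the trap handler.

```python
from enum import IntEnum


class TrapKind(IntEnum):
    """Trap causes from the list above; the numeric values are assumed."""
    ARRAY_INDEX = 1
    TYPE_GUARD = 2
    COPY_OVERFLOW = 3
    NIL_ACCESS = 4
    ILLEGAL_CALL = 5
    DIV_BY_ZERO = 6
    ASSERTION = 7


class Trap(Exception):
    """Stands in for the branch to the trap handler (hypothetical model)."""

    def __init__(self, kind):
        super().__init__(kind.name)
        self.kind = kind


def checked_index(array, i):
    # Index check, conceptually placed before an element access a[i].
    if not 0 <= i < len(array):
        raise Trap(TrapKind.ARRAY_INDEX)
    return array[i]


def checked_div(x, y):
    # Division check, conceptually placed before an integer division.
    if y == 0:
        raise Trap(TrapKind.DIV_BY_ZERO)
    return x // y


def checked_deref(pointer):
    # NIL check, conceptually placed before an access via a pointer.
    if pointer is None:
        raise Trap(TrapKind.NIL_ACCESS)
    return pointer
```

The point of the sketch is that the running program detects these errors itself, at the exact place where the violation happens, which is why no hardware support is needed for traps.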
Aborts
-
Watchdog: the watchdog is a hardware device that detects a process that does not yield control in time, or not at all. The watchdog timer must be reset by software before it expires.
The watchdog timer is reset by the scheduler before each execution of a process. It can be disabled by a process. For example, commands are executed with the watchdog disabled.
-
Stack Overflow: the stack overflow monitor is a hardware device that detects when the stack pointer reaches into the so-called hot zone of the stack memory area, or even outside this area.
-
Not Alive Process: using the timeout functionality (see module Processes), a process can detect if it is “not alive”, i.e. not being activated by the scheduler within a defined time period. The process can then attempt to handle the situation. For example, a hardware device may just need to be queried again. As a last resort, an abort can be triggered.
-
Kill/abort Push Button: the manual kill/abort push button triggers an abort.
Note: the abort button of the stock Embedded Project Oberon system, which hardware-resets the system, but does not reload the system software, has been replaced by the above kill/abort button, plus a reset button that hardware-resets the system, and always reloads the system software.
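The interplay between scheduler and watchdog described above can be modelled as follows. This is a simplified sketch, not the actual hardware or scheduler: the `Watchdog` class, the `run_schedule` function, and the tick-based timing are all assumptions made for illustration.

```python
class Watchdog:
    """Hypothetical model of a hardware watchdog counting timer ticks."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.enabled = True
        self.expired = False

    def reset(self):
        # Called by the scheduler before each execution of a process.
        self.remaining = self.timeout_ticks

    def tick(self, ticks=1):
        # Time passing while a process runs; expiry stands for an abort.
        if self.enabled and not self.expired:
            self.remaining -= ticks
            if self.remaining <= 0:
                self.expired = True


def run_schedule(watchdog, processes):
    """Run each (name, duration) once; return the name that caused an abort."""
    for name, duration in processes:
        watchdog.reset()
        watchdog.tick(duration)
        if watchdog.expired:
            return name
    return None
```

With a timeout of 10 ticks, a process that runs for 9 ticks passes, while one that holds control for 10 ticks or more lets the watchdog expire. Setting `enabled` to `False` models a process that disables the watchdog, as is done for commands.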
Handlers
Trap Handler
When a trap is encountered, the trap handler Errors.trap is called. This handler registers the error number and error code address in the system control and status registers (SCS), and resets the system. The bootloader does not load and restart the system, but immediately branches to Errors.reset, similar to the abort handler in Embedded Project Oberon.
Abort Handler
When an abort error is detected by the hardware, it registers the error number and error code address in the system control and status registers (SCS), and resets the system. Thereafter, the procedure is the same as for traps.
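The common path of traps and aborts can be sketched as a small model. Everything here is an assumption for illustration: the `SCS` class, its two fields, the convention that error number 0 means "no error pending", and the returned strings. Only the sequence (register the error, reset, let the bootloader decide) is taken from the text.

```python
class SCS:
    """Hypothetical stand-in for the system control and status registers."""

    def __init__(self):
        self.error_number = 0   # 0 is assumed to mean "no error pending"
        self.error_address = 0


def trap(scs, error_number, address):
    # Trap path: software records the error, then resets the system.
    scs.error_number = error_number
    scs.error_address = address
    return boot(scs)


def abort(scs, error_number, address):
    # Abort path: in the real system the hardware records the error.
    scs.error_number = error_number
    scs.error_address = address
    return boot(scs)


def boot(scs):
    # The registers keep their contents across the reset in this model,
    # so the bootloader can tell a cold start from an error restart.
    if scs.error_number != 0:
        return "branch to Errors.reset"
    return "load and start system"
```

The two entry points differ only in who writes the registers. After the reset, the bootloader sees the same information in both cases, which is why the procedure is shared from that point on.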
Error Handler
The error handler Errors.error implements the error recovery. It does not return: either the system is reset or restarted, or the scheduling is reset and starts to execute the processes depending on the recovery conditions and procedure.
Error in Error Handling
If a trap or abort occurs during error handling, a system fault is logged, and the system is restarted.
Should any of the error handlers return, a system fault is logged, and the system is restarted.
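These two rules can be expressed as a guard around the recovery code. The sketch below is a model under stated assumptions: `ControlTransfer`, `handle_error`, and the log format are hypothetical, and a Python exception stands in for a non-returning transfer of control such as a system restart or a scheduler reset.

```python
class ControlTransfer(Exception):
    """Models a non-returning transfer (system restart or scheduler reset)."""


def handle_error(recover, log):
    """Run recover() and return the resulting action as a string."""
    try:
        recover()
    except ControlTransfer as transfer:
        # Normal outcome: the handler handed control on and never returned.
        return transfer.args[0]
    except Exception:
        # A trap or abort occurred while handling the original error.
        log.append("system fault: error during error handling")
        return "restart"
    # The handler returned, which must never happen.
    log.append("system fault: error handler returned")
    return "restart"
```

A recovery routine that ends with `raise ControlTransfer("scheduler reset")` produces no log entry, while one that fails or simply returns leads to a logged system fault and a restart.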
Policy and Concepts vs. Mechanisms
Separation of Concerns
The policy, concepts, and implementation for handling all run-time errors are contained in module Errors. Module Processes, which encompasses the definitions and implementation for the coroutine-based processes, provides the required mechanisms.
Process Recover, Reset, Remove
Processes can individually be reset or removed.
-
Reset: the coroutine is reset and will start executing its code from the beginning when scheduled. All hardware-based timers for that process are disabled. The process can use its initialisation code to restore a specific state as needed, and restart from there, to implement a recovery in lieu of a full reset.
-
Remove: the process is removed from the list, and thus not scheduled again.
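The difference between reset and remove can be illustrated with Python generators standing in for coroutines. This is a sketch only; the `Process` and `Scheduler` classes and their methods are hypothetical and do not reflect the interface of module Processes.

```python
class Process:
    """Hypothetical coroutine-based process built on a Python generator."""

    def __init__(self, name, body):
        self.name = name
        self.body = body            # generator function: the process code
        self.coroutine = body()
        self.timers_enabled = True

    def reset(self):
        # Fresh coroutine: the next activation starts at the beginning.
        self.coroutine = self.body()
        # Hardware-based timers for this process are disabled on reset.
        self.timers_enabled = False


class Scheduler:
    def __init__(self):
        self.processes = []

    def add(self, process):
        self.processes.append(process)

    def remove(self, process):
        # A removed process is never scheduled again.
        self.processes.remove(process)

    def run_once(self):
        """Activate every process once; return what each one yielded."""
        results = []
        for process in list(self.processes):
            results.append((process.name, next(process.coroutine)))
        return results


def counter():
    # Initialisation code: runs again after every reset.
    count = 0
    while True:
        count += 1
        yield count
```

After a reset, the `counter` process starts counting from 1 again because its initialisation code is re-executed. A process that needs to recover a specific state instead of starting from scratch would restore that state in the same place.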
Error Recovery Approach
TBD. See the GitHub repo for the currently implemented approach, mainly module Errors.
Logging
Run-time errors and any corrective measures are logged, using the FPGA-based logger. Logging into the FPGA is fast, and the log entries survive a system reset. By default, no error messages are printed to the console, as this would be too time-consuming. The log entries can be displayed using a separate module and command.