Error Detection and Recovery

Embedded Oberon

The Oberon compiler inserts the following run-time checks, resulting in traps if violated:

  1. array index out of bounds
  2. type guard failure
  3. array or string copy overflow
  4. access via NIL pointer
  5. illegal procedure call
  6. integer division by zero
  7. assertion violation

The trap number corresponds to the item number in this list. The trap handler in System.mod prints a message, and calls Oberon.Reset. If the trap was caused by a task, it is removed from the task list, hence not invoked again by the Loop. Oberon.Reset also resets the Loop, resetting the stack pointer to its startup value.
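
As a small illustration of this numbering, the mapping from trap number to message could be written as below. This is only a sketch: the module name TrapText and the procedure Write are hypothetical, and the actual handler in System.mod may format its messages differently.

```oberon
MODULE TrapText;  (* hypothetical helper, not part of Embedded Oberon *)

  IMPORT Out;

  (* Writes the description for trap numbers 1..7, in the order of the list above. *)
  PROCEDURE Write*(trap: INTEGER);
  BEGIN
    IF trap = 1 THEN Out.String("array index out of bounds")
    ELSIF trap = 2 THEN Out.String("type guard failure")
    ELSIF trap = 3 THEN Out.String("array or string copy overflow")
    ELSIF trap = 4 THEN Out.String("access via NIL pointer")
    ELSIF trap = 5 THEN Out.String("illegal procedure call")
    ELSIF trap = 6 THEN Out.String("integer division by zero")
    ELSIF trap = 7 THEN Out.String("assertion violation")
    ELSE Out.String("unknown trap")
    END
  END Write;

END TrapText.
```

With this, TrapText.Write(6) would write the text for the division-by-zero trap that appears as TRAP 6 in the demo output further down.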

Assessment

This approach is reasonable and sufficient for a user-supervised system running in a controlled environment, such as an office or lab. It is not sufficient for an unsupervised control system, which often runs in a more “hostile” environment such as a factory floor. Peripheral devices may be connected via cables outside the controller’s housing, for example. Connection issues, or glitches in data transmission, can result in missing or erroneous data, and thus in run-time errors that are not due to the programmed logic.

Also, simply removing one process might leave the overall system in an inconsistent or inoperable state, where essential functions of the control process are not executed anymore, which then results in errors or malfunctions in the controlled system.

A Basic Error Recovery Approach

Just as Embedded Oberon attempts to get the system into a stable and predictable state again by removing the faulty task and restarting the Loop, Oberon RTS should attempt to get the control program and its processes into a stable and predictable state.

Considering that run-time errors can be caused by rare, hazardous events in the processor’s environment, as outlined above, the following recovery attempts can be pursued by Oberon RTS:

  • reset and restart the faulty process
  • reset and restart the whole system and control program

Faulty processes could also simply be shut down if they are not essential for the system’s operations, e.g. a process showing the current time. Even a process driving a display might not be essential for the control process proper, in particular in an error situation.

Of course, we also need a mechanism that breaks repeated process or system restarts within a defined period of time, as well as logging and alarm facilities to support and enable an operator to get on-site and investigate and fix the issue.

To get off the ground with a basic solution, let’s focus on the moment when the system is in panic mode, trying to achieve a stable state, without lots of bells and whistles.

Process Reset and Restart

A control process usually holds some state, be it implemented by module variables for a set of tasks that constitute the process, or by a coroutine itself. Upon a run-time error, a process should be reset to get it into a defined state (in the literal sense).

The trap handler calls Oberon.Reset, which therefore serves as the entry point for the recovery procedure. Oberon.Reset is also called from the abort handler System.Abort. A trap stems from a run-time error, while an abort signifies a user interaction, so the two may call for different handling strategies. Oberon.Reset therefore gets an integer parameter that identifies the origin of the call, which requires a small corresponding change in System.mod.

To avoid endlessly resetting and restarting the same process, in case the problem persists, each process gets a restart counter. Should that restart counter exceed a maximum limit, as a next step to get the system stable, the system is reset and restarted. The on-startup facility to autoload modules (to be described elsewhere) then restarts the whole control program.

Here’s the gist of the corresponding code:

MODULE Oberon;

  PROCEDURE Reset*(origin: INTEGER);
    VAR p1: Process;
  BEGIN
    IF origin = 0 THEN (* trap *)
      IF (cp # NIL) & (cp.state = Active) THEN (* it's a process *)
        p1 := cp;
        RemoveProc(p1); (* also resets the process *)
        IF p1.numRestarts < MaxNumRestarts THEN
          InstallProc(p1); INC(p1.numRestarts)
        ELSE
          SysCtrl.IncNumRestarts; (* count the number of system restarts, see below *)
          Restart
        END
      END
    ELSE
      (* handle reset via abort *)
    END;
    SYSTEM.LDREG(14, Kernel.stackOrg);
    Loop
  END Reset;

END Oberon.

As of now, there is no facility to decrease the restart counter of a process over time. To be added.
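
One possible shape for such a facility is sketched below, purely as an assumption about how it could work: the counter is cleared once a process has run without a restart for a quiet period. The module RestartDecay, the record Counter, its field lastRestart, and the QuietPeriod value are all hypothetical; only numRestarts and MaxNumRestarts correspond to names in the code above. Timer wrap-around is ignored.

```oberon
MODULE RestartDecay;  (* hypothetical sketch, not part of Oberon RTS *)

  CONST
    MaxNumRestarts* = 3;
    QuietPeriod = 60000;  (* assumed: ms without a restart after which the counter is cleared *)

  TYPE
    Counter* = RECORD
      numRestarts*: INTEGER;
      lastRestart*: INTEGER  (* time of the most recent restart, in ms *)
    END;

  (* Returns TRUE if the process may be restarted once more at time now,
     and records that restart. Returns FALSE once the limit is reached. *)
  PROCEDURE MayRestart*(VAR c: Counter; now: INTEGER): BOOLEAN;
    VAR ok: BOOLEAN;
  BEGIN
    IF now - c.lastRestart >= QuietPeriod THEN c.numRestarts := 0 END;
    ok := c.numRestarts < MaxNumRestarts;
    IF ok THEN INC(c.numRestarts); c.lastRestart := now END;
    RETURN ok
  END MayRestart;

END RestartDecay.
```

In Oberon.Reset, a call such as MayRestart(counter, Oberon.Time()) could then take the place of the plain comparison against MaxNumRestarts, with a FALSE result leading to the system restart.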

System Reset and Restart

Embedded Oberon does not have functionality to reset and restart the whole system from software.

Manually pressing the abort button on the target board resets the RISC5 CPU and peripherals via the rst line, which causes the CPU to start executing the bootloader as described here. However, due to the check SYSTEM.REG(LNK) = 0 in the body of the bootloader, the system software will not be reloaded from the SD card, but execution will continue by directly calling the abort handler in System.mod, installed at address 0, which calls Oberon.Reset, as described above. (Address 0 would be the entry point to the body of Modules if it had just been loaded from the boot file.)

To implement a system reset and reload, it seemed best to stay in-line and compatible with the abort procedure:

  • reset: invoke a RISC5 processor reset in the FPGA, just as the abort button does,
  • reload: but don’t skip reloading of the system software from the SD card.

To configure and control the reset and startup process, the System Control Register is added to the FPGA. It is accessed by module SysCtrl.mod. The System Control Register is also where the number of system restarts is counted (see SysCtrl.IncNumRestarts above).

Reset

Bit [0] of the System Control Register controls the system reset.

module RISC5Top
  /* ... */

  // system control register
  reg [23:0] sysCtrlReg = 24'b0;
  always @(posedge clk) begin
    sysCtrlReg <= ~rst ? {sysCtrlReg[23:1], 1'b0} : (wr & (ioenb & iowadr == 239)) ? outbus[23:0] : sysCtrlReg; 
  end

  // reset
  wire rstSig = (cnt[4:0] == 0) & limit;    // limit is the 1ms timer output
  wire rstTrig = ~(btn[3] | sysCtrlReg[0]);
  always @(posedge clk) begin
    rst <= rstSig ? rstTrig : rst;
  end

endmodule

Setting sysCtrlReg[0] to logic one will invoke the reset, and it will be set back to zero upon reset. This mimics the abort button press, that is, the CPU will start to execute the bootloader.

Reload

In order to keep all the reset and reload logic together, in lieu of using the link register to determine whether the system should be reloaded from the SD card, the System Control Register defines two bits:

  • bit [1]: if set, skip (re-)loading the system files
  • bit [2]: if set, skip the initialisation of the SD card

By using the “skip” logic, the System Control Register initialised to zero results in the same behaviour as with Embedded Oberon.

The possibility to skip the re-initialisation of the SD card is probably not required, as the card must accept an initialisation sequence also when in the initialised state, according to SD specs. My experience with SD cards of different vendors shows that the cards can be pretty capricious, and this feature might come in handy, so I left it there for now.
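
To make the bit assignments concrete, here is a minimal sketch of what the register access in SysCtrl.mod might look like. The actual module is not shown in this text, so treat this as an assumption: the procedure and constant names are taken from their use in Oberon.Restart, the bit numbers from the description above, and the address from the bootloader constant SysCtrlRegAdr. SysCtrl.IncNumRestarts is left out because the position of the restart counter within the register is not given here.

```oberon
MODULE SysCtrl;  (* sketch only, the real module may differ *)

  IMPORT SYSTEM;

  CONST
    RegAdr = -68;       (* same value as SysCtrlRegAdr in the bootloader *)
    Reset* = 0;         (* bit [0]: trigger a system reset *)
    SkipLoad* = 1;      (* bit [1]: skip (re-)loading the system files *)
    SkipCardInit* = 2;  (* bit [2]: skip the initialisation of the SD card *)

  (* Reads the System Control Register as a set of bits. *)
  PROCEDURE GetReg*(VAR x: SET);
  BEGIN
    SYSTEM.GET(RegAdr, x)
  END GetReg;

  (* Writes the System Control Register. *)
  PROCEDURE SetReg*(x: SET);
  BEGIN
    SYSTEM.PUT(RegAdr, x)
  END SetReg;

END SysCtrl.
```

Reading the register first and then writing back x + {SysCtrl.SkipLoad} would, under these assumptions, request that the next reset keeps the system software in memory, which is the behaviour of the abort button in Embedded Oberon.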

The bootloader checks the System Control Register:

MODULE* BootLoad;

  (* ... *)

  CONST
    SysCtrlRegAdr = -68;
    SkipReload = 1;
    SkipCardInit = 2;

  (* ... *)

BEGIN
  (* ... *)
  IF ~SYSTEM.BIT(SysCtrlRegAdr, SkipReload) THEN
    IF ~SYSTEM.BIT(SysCtrlRegAdr, SkipCardInit) THEN
      InitSPI
    END;
    LoadFromDisk
  END;
  (* ... *)
END BootLoad.

With all this in place, we can now implement a software-initiated reset and reload of the system.

Oberon.Restart

Oberon.Restart is called from Oberon.Reset, as outlined above. It can also be executed as command.

MODULE Oberon;

  PROCEDURE Restart*;
    VAR x: SET;
  BEGIN
    Texts.WriteLn(W); Texts.WriteString(W, "RESTART"); Texts.WriteLn(W);
    SysCtrl.GetReg(x);
    SysCtrl.SetReg(x + {SysCtrl.Reset} - {SysCtrl.SkipLoad, SysCtrl.SkipCardInit});
    REPEAT UNTIL FALSE
  END Restart;

END Oberon.

Error Recovery Revisited

Above, we have this list of error recovery attempts:

  1. Reset and restart the faulty process
  2. Reset and restart the whole system and control program

Oberon.Reset implements the first step. It also counts the number of resets for a faulty process, and initiates a system reset and reload if that number exceeds a fixed limit (same for all processes for now).

Oberon.Restart implements the second step – partly. It resets the RISC5 processor via FPGA logic, which runs the bootloader, which in turn reloads the system files from the SD card. It does not reload the control programs, though. For this, we’ll need another small extension of Embedded Oberon, automatic program start upon system start, to be described elsewhere.

System Restart Counter

Oberon.Reset counts the number of system reloads using SysCtrl.IncNumRestarts. Note that only system reloads caused by run-time errors are counted thusly, not the ones via executing Oberon.Restart from the UI.

With the above reset and restart mechanics, a persistent error in a process results in endless system reloads. With the number of system restarts stored in the System Control Register, which survives a reload, this can be stopped. The place to check for repeated restarts is the body of Oberon, which is the entry point for the Outer Core. The required behaviour in case the system and the control programs cannot be stabilised by one or more system reloads is application specific.

For now, Oberon’s body just halts the system if there are too many system restarts.
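
The check itself could look roughly like the following sketch. It is written as a separate, hypothetical module StartCheck so that it stands on its own; in the text, the check is part of the body of Oberon. The limit MaxSysRestarts, the message text, and the call to WatchDog.Disable are assumptions, and the restart count is passed in as a parameter because the way it is read from the System Control Register is not shown here.

```oberon
MODULE StartCheck;  (* hypothetical, illustrates the check only *)

  IMPORT Out, WatchDog;

  CONST
    MaxSysRestarts = 3;  (* assumed limit *)

  (* numRestarts: the restart count read from the System Control Register. *)
  PROCEDURE Check*(numRestarts: INTEGER);
  BEGIN
    IF numRestarts > MaxSysRestarts THEN
      Out.String("too many system restarts, system halted"); Out.Ln;
      WatchDog.Disable;  (* assumption: keep the watchdog from restarting the halted system *)
      REPEAT UNTIL FALSE  (* halt *)
    END
  END Check;

END StartCheck.
```

An application would typically replace the halt with its own action, such as raising an alarm or driving the outputs to a safe state.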

Watchdog

Another run-time error condition that should be detected is a stuck process, that is, one executing an infinite loop without yielding control, thus bringing the whole system and control program to a halt.

To detect a stuck process, the FPGA is extended with a simple watchdog, that is, a timer that must be reset from software before it expires; otherwise it initiates a hardware-based action.

The watchdog triggers the RISC5 interrupt, and a handler in Oberon.mod takes over. Apart from writing a message to the console, it simply restarts the system using Oberon.Restart, also incrementing the system restart counter.

A more fine-grained error recovery would be to reset just the interrupted process, which is the stuck one, analogous to handling traps. But the interrupt routine would return to the point of interrupt, which is not a reasonable address: first, the process was just reset, and, second, without a reset, the return would be to the point in the code that caused the issue in the first place, i.e. the infinite loop.

If the return address of the interrupt handler could be changed by the handler itself,[1] it would be possible to kill and reset the faulty process, and return to the Loop, but that would imply a change to the RISC5 CPU. Not going there for now.

The watchdog is accessed via WatchDog.mod.

MODULE WatchDog;

  IMPORT SYSTEM;

  CONST 
    WatchDogAdr = -100;
    Timeout = 100; (* ms *)

  PROCEDURE* Reset*;
  BEGIN
    SYSTEM.PUT(WatchDogAdr, Timeout)
  END Reset;

  PROCEDURE* Disable*;
  BEGIN
    SYSTEM.PUT(WatchDogAdr, 0)
  END Disable;

END WatchDog.

The watchdog is integrated into Oberon.Loop. It is disabled during command and upload handling.

MODULE Oberon;

  PROCEDURE Loop;
  BEGIN
    IF Console.Available() THEN
      WatchDog.Disable;
      (* command and upload handling *)
    ELSE
      WatchDog.Reset;
      (* process scheduling *)
    END
  END Loop;

END Oberon.

Peripheral Device Timeouts

Yet another error condition is a peripheral device that does not reply within a timeout limit, including not replying at all (anymore). Timeouts in the context of devices are described here.

Device timeouts are reported back to the client process, which can take corrective measures, or simply ASSERT the error condition, which then results in a trap, entering the error recovery as described above.
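
Both options can be combined, as in the following sketch: the client retries a few times as its corrective measure, and ASSERTs if the device stays silent. Everything here is hypothetical and for illustration only: the module TimeoutDemo, the result codes, the retry budget, and ReadDevice, which is a stand-in that always reports a timeout rather than a real driver call.

```oberon
MODULE TimeoutDemo;  (* hypothetical sketch *)

  CONST
    NoError = 0; Timeout = 1;  (* assumed result codes *)
    MaxRetries = 2;            (* assumed retry budget *)

  (* Stand-in for a device driver call. A real driver would wait for the
     device and set res to Timeout if no reply arrives in time. *)
  PROCEDURE ReadDevice(VAR value, res: INTEGER);
  BEGIN
    value := 0; res := Timeout
  END ReadDevice;

  (* Retries the read up to MaxRetries times. If the device still does
     not reply, the ASSERT fails and raises trap 7, which enters the
     error recovery via the trap handler. *)
  PROCEDURE ReadChecked*(VAR value: INTEGER);
    VAR res, n: INTEGER;
  BEGIN
    n := 0;
    REPEAT
      ReadDevice(value, res); INC(n)
    UNTIL (res = NoError) OR (n > MaxRetries);
    ASSERT(res = NoError)
  END ReadChecked;

END TimeoutDemo.
```

Since the stand-in never succeeds, calling ReadChecked from a process would always end in the assertion trap, and the process would then be reset and restarted like any other faulty process.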

Other Error Detection

Lastly, there is yet another error condition: a process that never runs (anymore). In software, this can be taken care of by an audit process. An FPGA-based approach needs some more thinking.
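
A software audit could be built on heartbeat counters, as in the sketch below. This is one possible design, not something Oberon RTS provides: the module Audit and its procedures Watch, Beat and Check are hypothetical. Each supervised process calls Beat whenever it runs, and the audit process calls Check periodically to find processes whose counter has not moved since the previous check.

```oberon
MODULE Audit;  (* hypothetical sketch *)

  CONST MaxProcs = 8;  (* assumed upper bound on supervised processes *)

  VAR
    watched: SET;  (* ids of the processes under supervision *)
    beats, seen: ARRAY MaxProcs OF INTEGER;

  (* Puts process id, 0 <= id < MaxProcs, under supervision. *)
  PROCEDURE Watch*(id: INTEGER);
  BEGIN
    seen[id] := beats[id]; INCL(watched, id)
  END Watch;

  (* Called by process id each time it runs. *)
  PROCEDURE Beat*(id: INTEGER);
  BEGIN
    INC(beats[id])
  END Beat;

  (* Called periodically by the audit process. Returns the ids of all
     watched processes that have not run since the previous call. *)
  PROCEDURE Check*(): SET;
    VAR i: INTEGER; stale: SET;
  BEGIN
    stale := {};
    FOR i := 0 TO MaxProcs - 1 DO
      IF (i IN watched) & (beats[i] = seen[i]) THEN INCL(stale, i) END;
      seen[i] := beats[i]
    END;
    RETURN stale
  END Check;

  PROCEDURE Init;
    VAR i: INTEGER;
  BEGIN
    watched := {};
    FOR i := 0 TO MaxProcs - 1 DO beats[i] := 0; seen[i] := 0 END
  END Init;

BEGIN
  Init
END Audit.
```

The audit period has to be longer than the longest period of any watched process, otherwise a healthy but slow process would be reported as stale. What to do with a non-empty result (restart the process, restart the system, raise an alarm) is again application specific.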

Demo Trap Handling

MODULE TestTraps;

  IMPORT Out, Oberon;
    
  VAR
    p1: Oberon.Process;
    s1: ARRAY 1024 OF BYTE;
    
  PROCEDURE p1c;
    VAR i, k, now: INTEGER;
  BEGIN
    now := Oberon.Time();
    k := 2;
    Out.String("start p1"); Out.Ln;
    REPEAT
      Out.String("p1"); Out.Int(now, 10); Out.Ln;
      DEC(k);
      i := k DIV k; (* trap *)
      Oberon.NextProc;
      now := Oberon.Time()
    UNTIL FALSE
  END p1c;
    
  PROCEDURE Run*;
  BEGIN
    Oberon.InstallProc(p1)
  END Run;
  
BEGIN
  NEW(p1); Oberon.InitProc(p1, p1c, s1, 1000, 0)
END TestTraps.

The demo program creates this output in the Astrobe console:

start p1
p1    212208
p1    213208

TRAP 6 at pos 368 in TestTraps at 000126D8

Starting scheduler
start p1
p1    213214
p1    214214

TRAP 6 at pos 368 in TestTraps at 000126D8

Starting scheduler
start p1
p1    214220
p1    215220

TRAP 6 at pos 368 in TestTraps at 000126D8

Starting scheduler
start p1
p1    215226
p1    216226

TRAP 6 at pos 368 in TestTraps at 000126D8

RESTART

Oberon RTS 2020-07-07
Based on Embedded Oberon 2019-07-01
RISC5 version: 0D010005
System start
System status: 00010002
Starting scheduler

The maximum number of process restarts is set to three. Note that the scheduler is also reset and restarted, as with all Oberon.Reset operations.

Watchdog Demo

MODULE TestWatchdog;

  IMPORT Out, Oberon;

  VAR
    p1: Oberon.Process;
    s1: ARRAY 1024 OF BYTE;
  
  PROCEDURE p1c;
    VAR i, x: INTEGER;
  BEGIN
    Out.String("trigger watchdog"); Out.Ln;
    i := 3;
    REPEAT
      Out.String("counting down... "); Out.Int(i, 0); Out.Ln;
      IF i = 0 THEN
        FOR x := 0 TO 15000000 DO END; (* about one second duration *)
        i := 4;
        Out.String("continuing"); Out.Ln
      END;
      DEC(i);
      Oberon.NextProc
    UNTIL FALSE
  END p1c;
    
  PROCEDURE Run*;
  BEGIN
    Oberon.InstallProc(p1)
  END Run;
  
  PROCEDURE CmdLong*;
    VAR x: INTEGER;
  BEGIN
    Out.String("long duration command... just wait"); Out.Ln;
    FOR x := 0 TO 15000000 DO END;
    Out.String("continuing"); Out.Ln
  END CmdLong;
  
BEGIN
  NEW(p1); Oberon.InitProc(p1, p1c, s1, 1000, 0)
END TestWatchdog.

Executing Run results in the following output:

trigger watchdog
counting down... 3
counting down... 2
counting down... 1
counting down... 0

WATCHDOG

RESTART

Oberon RTS 2020-07-07
Based on Embedded Oberon 2019-07-01
RISC5 version:  0D010005
System start
System status:  00040002
Starting scheduler

Executing CmdLong does not trigger the watchdog, as it is disabled while a command is running.


  [1] Easy with a Cortex M3 processor.