Embedded Oberon
The Oberon compiler inserts the following run-time checks, resulting in traps if violated:
- array index out of bounds
- type guard failure
- array or string copy overflow
- access via NIL pointer
- illegal procedure call
- integer division by zero
- assertion violation
The trap number corresponds to the item number in this list. The trap handler in System.mod prints a message, and calls Oberon.Reset. If the trap was caused by a task, it is removed from the task list, hence not invoked again by the Loop. Oberon.Reset also resets the Loop, resetting the stack pointer to its startup value.
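Since the trap number is simply the position in the list above, the mapping can be written down directly. The following is a minimal sketch only: the module TrapNames and its procedure Get are hypothetical names introduced for illustration, and are not part of Embedded Oberon or System.mod.

```oberon
MODULE TrapNames;  (* hypothetical helper for illustration, not part of Embedded Oberon *)

  (* Map a trap number to the text of the corresponding list item.
     The caller's array must be large enough for the longest text,
     otherwise the assignment itself causes a copy overflow trap. *)
  PROCEDURE Get*(trap: INTEGER; VAR name: ARRAY OF CHAR);
  BEGIN
    IF trap = 1 THEN name := "array index out of bounds"
    ELSIF trap = 2 THEN name := "type guard failure"
    ELSIF trap = 3 THEN name := "array or string copy overflow"
    ELSIF trap = 4 THEN name := "access via NIL pointer"
    ELSIF trap = 5 THEN name := "illegal procedure call"
    ELSIF trap = 6 THEN name := "integer division by zero"
    ELSIF trap = 7 THEN name := "assertion violation"
    ELSE name := "unknown trap"
    END
  END Get;

END TrapNames.
```

With this mapping, the TRAP 6 reported in the demo output further down reads as an integer division by zero.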
Assessment
This behaviour is reasonable and sufficient for a user-supervised system running in a controlled environment, such as an office or lab. It is not sufficient for an unsupervised control system, which often runs in a more “hostile” environment such as a factory floor. Peripheral devices may be connected via cables outside the controller’s housing, for example. Connection issues, or glitches in data transmission, can result in missing or erroneous data, and thus in run-time errors that are not due to the programmed logic.
Also, simply removing one process might leave the overall system in an inconsistent or inoperable state, where essential functions of the control process are not executed anymore, which then results in errors or malfunctions in the controlled system.
A Basic Error Recovery Approach
Just like Embedded Oberon attempts to get the system into a stable and predictable state again by removing the faulty task, and restarting the Loop, Oberon RTS should attempt to get the control program and its processes into a stable and predictable state.
Considering that run-time errors can be caused by rare, hazardous events in the processor’s environment, as outlined above, the following recovery attempts can be pursued by Oberon RTS:
- reset and restart the faulty process
- reset and restart the whole system and control program
Faulty processes could also simply be shut down if they are not essential for the system’s operation, e.g. a process showing the current time. Even a process driving a display might not be essential for the control process proper, in particular in an error situation.
Of course, we also need a mechanism that stops repeated process or system restarts within a defined period of time, as well as logging and alarm facilities so that an operator can get on-site to investigate and fix the issue.
To get off the ground with a basic solution, let’s focus on the moment when the system is in panic mode, trying to achieve a stable state, without lots of bells and whistles.
Process Reset and Restart
A control process usually holds some state, be it implemented by module variables for a set of tasks that constitute the process, or by a coroutine itself. Upon a run-time error, a process should be reset to get it into a defined state (in the literal sense).
The trap handler calls Oberon.Reset, which is the entry point to begin with the recovery procedure. Oberon.Reset is also called from the abort handler System.Abort. A trap stems from a run-time error, while an abort signifies a user interaction, so the two may call for different handling strategies. Oberon.Reset therefore gets an integer parameter to identify the origin of the call, which requires a small corresponding change in System.mod.
To avoid endlessly resetting and restarting the same process, in case the problem persists, each process gets a restart counter. Should that restart counter exceed a maximum limit, as a next step to get the system stable, the system is reset and restarted. The on-startup facility to autoload modules (to be described elsewhere) then restarts the whole control program.
Here’s the gist of the corresponding code:
MODULE Oberon;
  PROCEDURE Reset*(origin: INTEGER);
    VAR p1: Process;
  BEGIN
    IF origin = 0 THEN (* trap *)
      IF (cp # NIL) & (cp.state = Active) THEN (* it's a process *)
        p1 := cp;
        RemoveProc(p1); (* also resets the process *)
        IF p1.numRestarts < MaxNumRestarts THEN
          InstallProc(p1); INC(p1.numRestarts)
        ELSE
          SysCtrl.IncNumRestarts; (* count the number of system restarts, see below *)
          Restart
        END
      END
    ELSE
      (* handle reset via abort *)
    END;
    SYSTEM.LDREG(14, Kernel.stackOrg);
    Loop
  END Reset;
END Oberon.
As of now, there is no facility to decrease the restart counter of a process over time. To be added.
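One possible shape for such a facility, purely as an illustration and not what Oberon RTS does: clear the counter once the process has run for a while without a restart. Everything in this sketch is an assumption, including the module name RestartDecay, the record type Counter, the value of QuietPeriod, and the idea that time is given in milliseconds; timer wrap-around is ignored.

```oberon
MODULE RestartDecay;  (* hypothetical sketch; not part of Oberon RTS *)

  CONST QuietPeriod* = 60000;  (* assumed: ms without a restart before the counter is cleared *)

  TYPE Counter* = RECORD
      numRestarts*: INTEGER;  (* plays the role of numRestarts in Oberon.Reset *)
      lastRestart*: INTEGER   (* time of the most recent restart, in ms *)
    END;

  PROCEDURE Init*(VAR c: Counter);
  BEGIN
    c.numRestarts := 0; c.lastRestart := 0
  END Init;

  (* To be called where the process is re-installed after a trap. *)
  PROCEDURE NoteRestart*(VAR c: Counter; now: INTEGER);
  BEGIN
    INC(c.numRestarts); c.lastRestart := now
  END NoteRestart;

  (* To be called periodically; forgets old restarts after a quiet period. *)
  PROCEDURE Decay*(VAR c: Counter; now: INTEGER);
  BEGIN
    IF (c.numRestarts > 0) & (now - c.lastRestart >= QuietPeriod) THEN
      c.numRestarts := 0
    END
  END Decay;

END RestartDecay.
```

Clearing the counter in one step is the simplest choice; decrementing it by one per quiet period would be a more conservative variant.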
System Reset and Restart
Embedded Oberon does not have functionality to reset and restart the whole system from software.
Manually pressing the abort button on the target board resets the RISC5 CPU and peripherals via the rst line, which causes the CPU to start executing the bootloader as described here. However, due to the check SYSTEM.REG(LNK) = 0 in the body of the bootloader, the system software will not be reloaded from the SD card. Instead, execution continues by directly calling the abort handler in System.mod, installed at address 0, which calls Oberon.Reset, as described above. (Address 0 would be the entry point to the body of Modules if it had just been loaded from the boot file.)
To implement a system reset and reload, it seemed best to stay in-line and compatible with the abort procedure:
- reset: invoke a RISC5 processor reset in the FPGA, just as the abort button does,
- reload: but don’t skip reloading of the system software from the SD card.
To configure and control the reset and startup process, the System Control Register is added to the FPGA. It is accessed by module SysCtrl.mod. The System Control Register is also where the number of system restarts is counted – see SysCtrl.IncNumRestarts above.
Reset
Bit [0] of the System Control Register controls the system reset.
module RISC5Top
  /* ... */
  // system control register
  reg [23:0] sysCtrlReg = 24'b0;
  always @(posedge clk) begin
    sysCtrlReg <= ~rst ? {sysCtrlReg[23:1], 1'b0} : (wr & (ioenb & iowadr == 239)) ? outbus[23:0] : sysCtrlReg;
  end
  // reset
  wire rstSig = (cnt[4:0] == 0) & limit; // limit is the 1ms timer output
  wire rstTrig = ~(btn[3] | sysCtrlReg[0]);
  always @(posedge clk) begin
    rst <= rstSig ? rstTrig : rst;
  end
endmodule
Setting sysCtrlReg[0] to logic one will invoke the reset, and it will be set back to zero upon reset. This mimics the abort button press, that is, the CPU will start to execute the bootloader.
Reload
In order to keep all the reset and reload logic together, in lieu of using the link register to determine whether the system should be reloaded from the SD card, the System Control Register defines two bits:
- bit [1]: if set, skip (re-)loading the system files
- bit [2]: if set, skip the initialisation of the SD card
By using the “skip” logic, the System Control Register initialised to zero results in the same behaviour as with Embedded Oberon.
The possibility to skip the re-initialisation of the SD card is probably not required, as the card must accept an initialisation sequence also when in the initialised state, according to the SD specs. However, my experience with SD cards from different vendors shows that the cards can be pretty capricious, and this feature might come in handy, so I left it there for now.
The bootloader checks the System Control Register:
MODULE* BootLoad;
  (* ... *)
  CONST
    SysCtrlRegAdr = -68;
    SkipReload = 1;
    SkipCardInit = 2;
  (* ... *)
BEGIN
  (* ... *)
  IF ~SYSTEM.BIT(SysCtrlRegAdr, SkipReload) THEN
    IF ~SYSTEM.BIT(SysCtrlRegAdr, SkipCardInit) THEN
      InitSPI
    END;
    LoadFromDisk
  END;
  (* ... *)
END BootLoad.
With all this in place, we can now implement a software-initiated reset and reload of the system.
Oberon.Restart
Oberon.Restart is called from Oberon.Reset, as outlined above. It can also be executed as a command.
MODULE Oberon;
  PROCEDURE Restart*;
    VAR x: SET;
  BEGIN
    Texts.WriteLn(W); Texts.WriteString(W, "RESTART"); Texts.WriteLn(W);
    SysCtrl.GetReg(x);
    (* request the reset, and make sure neither reload nor card init is skipped *)
    SysCtrl.SetReg(x + {SysCtrl.Reset} - {SysCtrl.SkipLoad, SysCtrl.SkipCardInit});
    REPEAT UNTIL FALSE (* wait for the hardware reset to take effect *)
  END Restart;
END Oberon.
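Oberon.Restart relies on a few items exported by SysCtrl.mod, which is not listed in this text. As a rough sketch of what the register access could look like, assuming the register is read and written as a whole word at the same address the bootloader uses (-68), and using the bit numbers given above: the module name SysCtrlSketch is made up to avoid confusion with the real module, and the restart counter is left out because its position in the register is not described here.

```oberon
MODULE SysCtrlSketch;  (* illustration only; the actual SysCtrl.mod may differ *)
  IMPORT SYSTEM;

  CONST
    RegAdr = -68;       (* same address as SysCtrlRegAdr in BootLoad *)
    Reset* = 0;         (* bit [0]: trigger a system reset *)
    SkipLoad* = 1;      (* bit [1]: skip (re-)loading the system files *)
    SkipCardInit* = 2;  (* bit [2]: skip the initialisation of the SD card *)

  (* Read the whole register as a set of bits. *)
  PROCEDURE GetReg*(VAR x: SET);
  BEGIN
    SYSTEM.GET(RegAdr, x)
  END GetReg;

  (* Write the whole register. *)
  PROCEDURE SetReg*(x: SET);
  BEGIN
    SYSTEM.PUT(RegAdr, x)
  END SetReg;

END SysCtrlSketch.
```

Treating the register as a SET keeps the bit manipulation in Oberon.Restart readable: set union and difference express "set this bit, clear those" without any masking arithmetic.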
Error Recovery Revisited
Above, we have this list of error recovery attempts:
- Reset and restart the faulty process
- Reset and restart the whole system and control program
Oberon.Reset implements the first step. It also counts the number of resets for a faulty process, and initiates a system reset and reload if that number exceeds a fixed limit (same for all processes for now).
Oberon.Restart implements the second step – partly. It resets the RISC5 processor via FPGA logic, which runs the bootloader, which in turn reloads the system files from the SD card. It does not reload the control programs, though. For this, we’ll need another small extension of Embedded Oberon, automatic program start upon system start, to be described elsewhere.
System Restart Counter
Oberon.Reset counts the number of system reloads using SysCtrl.IncNumRestarts. Note that only system reloads caused by run-time errors are counted this way, not the ones caused by executing Oberon.Restart from the UI.
With the above reset and restart mechanics, a persistent error in a process results in endless system reloads. With the number of system restarts stored in the System Control Register, which survives a reload, this can be stopped. The place to check for repeated restarts is the body of Oberon, which is the entry point for the Outer Core. The required behaviour in case the system and the control programs cannot be stabilised by one or more system reloads is application specific.
For now, Oberon’s body just halts the system if there are too many system restarts.
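The check itself is small. As a sketch under stated assumptions: the module RestartLimit, the limit of three, and both procedure names are invented for illustration, and how the restart count is read from the System Control Register is not shown, since that part of SysCtrl.mod is not described in this text.

```oberon
MODULE RestartLimit;  (* hypothetical sketch; names and limit are assumptions *)

  CONST MaxSysRestarts* = 3;  (* assumed limit, not taken from Oberon RTS *)

  (* TRUE if the restart count found at startup has reached the limit. *)
  PROCEDURE Exceeded*(numRestarts: INTEGER): BOOLEAN;
  BEGIN
    RETURN numRestarts >= MaxSysRestarts
  END Exceeded;

  (* Stop for good. An application would typically raise an alarm
     or drive its outputs to a safe state before getting here. *)
  PROCEDURE Halt*;
  BEGIN
    REPEAT UNTIL FALSE
  END Halt;

END RestartLimit.
```

In the body of Oberon this could be used along the lines of IF RestartLimit.Exceeded(n) THEN RestartLimit.Halt END, with n being the count obtained from the register, before the scheduler is started.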
Watchdog
Another run-time error condition that should be detected is a stuck process, that is, one executing an infinite loop without yielding control, thus bringing the whole system and control program to a halt.
To detect a stuck process, the FPGA is extended with a simple watchdog, that is, a timer that must be reset from software before it expires; otherwise it initiates a hardware-based action.
The watchdog triggers the RISC5 interrupt, and a handler in Oberon.mod takes over. Apart from writing a message to the console, it simply restarts the system using Oberon.Restart, also incrementing the system restart counter.
A more fine-grained error recovery would be to just reset the interrupted process, which is the stuck one, analogous to handling traps. But the interrupt routine would return to the point of interrupt, which is not a reasonable address: first, the process was just reset, and, second, without reset, the return would be to the point in code that caused the issue in the first place, i.e. the infinite loop.
If the return address of the interrupt handler could be changed by the handler itself,[1] it would be possible to kill and reset the faulty process, and return to the Loop, but that would imply a change to the RISC5 CPU. Not going there for now.
The watchdog is accessed via WatchDog.mod.
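MODULE WatchDog;
  IMPORT SYSTEM;
  CONST
    WatchDogAdr = -100;
    Timeout = 100; (* ms *)
  (* restart the watchdog timer with the full timeout *)
  PROCEDURE* Reset*;
  BEGIN
    SYSTEM.PUT(WatchDogAdr, Timeout)
  END Reset;
  (* a timeout value of zero disables the watchdog *)
  PROCEDURE* Disable*;
  BEGIN
    SYSTEM.PUT(WatchDogAdr, 0)
  END Disable;
END WatchDog.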
The watchdog is integrated into Oberon.Loop. It is disabled during command and upload handling.
MODULE Oberon;
  PROCEDURE Loop;
  BEGIN
    IF Console.Available() THEN
      WatchDog.Disable;
      (* command and upload handling *)
    ELSE
      WatchDog.Reset;
      (* process scheduling *)
    END
  END Loop;
END Oberon.
Peripheral Device Timeouts
Yet another error condition is a peripheral device that does not reply within a timeout limit, including not replying at all (anymore). Timeouts in the context of devices are described here.
Device timeouts are reported back to the client process, which can take corrective measures, or simply ASSERT the error condition, which then results in a trap, entering the error recovery as described above.
Other Error Detection
Lastly, there is yet another error condition: a process that never runs (anymore). In software, this can be taken care of by an audit process. An FPGA-based approach needs some more thinking.
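To make the audit idea a bit more concrete, here is a minimal sketch based on heartbeat counters: each supervised process bumps its counter whenever it runs, and the audit process periodically checks whether every counter has moved. This is an assumption about how such an audit could work, not something Oberon RTS implements; the module Audit, the fixed number of processes, and the integer process ids are all hypothetical.

```oberon
MODULE Audit;  (* hypothetical sketch; not part of Oberon RTS *)

  CONST NumProcs* = 4;  (* assumed number of supervised processes *)

  VAR
    beat: ARRAY NumProcs OF INTEGER;  (* incremented by each process when it runs *)
    seen: ARRAY NumProcs OF INTEGER;  (* value of beat at the previous check *)
    i: INTEGER;

  (* Called by process 'id' (0 .. NumProcs - 1) once per activation. *)
  PROCEDURE Alive*(id: INTEGER);
  BEGIN
    INC(beat[id])
  END Alive;

  (* Called periodically by the audit process. Returns the lowest id of a
     process that has not called Alive since the previous check, or -1 if
     all of them have. *)
  PROCEDURE Check*(): INTEGER;
    VAR k, stalled: INTEGER;
  BEGIN
    stalled := -1;
    FOR k := NumProcs - 1 TO 0 BY -1 DO
      IF beat[k] = seen[k] THEN stalled := k END;
      seen[k] := beat[k]
    END
    RETURN stalled
  END Check;

BEGIN
  FOR i := 0 TO NumProcs - 1 DO beat[i] := 0; seen[i] := 0 END
END Audit.
```

The audit period must be longer than the longest regular period of any supervised process, otherwise healthy processes are reported as stalled. The sketch also cannot detect that the audit process itself has stopped running, which is one reason a hardware-based approach remains attractive.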
Demo Trap Handling
MODULE TestTraps;
  IMPORT Out, Oberon;
  VAR
    p1: Oberon.Process;
    s1: ARRAY 1024 OF BYTE;
  PROCEDURE p1c;
    VAR i, k, now: INTEGER;
  BEGIN
    now := Oberon.Time();
    k := 2;
    Out.String("start p1"); Out.Ln;
    REPEAT
      Out.String("p1"); Out.Int(now, 10); Out.Ln;
      DEC(k);
      i := k DIV k; (* trap: division by zero once k reaches 0 *)
      Oberon.NextProc;
      now := Oberon.Time()
    UNTIL FALSE
  END p1c;
  PROCEDURE Run*;
  BEGIN
    Oberon.InstallProc(p1)
  END Run;
BEGIN
  NEW(p1); Oberon.InitProc(p1, p1c, s1, 1000, 0)
END TestTraps.
The demo program creates this output in the Astrobe console:
start p1
p1 212208
p1 213208
TRAP 6 at pos 368 in TestTraps at 000126D8
Starting scheduler
start p1
p1 213214
p1 214214
TRAP 6 at pos 368 in TestTraps at 000126D8
Starting scheduler
start p1
p1 214220
p1 215220
TRAP 6 at pos 368 in TestTraps at 000126D8
Starting scheduler
start p1
p1 215226
p1 216226
TRAP 6 at pos 368 in TestTraps at 000126D8
RESTART
Oberon RTS 2020-07-07
Based on Embedded Oberon 2019-07-01
RISC5 version: 0D010005
System start
System status: 00010002
Starting scheduler
The maximum number of process restarts is set to three. Note that the scheduler is also reset and restarted, as with all Oberon.Reset operations.
Watchdog Demo
MODULE TestWatchdog;
  IMPORT Out, Oberon;
  VAR
    p1: Oberon.Process;
    s1: ARRAY 1024 OF BYTE;
  PROCEDURE p1c;
    VAR i, x: INTEGER;
  BEGIN
    Out.String("trigger watchdog"); Out.Ln;
    i := 3;
    REPEAT
      Out.String("counting down... "); Out.Int(i, 0); Out.Ln;
      IF i = 0 THEN
        FOR x := 0 TO 15000000 DO END; (* about one second duration *)
        i := 4;
        Out.String("continuing"); Out.Ln
      END;
      DEC(i);
      Oberon.NextProc
    UNTIL FALSE
  END p1c;
  PROCEDURE Run*;
  BEGIN
    Oberon.InstallProc(p1)
  END Run;
  PROCEDURE CmdLong*;
    VAR x: INTEGER;
  BEGIN
    Out.String("long duration command... just wait"); Out.Ln;
    FOR x := 0 TO 15000000 DO END;
    Out.String("continuing"); Out.Ln
  END CmdLong;
BEGIN
  NEW(p1); Oberon.InitProc(p1, p1c, s1, 1000, 0)
END TestWatchdog.
Executing Run results in the following output:
trigger watchdog
counting down... 3
counting down... 2
counting down... 1
counting down... 0
WATCHDOG
RESTART
Oberon RTS 2020-07-07
Based on Embedded Oberon 2019-07-01
RISC5 version: 0D010005
System start
System status: 00040002
Starting scheduler
Executing CmdLong does not trigger the watchdog, as it is disabled while a command is running.
[1] Easy with a Cortex M3 processor.