Stack Overflow Detection and Handling

Introduction

(Embedded) Oberon uses three major memory regions:

  • module space, from address 20H (MTOrg) up to Kernel.stackOrg - 8000H;
  • workspace, or stack, from Kernel.stackOrg - 8000H up to Kernel.stackOrg;
  • dynamic space, or heap, from Kernel.stackOrg up to Kernel.MemLim.
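
The layout above can be expressed as a small address classifier. This is only an illustrative sketch in Python, not code from the system: the values chosen for STACK_ORG and MEM_LIM are hypothetical placeholders for Kernel.stackOrg and Kernel.MemLim, and treating each region's upper bound as exclusive is an assumption.

```python
# Illustrative model of the three memory regions.
MT_ORG = 0x20          # start of module space (from the text)
STACK_SIZE = 0x8000    # hardcoded stack size, 32 kB (from the text)
STACK_ORG = 0x80000    # hypothetical stand-in for Kernel.stackOrg
MEM_LIM = 0xC0000      # hypothetical stand-in for Kernel.MemLim


def region_of(addr):
    """Return the name of the memory region that contains addr."""
    stack_bottom = STACK_ORG - STACK_SIZE
    if MT_ORG <= addr < stack_bottom:
        return "module space"
    if stack_bottom <= addr < STACK_ORG:
        return "stack"
    if STACK_ORG <= addr < MEM_LIM:
        return "heap"
    return "outside"
```

With these placeholder values, the stack occupies the 32 kB directly below STACK_ORG, and any address just below that range already belongs to module space.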

Modules.Load ensures that no module space is allocated inside the stack memory region. However, there is no protection against using more stack space than conceptually allocated. That is, the stack can overflow into module space and, depending on the specific situation, corrupt the system to a more or less dramatic and obvious extent. More dramatic would mean that the system and its programs immediately and “visibly” stop working as intended. Less dramatic would mean that the corruption is subtle and might not have immediate and obvious consequences, but could nonetheless be serious, for example through faulty results from control algorithms and laws.

The stack space is actually hardcoded to 8000H, ie. 32 kB, which seems sufficient, or even plenty, for many practical control applications. However, it would be easy to make this value configurable, and, for a more RAM-restricted use case, smaller, hence making stack overflow detection more relevant for run-time safety.

Stack overflow may or may not be a potential issue for a specific control program. One can argue that a well-tested program will not be in any danger. Depending on the operating environment and conditions, rogue errors that could lead to run-away tasks, for example due to voltage spikes on sensor connections, may not be relevant. In such a use case, stack overflow detection and handling can be omitted.

Stack Overflow Detection

Software

Basically, the CPU itself does not have any notion of a stack, or the memory layout in general. Just as Modules.Load checks for the upper memory limit upon loading a module, or Kernel.New checks for the upper limit when allocating dynamic memory, the value of the stack pointer could be checked in software in the prologue of every procedure, which is the only point where the stack pointer value is decreased in the direction of module space and can potentially cause a stack overflow (in a “normal” program, ie. not for example an error handler). This would be costly as regards run-time overhead.
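
What such a software check in the prologue would amount to can be modelled as follows. This is a sketch in Python for illustration only, not compiler output or system code: the names prologue and StackOverflow are hypothetical, STACK_ORG is a placeholder for Kernel.stackOrg, and the frame size stands for whatever the procedure needs for the link register, parameters and local variables.

```python
# Illustrative model of a stack limit check in every procedure prologue.
STACK_ORG = 0x80000                 # hypothetical stand-in for Kernel.stackOrg
STACK_LIMIT = STACK_ORG - 0x8000    # lowest address that belongs to the stack


class StackOverflow(Exception):
    """Raised when a new stack frame would reach into module space."""


def prologue(sp, frame_size):
    """Allocate a frame of frame_size bytes and return the new stack pointer.

    The comparison is the extra work every call would have to do.
    """
    new_sp = sp - frame_size
    if new_sp < STACK_LIMIT:
        raise StackOverflow(f"SP {new_sp:#x} below limit {STACK_LIMIT:#x}")
    return new_sp
```

The check itself is a single comparison, but it would be executed on every procedure call, which is where the run-time overhead comes from.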

Hardware

However, there’s the FPGA: a hardware-based continuous stack pointer check does not come with any run-time overhead. The stack pointer value is simply continuously compared to a limit, and when stepping across that limit, the system is reset, or an interrupt is triggered which then allows the software to deal with the situation.

Stack Overflow Handling

Overview

As outlined above, the stack pointer value is decreased via the procedure prologue. In case a stack overflow is detected, ie. the stack pointer value assumes an address value inside the module space, we cannot immediately stop the CPU from executing the prologue. Even in the simplest case, a procedure without parameters and local variables, the link register is stored on the stack, hence potentially corrupting the code and data in module space. As we don’t know what the overwritten module space contents were, we cannot reverse the corruption after the stack overflow occurred and was detected. Hence, if the stack pointer points into module space, the only feasible solution is to restart the system.

If we want to avoid this simple, but drastic corrective measure, we need to detect a potential stack overflow earlier. This can be achieved by defining a hot zone right above the limit of the module space. Now, when the stack pointer reaches into this hot zone, subtler corrective measures can be taken, such as killing or re-initialising the offending task, without affecting other tasks. Obviously, the corrective measures themselves, which run in the same stack as the offending task handler, must not cause the stack pointer to reach into module space, else the situation as outlined above occurs.

The hot zone reduces the actually available stack space by its size, which is the price to pay for the additional run-time safety of the stack overflow detection.

The size of the hot zone depends on the control program, its task handlers and the called procedures and their stack space allocations, and the call-chains. Obviously, procedures called “high up” in the stack will never cause a stack overflow.

The stack space needed for the error handling adds to the hot zone size. This error handling stack space can be reduced to as little as 12 bytes. The current system, used for development and testing, uses a hot zone size of 512 bytes.

Current Solution

Hardware

The current solution in the FPGA is structured as follows:

  • The value of CPU register 14, ie. the stack pointer register, is “pulled out” of the RISC5 CPU and made available in RISC5Top.v.
  • In RISC5Top.v, we have two registers that can be written from software,
    • one for the upper address of the hot zone
    • one for the absolute lower stack pointer value limit
  • If the stack pointer value
    • goes below the upper hot zone address, an interrupt is triggered;
    • goes below the absolute limit, the processor is reset.

Note that a processor reset will happen anyway, should the corrective measures after reaching into the hot zone decrease the stack pointer value below the absolute limit.
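
The behaviour of the two comparisons can be summarised in a small model. This is a Python sketch of the logic only, not the Verilog in RISC5Top.v: the function name check_sp and its parameters are hypothetical, and giving the reset priority over the interrupt when both conditions hold is an assumption.

```python
# Illustrative model of the two-limit stack pointer check.
def check_sp(sp, hot_zone_top, abs_limit):
    """Return the action the monitor takes for a given stack pointer value.

    hot_zone_top: upper address of the hot zone (software-writable register)
    abs_limit:    absolute lower stack pointer limit (software-writable register)
    """
    if sp < abs_limit:
        return "reset"       # SP is in module space: only a restart is safe
    if sp < hot_zone_top:
        return "interrupt"   # SP is in the hot zone: software may still recover
    return "ok"
```

A stack pointer between the two limits leads to the interrupt, and one below the absolute limit leads to the reset, including the case where the corrective measures themselves push the stack pointer that far down.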

Software

The interrupt triggered by reaching into the hot zone is of the “non-returning” type, and will cause the abort handler to be invoked, where the error can be handled.

Which Limit?

The stack space limits can be set to two conceptually different values:

  • The fixed value for the allocated stack space, ie. Kernel.stackOrg - 8000H.
  • The dynamic value of the currently used module space, indicated by Modules.AllocPtr.

The reasoning for using the dynamic value of Modules.AllocPtr would be to allow the stack to use all the currently available memory, even from the unused module space. However, this could result in a situation where a specific task handler sometimes works, sometimes not, depending on other loaded modules. Also, to be consistent, Modules.Load would now need to check for free space against the current stack pointer, and possibly be allowed to capture memory in the stack space. Too much dynamics – especially for a control system – and possible headache situations, for little or no real practical advantages.

Consequently, stack overflow is checked against the fixed stack size value. The two limits – absolute module space limit, as well as the hot zone limit – are configured upon system startup.
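
The two limit values follow directly from the fixed stack size and the hot zone size. The following Python sketch shows the arithmetic only, it is not the actual startup code: the function stack_limits is hypothetical, and the value passed for the stack origin in the usage example is a placeholder for Kernel.stackOrg.

```python
# Illustrative calculation of the two limits configured at startup.
STACK_SIZE = 0x8000     # fixed stack size (from the text)
HOT_ZONE_SIZE = 512     # hot zone size of the current system (from the text)


def stack_limits(stack_org, stack_size=STACK_SIZE, hot_zone_size=HOT_ZONE_SIZE):
    """Return (absolute limit, hot zone upper address, usable stack bytes)."""
    abs_limit = stack_org - stack_size         # boundary to module space
    hot_zone_top = abs_limit + hot_zone_size   # interrupt threshold
    usable = stack_org - hot_zone_top          # stack left above the hot zone
    return abs_limit, hot_zone_top, usable
```

For a placeholder stack origin of 80000H, stack_limits(0x80000) yields an absolute limit of 78000H, a hot zone upper address of 78200H, and 7E00H (32256) bytes of stack that can be used without triggering the interrupt.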

Limit Tracking

For development and possibly operation, an additional register in the hardware tracks the lowest stack pointer value that has occurred since reset, which allows the stack usage to be monitored from software.
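
The tracking amounts to a running minimum over all stack pointer values. The Python sketch below models that idea for illustration. The class StackMinTracker is hypothetical, and initialising the tracked value to the stack origin at reset is an assumption, as the text does not state the register's reset value.

```python
# Illustrative model of the lowest-stack-pointer tracking register.
class StackMinTracker:
    def __init__(self, stack_org):
        self.stack_org = stack_org
        self.min_sp = stack_org      # assumed reset value: empty stack

    def observe(self, sp):
        """Record a stack pointer value, keeping only the lowest one seen."""
        if sp < self.min_sp:
            self.min_sp = sp

    def max_usage(self):
        """Return the largest stack depth in bytes seen since reset."""
        return self.stack_org - self.min_sp
```

Feeding the tracker the values 7FFF0H, 7FF00H and 7FFC0H for a placeholder stack origin of 80000H leaves min_sp at 7FF00H, ie. a maximum usage of 256 bytes.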