When an unexpected or unrecoverable fault occurs, the worst outcome is for the system to hang. That’s not only a poor user experience; it can also create unsafe conditions. The goal is to ensure that, even in failure, the system transitions into a safe state that minimizes risk to people, property, and the device itself.
The Instrument Core includes a Safety System designed with this in mind. I use words like “designed” and “attempts” deliberately, because some failure modes lie beyond what software alone can address. For instance, if the RP2350 suffers a hardware fault, it may no longer be capable of running any code at all. True safety, therefore, must extend beyond software into the physical design of the system itself.
Scope of the Safety System
There’s a lot we can do in code. Our Safety System is designed to handle the following situations:
- Unexpected runtime conditions, such as memory corruption, heap and stack overflows, and panics, that occur outside the system's normal execution flow.
- Hangs and crashes where one core stops running or tasks become starved of CPU time to the point that they no longer function properly.
- Unrecoverable errors triggered when the system encounters a condition incompatible with proper operation (for example, an external device failing to initialize or ceasing to respond).
In all these cases, faults should be captured by a central handler that immediately places the entire system in a known-safe state, gathers diagnostic information (e.g., fault location, cause, and system state at the time), and then performs a controlled reboot to attempt recovery.
However, that alone isn’t sufficient. If we allow unlimited reboots, a persistent fault could create an infinite restart loop, potentially causing further damage to the device or connected components. To prevent this, the system must track reboot counts and enter a lockout state once a defined limit is reached, requiring a physical power cycle (ideally performed after the underlying fault has been addressed).
Finally, if the system runs normally after a reboot for long enough that any two failures can be considered unrelated, the Safety System should reset the reboot counter to avoid accidental lockouts caused by isolated or transient errors.
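To make that concrete, here is a minimal sketch of how such reboot accounting could look. The structure, names, and threshold are illustrative assumptions rather than the Instrument Core's actual implementation; the only real requirement is that the counter live in RAM the bootloader leaves untouched.

```cpp
#include <cstdint>

namespace demo {

constexpr uint32_t kMaxConsecutiveFaults = 3;  // lockout threshold (assumed)

// Lives in a RAM section the bootloader does not clear, so it survives resets.
struct FaultLedger {
    uint32_t magic;        // validity marker; garbage after a cold power-up
    uint32_t rebootCount;  // consecutive fault-triggered reboots
};

// Placed in uninitialized RAM by the linker (see the persistence sketch later on).
extern FaultLedger g_ledger;

// Called early at boot, before any component is activated.
bool shouldEnterLockout() {
    if (g_ledger.magic != 0x54373643) {   // first boot after a physical power cycle
        g_ledger.magic = 0x54373643;
        g_ledger.rebootCount = 0;
    }
    return ++g_ledger.rebootCount > kMaxConsecutiveFaults;
}

// Called once the system has run cleanly for long enough that a new fault
// would be unrelated to the last one (e.g. from a one-shot FreeRTOS timer).
void markSystemHealthy() {
    g_ledger.rebootCount = 0;
}

} // namespace demo
```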
Trapping system faults
As you can see, there’s more to the Safety System than meets the eye. The Instrument Core begins by trapping various SDK and FreeRTOS calls that signal anomalous conditions, including:
- vApplicationMallocFailedHook() — triggered on FreeRTOS memory allocation failures. Since all memory allocation is delegated to FreeRTOS, this also covers malloc() and new.
- vApplicationStackOverflowHook() — invoked when a stack overflow occurs.
- isr_hardfault() — called when a hard fault occurs at the Cortex-M33 core level. There are also more specific fault handlers (isr_memmanage(), isr_busfault(), and isr_usagefault()), which fall back to the hard fault handler if not implemented.
- assert() — both from standard C code and FreeRTOS internals.
- abort() — the traditional C/C++ function for unrecoverable conditions.
When any of these hooks are triggered, they invoke the Safety System with details about the nature of the fault. The Safety System then captures as much diagnostic data as possible, such as task name, stack contents, heap state, and so on, and stores it in a persistent area of RAM before issuing a controlled reset.
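As a rough illustration, the FreeRTOS hooks might simply forward into the central handler. Only the hook names and the existence of the handler come from the actual system; the FaultKind enum and the handleFault() signature below are assumptions added to keep the sketch self-contained.

```cpp
#include "FreeRTOS.h"
#include "task.h"

// Stand-ins for the real Safety System's fault categories and handler signature,
// which the post doesn't spell out.
namespace T76::Sys::Safety {
    enum class FaultKind { StackOverflow, OutOfMemory };
    [[noreturn]] void handleFault(FaultKind kind, const char *detail);
}

extern "C" void vApplicationStackOverflowHook(TaskHandle_t task, char *taskName) {
    (void)task;
    // Hand the offending task's name to the Safety System, which records the
    // diagnostics in persistent RAM and performs a controlled reset.
    T76::Sys::Safety::handleFault(T76::Sys::Safety::FaultKind::StackOverflow, taskName);
}

extern "C" void vApplicationMallocFailedHook(void) {
    // Allocation failures from pvPortMalloc(), malloc(), and new all land here.
    T76::Sys::Safety::handleFault(T76::Sys::Safety::FaultKind::OutOfMemory,
                                  pcTaskGetName(nullptr));
}
```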
Dealing with hangs and crashes
The RP2350 includes a hardware watchdog that automatically reboots the CPU unless it’s “fed” periodically. The Instrument Core’s Safety System uses this feature to handle cases where either core hangs or enters an infinite loop.
On core 0, a low-priority FreeRTOS task is created whose sole responsibility is to feed the hardware watchdog. If any higher-priority task hangs, or if the system becomes so overloaded that the feeder task can’t run often enough, the watchdog will expire and trigger a reboot.
Since there’s only one hardware watchdog, and core 1 runs outside FreeRTOS, it must feed a virtual watchdog managed by the feeder task on core 0. If core 1 stops responding, the feeder task will detect the missed updates, stop feeding the hardware watchdog, and once again allow a reset to occur.
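Here is a sketch of that feeder pattern, assuming a shared heartbeat counter stands in for the virtual watchdog; the timing values and names are illustrative, not the Instrument Core's actual code.

```cpp
#include "FreeRTOS.h"
#include "task.h"
#include "hardware/watchdog.h"
#include <atomic>
#include <cstdint>

static std::atomic<uint32_t> g_core1Heartbeat{0};   // bumped by core 1's main loop

// Core 1 calls this from its own loop instead of touching the hardware watchdog.
void core1FeedVirtualWatchdog() {
    g_core1Heartbeat.fetch_add(1, std::memory_order_relaxed);
}

// Low-priority FreeRTOS task on core 0: it only feeds the hardware watchdog
// while core 1's heartbeat keeps advancing. If either core stalls, or the
// feeder itself is starved, the hardware watchdog expires and resets the chip.
static void watchdogFeederTask(void *) {
    watchdog_enable(/*delay_ms=*/2000, /*pause_on_debug=*/true);
    uint32_t lastSeen = g_core1Heartbeat.load(std::memory_order_relaxed);

    for (;;) {
        uint32_t now = g_core1Heartbeat.load(std::memory_order_relaxed);
        if (now != lastSeen) {          // core 1 is still alive
            lastSeen = now;
            watchdog_update();
        }                               // otherwise: stop feeding and let it bite
        vTaskDelay(pdMS_TO_TICKS(500));
    }
}
```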
Application faults and reporting
The safety system provides a macro, T76_ASSERT(), which triggers a fault if a given expression evaluates to false. It functions like a standard assert(), except it’s always active, regardless of whether the build is in debug or release mode.
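One plausible shape for such a macro, assuming a handleFault() overload that takes the file, line, and failed expression; the real macro's internals may well differ.

```cpp
// Assumed overload; the actual handler signature isn't shown in the post.
namespace T76::Sys::Safety {
    [[noreturn]] void handleFault(const char *file, int line, const char *expr);
}

#define T76_ASSERT(expr)                                                     \
    do {                                                                     \
        if (!(expr)) {                                                       \
            /* Always compiled in, even in release builds with NDEBUG set. */\
            T76::Sys::Safety::handleFault(__FILE__, __LINE__, #expr);        \
        }                                                                    \
    } while (0)

// Example: trip a fault if an external peripheral fails to initialize.
// T76_ASSERT(externalAdc.begin());
```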
All fault mechanisms ultimately invoke T76::Sys::Safety::handleFault(), which records diagnostic data in a region of uninitialized RAM that survives a reboot because it isn't cleared by the bootloader. The reporting mechanism relies only on statically allocated memory and minimizes stack usage. This increases baseline memory consumption slightly but gives the safety system a fighting chance to operate even if the heap or stack has been corrupted by a rogue process.
In the worst-case scenario—where the reporting mechanism itself fails, resulting in a double fault—either the hard-fault handler or, as a last resort, the hardware watchdog will still restart the system. The downside is that you may lose visibility into the original fault, but the device will at least avoid entering an unsafe state.
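For illustration, a fault record kept in uninitialized RAM might look something like the sketch below. The field layout, names, and validity marker are assumptions; __uninitialized_ram() is a Pico SDK helper that places a variable in a section the runtime does not zero at startup.

```cpp
#include <cstddef>
#include <cstdint>
#include "pico/platform.h"
#include "FreeRTOS.h"
#include "task.h"

struct FaultRecord {
    uint32_t magic;                              // distinguishes a real record from power-up garbage
    uint32_t faultAddress;                       // e.g. the PC captured by the fault handler
    uint32_t freeHeapBytes;                      // heap state at the time of the fault
    char     taskName[configMAX_TASK_NAME_LEN];  // which task was running
};

// Not zeroed at startup, so the record is still readable after the controlled reset.
static FaultRecord __uninitialized_ram(g_faultRecord);

// Called from the fault path: statically allocated storage, no heap, minimal stack.
void recordFault(uint32_t faultAddress) {
    g_faultRecord.magic         = 0x46415431;   // "FAT1", an arbitrary marker
    g_faultRecord.faultAddress  = faultAddress;
    g_faultRecord.freeHeapBytes = xPortGetFreeHeapSize();

    const char *name = pcTaskGetName(nullptr);
    size_t i = 0;
    for (; i < sizeof(g_faultRecord.taskName) - 1 && name[i] != '\0'; ++i)
        g_faultRecord.taskName[i] = name[i];
    g_faultRecord.taskName[i] = '\0';
}
```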
System safing
By itself, the safety system knows nothing about the specific hardware or implementation details of the device it runs on—and therefore, it has no inherent knowledge of how to render that device safe either.
To address this, the system defines the T76::Sys::Safety::SafeableComponent interface. Any class that implements this interface must define two methods: activate() and makeSafe(), and must register itself by calling T76::Sys::Safety::registerSafeableComponent() within its constructor.
By convention, components are assumed to start in a safe, inactive state and will not enter operational mode unless activate() is explicitly called. Once active, invoking makeSafe() should return the component to its disabled, safe condition.
The system does not guarantee the order in which components are activated or made safe. To manage dependencies, you can group components under a parent object and delegate activation and safing order from there.
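Putting the interface together, a component might look like the following sketch. The interface and registration function names come from the post; everything else, including the HeaterDriver example and the exact return types beyond activate()'s bool, is hypothetical.

```cpp
namespace T76::Sys::Safety {
    class SafeableComponent {
    public:
        virtual bool activate() = 0;   // enter operational mode; false signals a fault
        virtual void makeSafe() = 0;   // return to the disabled, safe state
        virtual ~SafeableComponent() = default;
    };
    void registerSafeableComponent(SafeableComponent *component);
}

// Hypothetical output driver: constructed in its safe (off) state, registers
// itself, and only energizes the load once activate() is called.
class HeaterDriver final : public T76::Sys::Safety::SafeableComponent {
public:
    HeaterDriver() { T76::Sys::Safety::registerSafeableComponent(this); }

    bool activate() override {
        return enableOutputStage();    // returning false triggers a controlled reboot
    }

    void makeSafe() override {
        disableOutputStage();          // must be safe to call at any time, even twice
    }

private:
    bool enableOutputStage()  { /* configure GPIO/PWM for the load */ return true; }
    void disableOutputStage() { /* drive the output to its de-energized state */ }
};

// Statically allocated so it already exists when safetyInit() runs at startup.
static HeaterDriver g_heater;
```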
Bringing it all together: The startup process
It’s easy to assume that handling a fault when it occurs is enough to make a system safe. In practice, though, a serious fault can leave the environment in such a corrupted state that even running recovery code may be impossible: the heap might be trashed, the stack exhausted, or critical memory regions overwritten.
While the fault handlers do attempt to render the system safe before rebooting, most of the safety logic actually runs at startup, inside T76::Sys::Safety::safetyInit(). To ensure reliable behavior, you should avoid dynamically allocated components (especially those tied to safety-critical functions) since their existence cannot be guaranteed at all times.
At startup, the logic first forces a call to makeSafe() on all registered components. This “just-in-case” step ensures that nothing remains in an unsafe state. This approach assumes all components are statically allocated, which is generally good practice in embedded systems.
Next, the system checks whether the reboot was triggered by the hardware watchdog. If so, no handler could have run before the reset, so the system records the event as a hardware fault.
Before activation, the system checks whether the device has exceeded its reboot limit. If it has, it enters lockout mode to prevent further restarts.
Finally, the Safety System calls activate() on all registered components. If any component fails to activate (i.e., returns false), the system treats that as a new fault and triggers another controlled reboot.
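In condensed form, the startup sequence reads roughly like this. Apart from safetyInit(), makeSafe(), activate(), and the Pico SDK's watchdog_caused_reboot(), every name below is an illustrative stand-in for the Instrument Core's internals.

```cpp
#include <cstddef>
#include "hardware/watchdog.h"

namespace sketch {

struct SafeableComponent {
    virtual bool activate() = 0;
    virtual void makeSafe() = 0;
    virtual ~SafeableComponent() = default;
};

// Fixed-size, statically allocated registry: no heap involved.
constexpr size_t kMaxComponents = 16;
SafeableComponent *g_components[kMaxComponents];
size_t g_componentCount = 0;

// Stand-ins for the real bookkeeping and fault plumbing.
void recordHardwareWatchdogFault();
bool rebootLimitExceeded();
[[noreturn]] void enterLockoutMode();
[[noreturn]] void handleActivationFault(SafeableComponent *failed);

void safetyInit() {
    // 1. "Just in case": force every registered component into its safe state.
    for (size_t i = 0; i < g_componentCount; ++i)
        g_components[i]->makeSafe();

    // 2. A watchdog-triggered reset means no fault handler ran; record it now.
    if (watchdog_caused_reboot())
        recordHardwareWatchdogFault();

    // 3. Too many consecutive fault reboots? Stay safe and wait for a power cycle.
    if (rebootLimitExceeded())
        enterLockoutMode();

    // 4. Bring the system up; a failed activation is treated as a fresh fault.
    for (size_t i = 0; i < g_componentCount; ++i)
        if (!g_components[i]->activate())
            handleActivationFault(g_components[i]);
}

} // namespace sketch
```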
Lockout mode: never miss a fault again
One thing that absolutely drives me nuts about embedded development is how hard it can be to figure out what actually caused a fault. Even with a debugger or serial cable hooked up (which, I’ll admit, I’m usually too lazy to use unless absolutely necessary), you need to be watching the processor at exactly the right moment to catch the failure in action.
This gets even trickier when using USB as the serial interface: TinyUSB typically doesn’t start until fairly late in the boot process, and even then it depends on FreeRTOS being up and running before it can function properly.
To mitigate this, the Safety System includes a minimal diagnostic stack that runs whenever the device enters lockout mode. This stack is intentionally simple: it spawns one task to manage CDC-over-USB and another that continuously outputs a complete log of all recorded faults leading up to the lockout.
It’s not a replacement for a proper debugger, but it’s a huge help when investigating faults after the fact. The main limitation, of course, is that it only runs once the system enters lockout mode, meaning you can’t use it to inspect one-off or transient faults. Those are still recorded internally by the Safety System, though; we'll make them accessible once we add SCPI support to the IC.
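For a sense of how small that diagnostic stack can be, here is a rough sketch built from FreeRTOS and TinyUSB primitives. The task names, priorities, stack sizes, and the nextFaultLogLine() accessor are assumptions, not the Instrument Core's actual code.

```cpp
#include <cstring>
#include "FreeRTOS.h"
#include "task.h"
#include "tusb.h"

// Assumed accessor that walks the faults recorded in persistent RAM.
const char *nextFaultLogLine();

static void usbDeviceTask(void *) {
    for (;;) {
        tud_task();                       // TinyUSB device housekeeping
        vTaskDelay(pdMS_TO_TICKS(1));
    }
}

static void faultDumpTask(void *) {
    for (;;) {
        if (tud_cdc_connected()) {        // only talk once a host has opened the port
            const char *line = nextFaultLogLine();
            tud_cdc_write(line, strlen(line));
            tud_cdc_write_flush();
        }
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}

void startLockoutDiagnostics() {
    tusb_init();                          // bring up the USB device stack
    xTaskCreate(usbDeviceTask, "usbd", 1024, nullptr, 3, nullptr);
    xTaskCreate(faultDumpTask, "faults", 1024, nullptr, 1, nullptr);
    vTaskStartScheduler();                // lockout mode never returns to the application
}
```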
Limitations
As you can see, there’s a lot happening within the Safety System. But as I mentioned at the start of this post, using it alone is nowhere near enough to make your device truly safe. There are countless reasons why a device’s MPU might never execute a single line of code; that means safe design begins with hardware and must be part of your development philosophy from day one.
Designing systems that are safe by default is well beyond the scope of this project. Hopefully, though, the safety system can help you start your design on the right foot and build a stronger foundation for reliability.
You can find the code for the safety system in the add-safety-mechanisms branch of the code repository.
Marco Tabini