Watchdog feature guide

The Crowd Supply campaign is going very well! Already over 200% funded and still 21 days to go, yay! :)

As part of the campaign I'm supposed to post useful or interesting weekly updates. This week I decided to explain a feature that may cause some confusion: the application watchdog. Since it's not related to the campaign per se, but explains a feature, I decided to reproduce it here as well. Enjoy!

If you are an application developer for embedded systems who ships solutions to remote locations where they can’t be reached for a reset or service, this update is for you.

The LiFePO4wered/Pi+ has many helpful features for different use cases. Some people like to use the on/off button to boot and shut down their Pi, while others turn on auto-boot and/or auto-shutdown to get on/off behavior based on external power for their application. The wake-up timer feature is really cool and helpful in low power, low duty cycle applications, and it’s pretty easy to understand what it does and how to use it.

One feature that people have more trouble with is the application watchdog. Unless you come from an electronics or embedded systems background, you may not be familiar with what a watchdog is or does. So in this update, I’ll explain what it is, why you want it, and how you can use the one provided by the LiFePO4wered/Pi+ to make your Raspberry Pi-based project more reliable.

The Wikipedia article on watchdog timers explains that a watchdog timer is “an electronic timer that is used to detect and recover from computer malfunctions.” Let’s face it: computers are complicated beasts with many moving parts that all need to work correctly. There will always be bugs (or cosmic rays) that will cause something to act up at some point. If your computer is on your desk, you can reboot it when this happens, but what if you are using a computer like a Raspberry Pi in an embedded system? What about in something that needs to run automatically and reliably, may be hard to reach inside a machine, or is located far away in a remote location? The Wikipedia article notes that “watchdog timers are essential in remote, automated systems,” giving the example of a Mars rover. If your Mars rover’s computer crashed, who would go reboot it?

Watchdog timers are usually extremely simple, and this is a good thing. It’s the complexity of computers that makes them susceptible to crashes and hangups, so it makes sense that the system that watches over them should be less susceptible to these issues, hence simpler. Ideally, watchdog timers are implemented in hardware. In the LiFePO4wered/Pi+, the watchdog is implemented in firmware. Not quite as desirable as hardware, but if you compare the complexity of the firmware on the LiFePO4wered/Pi+ (4 kB) with that of the Linux system running on the Pi (4 GB), it’s orders of magnitude simpler.

The basic concept of a watchdog timer is that the software running on the system needs to regularly reset the timer (commonly referred to as “kicking” or “feeding” the dog, depending on how you feel toward dogs), because if you fail to do so, it will reset you instead (the dog will “bite” you) when the time runs out. This ensures that if your system hangs, it can be reset and brought back to a responsive state.
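To make the kick-or-bite idea concrete, here is a minimal pure-Python model of a countdown watchdog. It only illustrates the concept and is not how the LiFePO4wered/Pi+ firmware is written; the class and method names are made up for this sketch.

```python
class WatchdogModel:
    """Toy model of a countdown watchdog (illustration only)."""

    def __init__(self):
        self.remaining = 0     # seconds left before the dog bites
        self.bitten = False    # set once the timer has run out

    def kick(self, timeout):
        """Reload the countdown, as a healthy application does regularly."""
        if not self.bitten:
            self.remaining = timeout

    def tick(self, seconds=1):
        """Advance time; bite if the countdown reaches zero."""
        if self.bitten:
            return
        self.remaining = max(0, self.remaining - seconds)
        if self.remaining == 0:
            self.bitten = True


dog = WatchdogModel()
dog.kick(10)
dog.tick(5)     # application is healthy and kicks again in time
dog.kick(10)
dog.tick(10)    # application hung: no kick arrived, so the timer expires
print(dog.bitten)
```

As long as each kick arrives before the previous timeout runs out, the model never bites; one missed deadline is enough to trigger it.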

In the LiFePO4wered/Pi+ implementation, by default the watchdog is off (WATCHDOG_OFF or 0). This is because it’s conceived as an application watchdog: it’s not just there to ensure the Linux system is up, but also to ensure that whatever application you are running is behaving as it should. So how you implement this is up to you as a developer and depends on what functionality is critical to your application.

The watchdog can be set to two levels by writing to the WATCHDOG_CFG register. The first level, WATCHDOG_ALERT (1), is useful during development or when the Pi system is within view of an operator who can take action. At this level, the PWR LED will start flashing the fast error flash when the timer expires. The second level, WATCHDOG_SHDN (2), will trigger a Raspberry Pi shutdown when the timer expires, and is most likely what you want in production systems.

Note that all WATCHDOG_SHDN does is trigger a shutdown, nothing more. This means you still benefit from the nice shutdown behavior you expect from the LiFePO4wered/Pi+: it will always try to do a clean shutdown first (your application may have crashed but the Linux system may still be running, so why risk corrupting it by doing a hard reset if you don’t have to?). If the clean shutdown doesn’t succeed, though, power will be forced off after the settable shutdown timeout (in case the whole system was locked up). Unless the proper response to a watchdog timeout for your application is to turn completely off and stay off, you should use one of the auto-boot settings to turn the shutdown into a reboot instead, so you get the system back up.

To prevent the watchdog from biting you, your application needs to write a timeout value to the WATCHDOG_TIMER register, and write it again before the most recently written timeout expires. The register can also be read to see how much time is left; it counts down with ten-second resolution. The timeout value you want depends on your application, or even on the specific thing you are doing in your application. You need to find the balance between how long you can allow your system to stay unresponsive if something is wrong, and how long you can expect certain operations to take if everything is working.
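Since the right timeout can differ from one operation to the next, one option is to keep a small table of time budgets and reload the watchdog with the budget for whatever step comes next. The sketch below is only one way to do this: the step names and budgets are hypothetical, and `write_register` stands in for `write_lifepo4wered` so the snippet runs without the hardware.

```python
# Hypothetical time budgets in seconds for each step; tune for your application.
STEP_TIMEOUTS = {
    "read_sensor": 30,
    "send_to_cloud": 120,
}

WATCHDOG_TIMER = "WATCHDOG_TIMER"  # placeholder for the lifepo4wered constant


def allow_time_for(next_step, write_register):
    """Reload the watchdog with the budget for the step about to start.

    Call this only after the previous step has succeeded, so a failing
    step never keeps the watchdog fed.
    """
    timeout = STEP_TIMEOUTS[next_step]
    write_register(WATCHDOG_TIMER, timeout)
    return timeout


# Stand-in for write_lifepo4wered so the sketch runs without the hardware.
writes = []

def fake_write(register, value):
    writes.append((register, value))

allow_time_for("read_sensor", fake_write)
allow_time_for("send_to_cloud", fake_write)
print(writes)
```

On a real system you would pass `write_lifepo4wered` and the imported `WATCHDOG_TIMER` constant instead of the stand-ins.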

The simplest watchdog example application in Python would be something like this:

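from lifepo4wered import write_lifepo4wered, WATCHDOG_TIMER
from time import sleep

while True:
    # Reload the watchdog with a timeout value of 10...
    write_lifepo4wered(WATCHDOG_TIMER, 10)
    # ...and come back to reload it before that timeout can expire.
    sleep(5)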

Of course, this example would have limited use in the real world. It only ensures the Linux system is alive, your Python installation is working and the Raspberry Pi and LiFePO4wered/Pi+ can communicate. Please resist the urge to just add a process like this to your system and think the watchdog is now guarding your application. To be really effective, the watchdog reset calls need to be intertwined with your application logic, and depend on its correct behavior.

Here’s a better example. Imagine you have a Raspberry Pi-based system that reads sensor data and sends it to a cloud server over a cellular modem. Without getting lost in implementation details, this would be an effective use of the watchdog:

from lifepo4wered import write_lifepo4wered, WATCHDOG_TIMER
from time import sleep
from sensor import read_sensor  # Part of your app
from cloud import send_to_cloud # Part of your app

while True:
    data = read_sensor()
    # Only feed the watchdog when the whole chain worked:
    # the sensor was read and the data reached the cloud.
    if data.read_success and send_to_cloud(data):
        write_lifepo4wered(WATCHDOG_TIMER, 120)
    sleep(10)

Here the watchdog guards against many more potential error conditions. First of all, if for some reason we can’t talk to the sensor, the watchdog will never be reset and will therefore reboot the system. The sensor itself may contain firmware that could have crashed, so by powering down the system and restarting, the sensor might be returned to a working condition. The same goes for the cellular modem: if we fail to send data to the cloud because the modem has locked up, we may recover by powering down and back up again.

Note that the watchdog timer value is set much higher than the expected loop time. This allows the system some time to work through adverse conditions without getting rebooted all the time. For instance, if the cellular modem suffers from a bad connection, it may take a while to get the data out. In this case, we’ve decided that if the system isn’t responsive for two minutes, something must be wrong and we let the watchdog reboot us.

There is one last register related to the watchdog that I want to touch on: WATCHDOG_GRACE. This provides a grace period after the system has booted, before your application has had a chance to write to the WATCHDOG_TIMER register for the first time. You can think of it as the initial value for the WATCHDOG_TIMER register after boot. Depending on how heavy your application is, it may take a while to start, and you wouldn’t want the watchdog to reboot the system before your application is ready to go.

The WATCHDOG_TIMER register is the only one your app should have to deal with. The WATCHDOG_CFG and WATCHDOG_GRACE registers are supposed to be configured and written to flash with CFG_WRITE before deployment. This ensures the watchdog is always active with the right configuration so you don’t depend on the system you’re trying to protect to launch the thing that’s supposed to protect it… which might not happen if something is wrong.
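A one-time provisioning step along these lines could be sketched as follows. Treat it as an outline under assumptions rather than a recipe: the register names are placeholder strings instead of the real `lifepo4wered` constants, `write_register` stands in for `write_lifepo4wered`, and the sketch assumes CFG_WRITE expects a specific value, which (like the units of WATCHDOG_GRACE) should be taken from the Product Brief.

```python
# Level values as given in the text; register names below are placeholder strings.
WATCHDOG_OFF, WATCHDOG_ALERT, WATCHDOG_SHDN = 0, 1, 2


def provision_watchdog(write_register, level, grace, cfg_write_value):
    """Set the watchdog level and grace period, then save them to flash."""
    if level not in (WATCHDOG_OFF, WATCHDOG_ALERT, WATCHDOG_SHDN):
        raise ValueError("unknown watchdog level: %r" % (level,))
    # Set the grace period first so it is in place when the watchdog is enabled.
    write_register("WATCHDOG_GRACE", grace)
    write_register("WATCHDOG_CFG", level)
    # Assumption: CFG_WRITE needs a specific value; see the Product Brief.
    write_register("CFG_WRITE", cfg_write_value)


# Stand-in for write_lifepo4wered so the sketch runs without the hardware.
written = []
PLACEHOLDER_VALUE = 0  # NOT the real value; look it up in the Product Brief
provision_watchdog(lambda reg, val: written.append((reg, val)),
                   WATCHDOG_SHDN, 60, PLACEHOLDER_VALUE)
print(written)
```

The grace value of 60 is only an example. Running something like this once before deployment keeps the application itself limited to writing WATCHDOG_TIMER.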

Full details on how to use the watchdog registers (or any register for that matter) can be found in the LiFePO4wered/Pi+ Product Brief. I hope this guide can help you build reliable Raspberry Pi-based systems by using the watchdog!
