The Crowd Supply campaign is going very well! Already over 200% funded and still 21 days to go, yay! :)
As part of the campaign I'm supposed to post useful or interesting weekly updates. This week I decided to explain a feature that may cause some confusion: the application watchdog. Since it's not related to the campaign per se, but explains a feature, I decided to reproduce it here as well. Enjoy!
If you are an applications developer for embedded systems, sending your solutions to remote locations where they can’t be reached for reset or service, this update is for you.
The LiFePO4wered/Pi+ has many helpful features for different use cases. Some people like to use the on/off button to boot and shut down their Pi, while others turn on auto-boot and/or auto-shutdown to get on/off behavior based on external power for their application. The wake-up timer feature is really cool and helpful in low power, low duty cycle applications, and it’s pretty easy to understand what it does and how to use it.
One feature that people have more trouble with is the application watchdog. Unless you come from an electronics or embedded systems background, you may not be familiar with what a watchdog is or does. So in this update, I’ll explain what it is, why you want it, and how you can use the one provided by the LiFePO4wered/Pi+ to make your Raspberry Pi-based project more reliable.
The Wikipedia article on watchdog timers explains that a watchdog timer is “an electronic timer that is used to detect and recover from computer malfunctions.” Let’s face it: computers are complicated beasts with many moving parts that all need to work correctly. There will always be bugs (or cosmic rays) that will cause something to act up at some point. If your computer is on your desk, you can reboot it when this happens, but what if you are using a computer like a Raspberry Pi in an embedded system? What about in something that needs to run automatically and reliably, may be hard to reach inside a machine, or is located far away in a remote location? The Wikipedia article notes that “watchdog timers are essential in remote, automated systems,” giving the example of a Mars rover. If your Mars rover’s computer crashed, who would go reboot it?
Watchdog timers are usually extremely simple, and this is a good thing. It’s the complexity of computers that makes them susceptible to crashes and hangups, so it makes sense that the system that watches over them should be less susceptible to these issues, hence simpler. Ideally, watchdog timers are implemented in hardware. In the LiFePO4wered/Pi+, the watchdog is implemented in firmware. Not quite as desirable as hardware, but if you compare the complexity of the firmware on the LiFePO4wered/Pi+ (4 kB) with that of the Linux system running on the Pi (4 GB), it’s orders of magnitude simpler.
The basic concept of a watchdog timer is that the software running on the system needs to regularly reset the timer (commonly referred to as “kicking” or “feeding” the dog, depending on how you feel toward dogs), because if you fail to do so, it will reset you instead (the dog will “bite” you) when the time runs out. This ensures that if your system hangs, it can be reset and brought back to a responsive state.
In the LiFePO4wered/Pi+ implementation, by default the watchdog is off (WATCHDOG_OFF or 0). This is because it’s conceived as an application watchdog: it’s not just there to ensure the Linux system is up, but also to ensure that whatever application you are running is behaving as it should. So how you implement this is up to you as a developer and depends on what functionality is critical to your application.
The watchdog can be set to two levels by writing to the WATCHDOG_CFG register. The first level, WATCHDOG_ALERT (1), is useful during development or when the Pi system is within view of an operator who can take action. At this level, the PWR LED will start flashing the fast error flash when the timer expires. The second level, WATCHDOG_SHDN (2), will trigger a Raspberry Pi shutdown when the timer expires, and is most likely what you want in production systems. Note that all it does is trigger a shutdown, nothing more. This means you still benefit from the nice shutdown behavior you expect from the LiFePO4wered/Pi+: it will always try to do a clean shutdown first (your application may have crashed but the Linux system may still be running, so why risk corrupting it by doing a hard reset if you don’t have to?). If the clean shutdown doesn’t succeed though, power will be forced off after the settable shutdown timeout (in case the whole system was locked up). Unless the proper response to a watchdog timeout for your application is to turn completely off and stay off, you should use one of the auto-boot settings to turn the shutdown into a reboot instead so you get the system back up.
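To make the three levels concrete, here is a small toy model of the behavior described above. It is only a sketch for reasoning about the logic on your desk: the class name, the `tick` method and the returned strings are made up for illustration, and the real firmware is certainly not implemented this way.

```python
from dataclasses import dataclass

# Level values as given in the text above.
WATCHDOG_OFF = 0
WATCHDOG_ALERT = 1
WATCHDOG_SHDN = 2


@dataclass
class WatchdogModel:
    """Toy model of the described behavior; NOT the actual firmware."""
    cfg: int = WATCHDOG_OFF
    timer: int = 0  # time left before the watchdog bites

    def feed(self, seconds: int) -> None:
        """Stands in for the application writing a new timeout value."""
        self.timer = seconds

    def tick(self, elapsed: int = 1) -> str:
        """Advance time and report what the watchdog would do."""
        if self.cfg == WATCHDOG_OFF:
            return "idle"
        self.timer = max(0, self.timer - elapsed)
        if self.timer > 0:
            return "ok"
        if self.cfg == WATCHDOG_ALERT:
            return "flash PWR LED"
        return "trigger shutdown"
```

Feeding the model with 10 and ticking 5 keeps it happy; skipping a feed lets the timer run out, and the configured level decides whether you get a flashing LED or a shutdown.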
To prevent the watchdog from biting you, your application needs to write a timeout value to the WATCHDOG_TIMER register, and keep doing so before the time that was written last expires. The register can also be read to see how much time is left and counts down with ten-second resolution. The timeout value you want depends on your application, or even the specific thing you are doing in your application. You need to find the balance between how long you can allow your system to stay unresponsive if something is wrong, and how long you can expect certain operations to take if everything is working.
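One way to express "the timeout depends on the specific thing you are doing" is to give each step of your application its own time budget. The sketch below is just an illustration: the step names, the budget values and the `feed_for_step` helper are hypothetical, and `feed` is whatever function actually writes the timeout.

```python
from typing import Callable, Dict

# Hypothetical per-step budgets; tune these for your own application.
STEP_TIMEOUTS: Dict[str, int] = {
    "read_sensor": 30,   # a local read should be quick
    "upload": 120,       # a bad cellular link may need much longer
}


def feed_for_step(feed: Callable[[int], None], step: str, default: int = 60) -> int:
    """Feed the watchdog with the budget for the step about to start."""
    timeout = STEP_TIMEOUTS.get(step, default)
    feed(timeout)
    return timeout
```

On the Pi, `feed` could be something like `lambda t: write_lifepo4wered(WATCHDOG_TIMER, t)`. Because the feed only happens when the application actually reaches the next step, it still depends on the application making progress.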
The simplest watchdog example application in Python would be something like this:
    from lifepo4wered import write_lifepo4wered, WATCHDOG_TIMER
    from time import sleep

    while True:
        write_lifepo4wered(WATCHDOG_TIMER, 10)
        sleep(5)
Of course, this example would have limited use in the real world. It only ensures the Linux system is alive, your Python installation is working and the Raspberry Pi and LiFePO4wered/Pi+ can communicate. Please resist the urge to just add a process like this to your system and think the watchdog is now guarding your application. To be really effective, the watchdog reset calls need to be intertwined with your application logic, and depend on its correct behavior.
Here’s a better example. Imagine you have a Raspberry Pi-based system that reads sensor data and sends it to a cloud server over a cellular modem. Without getting lost in implementation details, this would be an effective use of the watchdog:
    from lifepo4wered import write_lifepo4wered, WATCHDOG_TIMER
    from time import sleep
    from sensor import read_sensor      # Part of your app
    from cloud import send_to_cloud     # Part of your app

    while True:
        data = read_sensor()
        if data.read_success:
            if send_to_cloud(data):
                write_lifepo4wered(WATCHDOG_TIMER, 120)
        sleep(10)
Here the watchdog guards many more potential error conditions. First of all, the watchdog will never be reset and will therefore reboot the system if for some reason we can’t talk to the sensor. The sensor itself may contain firmware that could have crashed, so by powering down the system and restarting, the sensor might be returned to a working condition. The same for the cellular modem. If we fail to send data to the cloud because the modem has locked up, we may recover by powering down and back up again.
Note that the watchdog timer value is set much higher than the expected loop time. This allows the system some time to work through adverse conditions without getting rebooted all the time. For instance, if the cellular modem suffers from a bad connection, it may take a while to get the data out. In this case, we’ve decided that if the system isn’t responsive for two minutes, something must be wrong and we let the watchdog reboot us.
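If you want to check this logic on your desk before deploying, one option is to pull a single loop iteration into a function that receives its collaborators as arguments. This is only a sketch: `run_cycle` and its parameters are names I made up, and catching every exception is a design choice you may not want in your own application.

```python
from typing import Any, Callable


def run_cycle(read_sensor: Callable[[], Any],
              send_to_cloud: Callable[[Any], bool],
              feed_watchdog: Callable[[int], None],
              timeout: int = 120) -> bool:
    """Run one read-and-upload cycle; feed the watchdog only on full success."""
    try:
        data = read_sensor()
        if data.read_success and send_to_cloud(data):
            feed_watchdog(timeout)
            return True
    except Exception:
        # An unexpected error counts as a failed cycle: do not feed.
        pass
    return False
```

On the Pi you would call this in a loop with `feed_watchdog=lambda t: write_lifepo4wered(WATCHDOG_TIMER, t)` and sleep between cycles. Swallowing the exception keeps the loop alive so a temporary glitch can clear up within the timeout, while a persistent failure still ends with the watchdog biting.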
There is one last register related to the watchdog that I want to touch on: WATCHDOG_GRACE. This provides a grace period after the system has booted until your application has the chance to write to the WATCHDOG_TIMER register for the first time. You can think of it as the initial value for the WATCHDOG_TIMER register after boot. Depending on how heavy your application is, it may take a while to start and you wouldn’t want the watchdog to reboot the system before your application is ready to go.
The WATCHDOG_TIMER register is the only one your app should have to deal with. The WATCHDOG_CFG and WATCHDOG_GRACE registers are supposed to be configured and written to flash with CFG_WRITE before deployment. This ensures the watchdog is always active with the right configuration so you don’t depend on the system you’re trying to protect to launch the thing that’s supposed to protect it… which might not happen if something is wrong.
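The pre-deployment step could be sketched as an ordered list of register writes, with CFG_WRITE last so the other two values are in place before they are saved. Treat this as an assumption-heavy illustration: the helper names are mine, the grace value of 60 is arbitrary, and the unlock value 0x46 for CFG_WRITE is from my recollection of the documentation, so please confirm it (and the units of WATCHDOG_GRACE) in the Product Brief before relying on it.

```python
from typing import Callable, List, Tuple

WATCHDOG_SHDN = 2         # level value from the text above
CFG_WRITE_UNLOCK = 0x46   # ASSUMPTION: verify this value in the Product Brief


def provisioning_plan(level: int, grace: int) -> List[Tuple[str, int]]:
    """Ordered register writes; CFG_WRITE comes last so it saves the new values."""
    return [
        ("WATCHDOG_CFG", level),
        ("WATCHDOG_GRACE", grace),
        ("CFG_WRITE", CFG_WRITE_UNLOCK),
    ]


def apply_plan(plan: List[Tuple[str, int]],
               write: Callable[[str, int], None]) -> None:
    """Send each (register name, value) pair to the supplied writer, in order."""
    for name, value in plan:
        write(name, value)
```

Here `write` is a hypothetical adapter: on the Pi it could look up each register name and pass it to `write_lifepo4wered`, assuming your version of the Python module exposes constants with these names. Run it once on the bench, not from the application you are trying to protect.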
Full details on how to use the watchdog registers (or any register for that matter) can be found in the LiFePO4wered/Pi+ Product Brief.
I hope this guide can help you build reliable Raspberry Pi-based systems by using the watchdog!