Practical Guide to Watchdog Timers in Embedded Systems

Overview

In embedded systems development, a watchdog is a mechanism that automatically restarts the system when the program fails, so that the device can recover without manual intervention. This article explains how watchdog timers work, compares the main types, and describes how to apply them correctly.

What a Watchdog Is

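A watchdog is a timer that the monitored program must reset, or "feed," at regular intervals. If the feed does not arrive in time, the watchdog assumes the program has failed and forces a restart. It is important to be clear that a watchdog does not improve the intrinsic stability or reliability of the software. It can only restore service by restarting the application or the whole system after the program has malfunctioned. Fundamental software stability and reliability still depend on disciplined, rigorous programming practices.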

Watchdog Types

Watchdogs are mainly divided into hardware watchdogs and software watchdogs, which differ significantly in implementation and use cases.

1. Hardware Watchdog

The core of a hardware watchdog is a timing circuit that counts toward a timeout. The monitored CPU periodically sends a "feed the watchdog" signal that restarts the timer, so as long as the CPU operates normally the timer never expires. If the CPU fails and stops feeding, the timer expires and the watchdog asserts a reset signal that restarts the CPU; this is commonly described as being "reset by the watchdog." Hardware watchdogs can be further classified into two types:
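
To make the timing behavior concrete, the following C sketch models a watchdog counter in software. The names wdt_model, wdt_init, wdt_feed and wdt_tick are hypothetical, and one tick stands for one period of the watchdog clock; in a real part the countdown and the reset line are implemented in hardware, not in code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a watchdog counter; real hardware does this in silicon. */
typedef struct {
    uint32_t reload;   /* timeout, in watchdog clock ticks */
    uint32_t counter;  /* ticks remaining before the watchdog fires */
    bool     fired;    /* latched once the counter reaches zero */
} wdt_model;

void wdt_init(wdt_model *w, uint32_t timeout_ticks) {
    assert(timeout_ticks > 0);
    w->reload = timeout_ticks;
    w->counter = timeout_ticks;
    w->fired = false;
}

/* "Feed the watchdog": restart the countdown from the full timeout.
 * Once the reset has been triggered, feeding no longer helps. */
void wdt_feed(wdt_model *w) {
    if (!w->fired) {
        w->counter = w->reload;
    }
}

/* Called once per watchdog clock tick; returns true while the reset is asserted. */
bool wdt_tick(wdt_model *w) {
    if (!w->fired && --w->counter == 0) {
        w->fired = true;
    }
    return w->fired;
}
```

As long as wdt_feed is called more often than once per timeout, the counter never reaches zero; missing feeds for a full timeout latches the reset, which mirrors how a CPU that has hung is restarted.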

1.1 External independent hardware watchdog

This watchdog is a separate device outside the MCU/MPU and requires no dedicated watchdog driver. The monitored system only needs to produce a level transition on the watchdog input within a specified interval, typically by toggling a GPIO pin. Its timeout is usually fixed by the hardware, and once it is wired in, software cannot disable it; only removing the connection does. This type offers very high reliability and suits applications with stringent reliability requirements, at the price of extra hardware cost.

External independent hardware watchdogs are mainly used to recover from system hangs caused by harsh operating environments in deployments where manual intervention is not possible, and to restart the system after hardware faults.
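
As described above, the external watchdog is fed by a level transition on its input, not by a steady level, so a CPU that hangs with the pin held high or low stops feeding it. The C sketch below models both sides under that assumption. The pin variable wdi_pin, the helper gpio_write_wdi, and the ext_wdt_chip model are hypothetical stand-ins for a real GPIO register write and a real supervisor chip, whose timeout and input behavior come from its datasheet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical GPIO output wired to the watchdog input (WDI).
 * On real hardware this would be a port register write. */
static bool wdi_pin = false;

static void gpio_write_wdi(bool level) { wdi_pin = level; }

/* CPU side: every call produces exactly one level transition on WDI. */
void ext_wdt_feed(void) {
    gpio_write_wdi(!wdi_pin);
}

/* Model of the external chip: it restarts its timeout only when it sees an edge. */
typedef struct {
    bool     last_level;
    uint32_t timeout_ticks;
    uint32_t remaining;
} ext_wdt_chip;

void chip_init(ext_wdt_chip *c, uint32_t timeout_ticks) {
    assert(timeout_ticks > 0);
    c->last_level = wdi_pin;
    c->timeout_ticks = timeout_ticks;
    c->remaining = timeout_ticks;
}

/* One tick of the chip's own oscillator; returns true when it pulls RESET. */
bool chip_tick(ext_wdt_chip *c) {
    if (wdi_pin != c->last_level) {      /* edge seen: the watchdog was fed */
        c->last_level = wdi_pin;
        c->remaining = c->timeout_ticks;
        return false;
    }
    if (c->remaining > 0) {
        c->remaining--;
    }
    return c->remaining == 0;
}
```

Writing the same level again, as a hung loop stuck on one branch might, does not count as a feed, which is why toggling is preferred over simply driving the pin high.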

1.2 Built-in hardware watchdog

The built-in hardware watchdog uses a dedicated timer inside the processor. A driver in the system must initialize it and perform the feed operations. Its timeout is configurable, and on some processors it can be disabled by software, usually through a special register access sequence. It costs less than an external watchdog, but because it is part of the processor, a fault in the processor itself can disable the watchdog as well, so it suits applications that do not require the highest level of hardware reliability.

Built-in hardware watchdogs are typically configured by a system driver and fed by the application. They are mainly used to recover from application-level faults and from some environment-induced failures.
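
Configuring the timeout of a built-in watchdog usually means choosing a clock prescaler and a reload value. The C sketch below shows that calculation for a hypothetical peripheral, assuming a 32 kHz watchdog clock, prescalers 4 to 256, a 12-bit reload register, and the relation timeout = prescaler × (reload + 1) / clock. These parameters are assumptions for illustration; the real values and formula come from the processor's reference manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed parameters of a hypothetical built-in watchdog peripheral. */
#define WDT_CLOCK_HZ    32000u   /* watchdog clock source */
#define WDT_RELOAD_MAX  0x0FFFu  /* 12-bit reload register */

typedef struct {
    uint32_t prescaler;  /* clock divider: 4, 8, ..., 256 */
    uint32_t reload;     /* value loaded into the down-counter on each feed */
} wdt_config;

/* Picks the smallest prescaler whose range covers timeout_ms, which gives the
 * finest timing resolution. Assumes timeout = prescaler * (reload + 1) / clock.
 * Integer division rounds the timeout down slightly when it does not divide evenly. */
bool wdt_compute_config(uint32_t timeout_ms, wdt_config *out) {
    assert(out != NULL);
    for (uint32_t div = 4; div <= 256; div *= 2) {
        /* Counter ticks needed: timeout_ms * (clock / div) / 1000. */
        uint64_t ticks = (uint64_t)timeout_ms * WDT_CLOCK_HZ / (1000u * div);
        if (ticks >= 1 && ticks - 1 <= WDT_RELOAD_MAX) {
            out->prescaler = div;
            out->reload = (uint32_t)(ticks - 1);
            return true;
        }
    }
    return false;  /* requested timeout is outside the peripheral's range */
}
```

For a 1000 ms timeout this picks prescaler 8 and reload 3999, since 8 × 4000 / 32000 Hz = 1 s. Under these assumptions the longest reachable timeout is 256 × 4096 / 32000 Hz, about 32.8 s.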

2. Software Watchdog

A software watchdog is built from a general-purpose timer inside the processor and a counter maintained in software, typically in the timer's interrupt handler, rather than from a dedicated watchdog circuit. This simplifies the hardware design but is less reliable than a hardware watchdog. For example, if the internal timer fails, the software watchdog cannot detect the problem. Reliability can be improved by using two timers that monitor each other, but this increases system overhead and still leaves some failures uncovered, such as a fault in the interrupt system that prevents timer interrupts from firing.
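
A minimal C sketch of this arrangement follows, assuming a periodic 1 ms timer interrupt whose handler calls swdt_timer_isr. All names are hypothetical, and platform_request_reset is a placeholder for the platform's real software reset routine; here it only records the request so the logic can be checked on a host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software watchdog: a RAM counter serviced by a periodic
 * timer interrupt (assumed 1 ms period) instead of a dedicated circuit. */
static volatile uint32_t swdt_remaining_ms;
static uint32_t swdt_timeout_ms;
static volatile bool swdt_reset_requested;

/* Placeholder for the platform's software reset; here it only records the request. */
static void platform_request_reset(void) { swdt_reset_requested = true; }

void swdt_start(uint32_t timeout_ms) {
    assert(timeout_ms > 0);
    swdt_timeout_ms = timeout_ms;
    swdt_remaining_ms = timeout_ms;
    swdt_reset_requested = false;
}

/* Application feed. This store must be atomic with respect to the ISR: an aligned
 * 32-bit store normally is on a 32-bit MCU; on 8/16-bit parts disable interrupts around it. */
void swdt_feed(void) { swdt_remaining_ms = swdt_timeout_ms; }

/* Body of the 1 ms timer ISR. If this interrupt stops firing, nothing here runs,
 * which is the blind spot of software watchdogs described above. */
void swdt_timer_isr(void) {
    if (swdt_remaining_ms > 0 && --swdt_remaining_ms == 0) {
        platform_request_reset();
    }
}
```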

Software watchdogs are normally implemented by a system driver and fed by the application, and are mainly used to detect application hangs.
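
Because a software watchdog is just counters in RAM, it can easily monitor several application tasks separately and report which one stopped feeding. The C sketch below assumes a fixed table of slots, one per task, each with its own timeout, and a task_wdt_check function called once per timer tick. The names and the table size are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TASKS 4

/* Hypothetical per-task software watchdog: each task owns a slot and must
 * feed it within its own timeout. */
typedef struct {
    uint32_t timeout_ticks;
    uint32_t remaining;
    int      in_use;
} task_wdt_slot;

static task_wdt_slot slots[MAX_TASKS];

/* Returns a slot id, or -1 if the table is full. */
int task_wdt_register(uint32_t timeout_ticks) {
    assert(timeout_ticks > 0);
    for (int i = 0; i < MAX_TASKS; i++) {
        if (!slots[i].in_use) {
            slots[i] = (task_wdt_slot){ timeout_ticks, timeout_ticks, 1 };
            return i;
        }
    }
    return -1;
}

void task_wdt_feed(int id) {
    assert(id >= 0 && id < MAX_TASKS && slots[id].in_use);
    slots[id].remaining = slots[id].timeout_ticks;
}

/* Called once per tick. Decrements every active slot and returns the id of the
 * first slot that expired on this tick, or -1. An expired slot is reported once;
 * the caller decides whether to reset the system or restart that task. */
int task_wdt_check(void) {
    int expired = -1;
    for (int i = 0; i < MAX_TASKS; i++) {
        if (!slots[i].in_use || slots[i].remaining == 0) {
            continue;
        }
        if (--slots[i].remaining == 0 && expired < 0) {
            expired = i;
        }
    }
    return expired;
}
```

Knowing which slot expired, and logging it before the reset, gives developers a concrete starting point for finding the hung task.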

Correct Use of a Watchdog

A watchdog is not a substitute for solving system problems. Faults discovered during debugging should be resolved by fixing design or code errors. A watchdog is intended to handle potential bugs and disturbances from harsh environments that could cause system hangs, allowing the system to automatically recover when unattended. However, a watchdog cannot completely eliminate losses caused by faults. From the moment a fault occurs until the system is reset and recovers, the system may be unavailable. Some systems also need to protect and restore runtime state before and after reset, which may require additional software and hardware overhead.
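
One common way to protect runtime state across a watchdog reset is to keep a small record in RAM that the startup code does not clear, guarded by a magic number and a checksum so that a cold power-on or corrupted memory is detected. The C sketch below shows that idea. Placing the record in an uncleared RAM region (for example a no-init section defined in the linker script) is an assumption about the target and is not shown; on a host the record is an ordinary static variable. The field last_setpoint is a hypothetical piece of application state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define STATE_MAGIC 0x57444F47u  /* arbitrary marker value */

typedef struct {
    uint32_t magic;
    uint32_t boot_count;
    uint32_t last_setpoint;  /* hypothetical application state worth keeping */
    uint32_t checksum;
} retained_state;

/* On a real target this must live in RAM that startup code does not zero
 * (assumed: a dedicated no-init linker section). On the host it is a plain static. */
static retained_state retained;

static uint32_t state_checksum(const retained_state *s) {
    /* Simple XOR check over the fields; a CRC would be stronger. */
    return s->magic ^ s->boot_count ^ s->last_setpoint ^ 0xA5A5A5A5u;
}

/* Save state periodically while running so it survives a watchdog reset. */
void state_save(uint32_t setpoint) {
    retained.last_setpoint = setpoint;
    retained.magic = STATE_MAGIC;
    retained.checksum = state_checksum(&retained);
}

/* At boot: returns true and restores *setpoint if the retained copy is valid;
 * otherwise (cold start or corruption) falls back to default_setpoint. */
bool state_restore(uint32_t *setpoint, uint32_t default_setpoint) {
    assert(setpoint != NULL);
    bool valid = retained.magic == STATE_MAGIC &&
                 retained.checksum == state_checksum(&retained);
    if (!valid) {
        memset(&retained, 0, sizeof retained);
        retained.last_setpoint = default_setpoint;
    }
    retained.boot_count++;
    *setpoint = retained.last_setpoint;
    state_save(retained.last_setpoint);  /* re-seal with the new boot count */
    return valid;
}
```

The extra cost mentioned above shows up here as the periodic state_save calls and the reserved RAM region.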

Reliability Ranking and Feeding

Overall, the reliability ranking is: external independent watchdog > built-in hardware watchdog > software watchdog. An external independent watchdog is typically fed at the system level, so the application does not need to manage it. The watchdog that the application must feed is either the built-in hardware watchdog or a software watchdog, depending on the resources of the platform. The application must feed it within the specified interval to signal that it is still operating correctly. If a defect prevents the program from feeding in time, the system is reset. Developers should treat such resets as evidence of a bug, then locate and fix the underlying cause rather than rely on the restart.
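
Feeding the watchdog only proves something if the feed depends on the application actually making progress. Feeding unconditionally from a periodic timer interrupt, for example, keeps the watchdog happy even while the main loop is hung. The C sketch below shows one common alternative: each task sets a check-in bit, and a supervisor feeds the watchdog only when every bit has been set since the last feed. The task names and hw_wdt_feed are hypothetical; hw_wdt_feed stands in for the write to the real watchdog's feed register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check-in bits, one per task that must keep making progress. */
#define TASK_COMMS    (1u << 0)
#define TASK_CONTROL  (1u << 1)
#define TASK_LOGGING  (1u << 2)
#define ALL_TASKS     (TASK_COMMS | TASK_CONTROL | TASK_LOGGING)

static volatile uint32_t checkins;
static uint32_t feeds_issued;  /* counts stand-in writes to the feed register */

static void hw_wdt_feed(void) { feeds_issued++; }

/* Each task calls this from its own loop. If tasks can preempt each other, the
 * read-modify-write below needs a critical section or an atomic OR on the target. */
void task_checkin(uint32_t task_bit) {
    assert(task_bit != 0 && (task_bit & ~ALL_TASKS) == 0);
    checkins |= task_bit;
}

/* Called periodically, well inside the watchdog timeout. Feeds only if every task
 * has checked in since the last feed, so one hung task lets the watchdog expire.
 * On a target, the test-and-clear should be done atomically to avoid losing a
 * check-in that arrives between the test and the clear. Returns true if it fed. */
bool wdt_supervisor_poll(void) {
    if ((checkins & ALL_TASKS) != ALL_TASKS) {
        return false;
    }
    checkins = 0;
    hw_wdt_feed();
    return true;
}
```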