Saturday, 10 September 2011

Fire-proof Software #2: Designing Fault Tolerant Systems

Follow-up from part 1.

These are a few of the failure conditions we see in real production software, along with some approaches for dealing with them. The general idea is to define activities.

  • Each activity is a unit of work that may either complete or fail and that can be retried if necessary.
  • An activity has a start and an end, as well as a status (Complete/Failed).
  • An activity also has success criteria that allow a process to check the activity actually completed.

The database contains a schedule of those activities, along with a MaxRunTime value and a flag indicating whether each activity is currently running. MaxRunTime allows processes to spot an activity that has timed out.
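
As a rough illustration, one row of the schedule and the timeout test could look like the sketch below. The class and field names (ScheduleEntry, is_running, start_time and so on) are assumptions made for this example and are not taken from a real schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ScheduleEntry:
    """One row of a hypothetical ActivitySchedule table."""
    name: str
    max_run_time: timedelta               # the MaxRunTime value
    is_running: bool = False              # the "currently running" flag
    start_time: Optional[datetime] = None
    status: Optional[str] = None          # "Complete" or "Failed" once finished

    def has_timed_out(self, now: datetime) -> bool:
        # Timed out = still flagged as running past StartTime + MaxRunTime.
        if not self.is_running or self.start_time is None:
            return False
        return now > self.start_time + self.max_run_time


entry = ScheduleEntry("load_prices", timedelta(minutes=30),
                      is_running=True, start_time=datetime(2011, 9, 10, 6, 0))
print(entry.has_timed_out(datetime(2011, 9, 10, 6, 20)))  # False: within MaxRunTime
print(entry.has_timed_out(datetime(2011, 9, 10, 6, 45)))  # True: past MaxRunTime
```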

Input data not available (Failure of one of the upstream systems)

You can’t bet on upstream systems working: whatever the reason, they will fail. There is no point trying to imagine the reasons behind a potential failure; what matters is the impact of the failure on your system and on the business.

  • Have a retry policy for each activity (retry period + retry window)
  • Use a stale version of the data (the previous day's, for instance) if that is acceptable to the business.
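
A minimal sketch of such a retry policy, combined with the stale-data fallback, might look like this. The function name run_with_retry and its parameters are hypothetical; sleep and clock are injectable only so that the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(fetch: Callable[[], T],
                   retry_period: float,
                   retry_window: float,
                   stale: Optional[Callable[[], T]] = None,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> T:
    """Call fetch() every retry_period seconds until it succeeds or the
    retry_window (in seconds) is exhausted, then fall back to stale()."""
    deadline = clock() + retry_window
    while True:
        try:
            return fetch()
        except Exception as exc:       # any upstream failure counts
            last_error = exc
        if clock() + retry_period > deadline:
            break                      # the next attempt would fall outside the window
        sleep(retry_period)
    if stale is not None:
        return stale()                 # e.g. the previous day's data
    raise last_error
```

A call such as run_with_retry(load_today, retry_period=300, retry_window=3600, stale=load_yesterday) would retry every five minutes for an hour before settling for yesterday's data; both loader names are placeholders.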

Notification not available

  • Back up the notification mechanism with an activity schedule: an activity should automatically start at a defined time if it hasn’t started already.
  • If an activity has already started following a notification, the DB flag in the activity schedule table will guarantee that it doesn't start again.
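
One way to get that guarantee is to make the claim a single conditional UPDATE, so that only one caller can flip the flag. The sketch below uses SQLite purely for illustration; the table layout and the try_claim helper are assumptions, and a production database would need the equivalent atomic statement.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, activity: str, start_time: str) -> bool:
    """Atomically flip IsRunning from 0 to 1; True only for the caller that won."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE ActivitySchedule SET IsRunning = 1, StartTime = ? "
            "WHERE Name = ? AND IsRunning = 0",
            (start_time, activity),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ActivitySchedule ("
             "Name TEXT PRIMARY KEY, IsRunning INTEGER NOT NULL, StartTime TEXT)")
conn.execute("INSERT INTO ActivitySchedule VALUES ('load_prices', 0, NULL)")
conn.commit()

first = try_claim(conn, "load_prices", "2011-09-10T06:00:00")   # started by notification
second = try_claim(conn, "load_prices", "2011-09-10T06:05:00")  # scheduled start arrives later
print(first, second)  # True False
```

The same check-and-set serves any situation where two triggers, or two processes, race to start the same activity.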

Process crash

  • Have a Windows service detect the process crash and restart the process.
  • If the process crashes following an unhandled exception, the currently running activity will be automatically marked as failed. When the process starts again, it should attempt to schedule or start the failed activity.
  • If the process is simply killed with no opportunity to mark the current activity as failed, the activity will still appear as running in the ActivitySchedule. When the process starts again, it will see that its activity is currently running and will simply schedule a check at StartTime + MaxRunTime. Obviously the check will fail, and the process will then restart the activity.
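
The restart logic described in the last two points can be sketched as a small decision function. The function names and the returned action strings are hypothetical; the point is only to show the two paths (failed versus still flagged as running).

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def plan_on_restart(is_running: bool,
                    status: Optional[str],
                    start_time: Optional[datetime],
                    max_run_time: timedelta,
                    now: datetime) -> Tuple[str, Optional[datetime]]:
    """Decide what a restarted process does about its activity.

    Returns (action, when): "restart" now, "check" at a later time, or "none".
    """
    if status == "Failed":
        # The crash handler had time to fail the activity: retry right away.
        return ("restart", now)
    if is_running and start_time is not None:
        # Killed without cleanup: the flag may be stale, so verify the
        # success criteria at StartTime + MaxRunTime (or now, if that has passed).
        return ("check", max(now, start_time + max_run_time))
    return ("none", None)


def on_check(success_criteria_met: bool) -> str:
    # After a kill the success criteria are not met, so the activity restarts.
    return "complete" if success_criteria_met else "restart"
```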

Process stalls

  • Kill it with a watchdog thread: this is a pattern used by embedded systems on real-time OSs to reset a stalled CPU. Inside the app server process, the main thread (the one that does all the work) should periodically reset the watchdog flag. If the watchdog flag is not reset within a defined period of time, the watchdog thread fails the current activity and then kills the process.
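
A minimal sketch of the watchdog pattern follows, assuming a hypothetical Watchdog class with an on_stall callback. In a real process the callback would mark the current activity as failed and then terminate the process, as fail_activity_and_exit suggests; the timeout and polling values are arbitrary.

```python
import os
import threading
import time


class Watchdog:
    """Background thread that calls on_stall if reset() is not called in time."""

    def __init__(self, timeout, on_stall, poll_interval=0.05):
        self._timeout = timeout
        self._on_stall = on_stall
        self._poll = poll_interval
        self._last_reset = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.reset()
        self._thread.start()

    def reset(self):
        # Called periodically by the main (worker) thread.
        with self._lock:
            self._last_reset = time.monotonic()

    def stop(self):
        self._stop.set()

    def _run(self):
        # Event.wait returns False on timeout, True once stop() was called.
        while not self._stop.wait(self._poll):
            with self._lock:
                stalled = time.monotonic() - self._last_reset > self._timeout
            if stalled:
                self._on_stall()
                return


def fail_activity_and_exit():
    # Hypothetical handler: mark the current activity as Failed here, then
    # exit immediately so that the supervising service restarts the process.
    os._exit(1)
```

The main loop would create Watchdog(timeout=60, on_stall=fail_activity_and_exit), call start(), and then call reset() after each unit of progress.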

Server down or unreachable

  • Use two servers, one primary and one backup.
  • Each server runs identical processes scheduling the same activities. Both processes will attempt to start their activity at the same time; however, the ActivitySchedule table in the DB will allow only one process to actually start the activity.
  • If one server goes down, all of its processes will simply attempt to reschedule their activities upon restart.
  • If one server goes down while an activity is running, the activity will still appear as running in the ActivitySchedule. Upon restart, processes will see that the activity is marked as running and will schedule a check at the expected completion time.

Database not available

  • Avoid maintaining a single DB connection for too long; that reduces the system's opportunities to reconnect.
  • If you can detect the connection failure and don’t want the complexity of implementing a retry mechanism, at least fail the current activity to take advantage of the activity’s retry policy.
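
Both points can be combined in a small wrapper: open a fresh connection for each activity and, on any database error, fail the activity rather than retrying in place. This is only a sketch using SQLite; run_activity, connect, work and mark_failed are hypothetical names, and mark_failed stands for whatever hands the activity back to its retry policy.

```python
import sqlite3
from contextlib import closing


def run_activity(connect, work, mark_failed) -> bool:
    """Run work(conn) on a short-lived connection; fail the activity on DB errors."""
    try:
        with closing(connect()) as conn:   # connection lives for one activity only
            work(conn)
            conn.commit()
        return True
    except sqlite3.Error as exc:
        # No in-place retry: fail the activity and let its retry policy
        # schedule the next attempt, which will open a new connection.
        mark_failed(str(exc))
        return False
```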
