
Wednesday 28 September 2011

Body of Lies

I just saw Body of Lies by Ridley Scott. At around 1h38 there is a scene where DiCaprio talks on the phone to the bad guys who kidnapped his girlfriend. The call is being recorded, and to make that obvious to the audience there is a massive old-school tape recorder spinning in the background. Cell phones, drones and satellites… You can never go wrong with good old reel-to-reel tape recording.

Sunday 25 September 2011

I Own a Domain Name I Can’t Use!


I bought a domain name through Blogger a few months ago, got my receipt through Google Checkout and went on holiday. The Google Checkout receipt has a link to retrieve the Google Apps account to manage the domain: that link doesn’t work any more and the domain does not appear in Blogger.

My domain fell into a black hole: I can prove I bought it, but I can’t use it!

Saturday 10 September 2011

Fire-proof Software #2: Designing Fault Tolerant Systems


Follow-up from part 1.

These are a few of the failure conditions we see with real production software, along with some approaches to deal with them. The general idea is to define activities.

  • Each activity is a unit of work that may either complete or fail and that can be retried if necessary.
  • An activity has a start and an end, as well as a status (Complete/Failed).
  • An activity also has success criteria that allow a process to check the activity actually completed.

The database contains a schedule of those activities along with a MaxRunTime value and a flag indicating if it's currently running. The MaxRunTime allows processes to spot an activity that timed out.
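To make that concrete, here is a minimal sketch of what one row of the activity schedule could look like. The field names, the extra Pending status and the example activity name are my own assumptions for illustration, not a fixed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "Pending"      # assumption: a state for "not run yet"
    COMPLETE = "Complete"
    FAILED = "Failed"


@dataclass
class ScheduledActivity:
    """One row of the activity schedule (field names are illustrative)."""
    name: str
    max_run_time: timedelta                  # the MaxRunTime value
    is_running: bool = False                 # the "currently running" flag
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    status: Status = Status.PENDING

    def has_timed_out(self, now: datetime) -> bool:
        """True if the activity is still flagged as running after StartTime + MaxRunTime."""
        if not self.is_running or self.start_time is None:
            return False
        return now > self.start_time + self.max_run_time


# Hypothetical activity with a 30 minute budget, started at 06:00.
activity = ScheduledActivity("load_instruments", max_run_time=timedelta(minutes=30))
activity.is_running = True
activity.start_time = datetime(2011, 9, 10, 6, 0)

print(activity.has_timed_out(datetime(2011, 9, 10, 6, 20)))  # False
print(activity.has_timed_out(datetime(2011, 9, 10, 6, 45)))  # True
```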

Input data not available (Failure of one of the upstream systems)

You can’t bet on upstream systems working: whatever the reason, they will fail at some point. There is no point trying to imagine the reasons behind the potential failure. What matters is the impact of the failure on your system and on the business.

  • Have a retry policy for each activity (retry period + retry window)
  • Use a stale version of the data (previous day for instance) if that is acceptable to the business.
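The two points above could be sketched roughly as follows. The function name `run_with_retry`, its parameters and the fake upstream functions are hypothetical; the only ideas taken from the list are the retry period, the retry window and the stale fallback.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(
    fetch: Callable[[], T],
    retry_period: float,
    retry_window: float,
    stale: Optional[Callable[[], T]] = None,
) -> T:
    """Call fetch() every retry_period seconds until it succeeds or
    retry_window seconds have elapsed, then fall back to stale() if given."""
    deadline = time.monotonic() + retry_window
    while True:
        try:
            return fetch()
        except Exception as error:
            last_error = error
        if time.monotonic() + retry_period > deadline:
            break  # the next attempt would fall outside the retry window
        time.sleep(retry_period)
    if stale is not None:
        return stale()
    raise last_error


# Hypothetical upstream that only becomes available on the third attempt.
attempts = {"count": 0}


def fetch_instruments():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("upstream not ready")
    return ["BOND_A", "SWAP_B"]


def always_down():
    raise ConnectionError("upstream is down")


def yesterdays_list():
    return ["BOND_A (stale)"]


result = run_with_retry(fetch_instruments, retry_period=0.01, retry_window=1.0)
fallback = run_with_retry(always_down, retry_period=0.01, retry_window=0.03,
                          stale=yesterdays_list)
print(result, fallback)
```

If no stale source is supplied and the window expires, the last error is re-raised, which is the point where the activity would be marked as failed.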

Notification not available

  • Back up the notification mechanism with an activity schedule: an activity should automatically start at a defined time if it hasn’t already.
  • If an activity has already started following a notification, the DB flag in the activity schedule table will guarantee that it doesn't start again.
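A minimal sketch of the "notification or schedule, whichever comes first" idea, assuming the notification is surfaced inside the process as a `threading.Event` (that plumbing and the function name are my assumptions):

```python
import threading
from datetime import datetime, timedelta


def wait_for_trigger(notification: threading.Event, scheduled_start: datetime) -> str:
    """Block until the notification arrives or the scheduled start time passes.
    Returns which of the two triggered the activity."""
    seconds_left = (scheduled_start - datetime.now()).total_seconds()
    if notification.wait(timeout=max(0.0, seconds_left)):
        return "notification"
    return "schedule"


# Case 1: the notification never arrives, so the schedule takes over.
never_sent = threading.Event()
print(wait_for_trigger(never_sent, datetime.now() + timedelta(seconds=0.05)))

# Case 2: the notification arrives well before the scheduled time.
sent = threading.Event()
threading.Timer(0.01, sent.set).start()
print(wait_for_trigger(sent, datetime.now() + timedelta(seconds=5)))
```

Whichever trigger fires, the process still has to go through the running flag in the schedule table before doing any work, so a late notification cannot start the activity a second time.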

Process crash

  • Have a Windows service detect the process crash and restart the process.
  • If the process crashes following an unhandled exception, the currently running activity will be automatically marked as failed. When the process starts again it should attempt to schedule or start the failed activity.
  • If the process is simply killed with no opportunity to mark the current activity as failed, the activity will still appear as running in the ActivitySchedule. When the process starts again it will see that its activity is flagged as running and will simply schedule a check at StartTime + MaxRunTime. Obviously the check will fail and the process will restart the activity.
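The start-up decision described in the last two points could look something like this. The function name and the action labels ("check", "restart", "start", "nothing") are illustrative only.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def on_process_start(
    status: str,
    is_running: bool,
    start_time: Optional[datetime],
    max_run_time: timedelta,
    now: datetime,
) -> Tuple[str, Optional[datetime]]:
    """Decide what a restarted process does with its activity.
    Returns (action, when)."""
    if is_running and start_time is not None:
        # Flag still set: we were killed mid-run. Verify the success
        # criteria at StartTime + MaxRunTime (or now, if already past).
        return ("check", max(now, start_time + max_run_time))
    if status == "Failed":
        # The crash handler had time to mark the activity as failed.
        return ("restart", now)
    if status == "Complete":
        return ("nothing", None)
    return ("start", now)


started = datetime(2011, 9, 10, 6, 0)
now = datetime(2011, 9, 10, 6, 10)
budget = timedelta(minutes=30)

print(on_process_start("Failed", False, started, budget, now))  # restart now
print(on_process_start("", True, started, budget, now))         # check at 06:30
```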

Process stalls

  • Kill it with a watchdog thread: this is a pattern used by embedded systems on real-time OSs to reset a stalled CPU. Inside the app server process, the main thread (the one that does all the work) should periodically reset the watchdog flag. If the watchdog flag is not reset within a defined period of time, the watchdog thread fails the current activity and then kills the process.
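Here is a rough sketch of such a watchdog. The class and method names are hypothetical, and the demo only records the stall in an event; in a real process the `on_stall` callback would be the place to mark the current activity as failed and then terminate the process (for example with `os._exit`).

```python
import threading
import time
from typing import Callable


class Watchdog:
    """Background thread that calls on_stall() if kick() is not called
    for `timeout` seconds."""

    def __init__(self, timeout: float, on_stall: Callable[[], None]) -> None:
        self._timeout = timeout
        self._on_stall = on_stall
        self._kicked = threading.Event()
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def kick(self) -> None:
        """Called periodically by the main worker thread to prove it is alive."""
        self._kicked.set()

    def stop(self) -> None:
        self._stopped.set()
        self._kicked.set()  # wake the watchdog so it can exit promptly

    def _run(self) -> None:
        while not self._stopped.is_set():
            kicked = self._kicked.wait(timeout=self._timeout)
            if self._stopped.is_set():
                return
            if not kicked:
                self._on_stall()  # no kick within the timeout: the worker stalled
                return
            self._kicked.clear()


stalled = threading.Event()
dog = Watchdog(timeout=0.3, on_stall=stalled.set)
dog.start()

for _ in range(3):            # healthy main loop: kicks faster than the timeout
    time.sleep(0.05)
    dog.kick()
healthy = not stalled.is_set()

stalled.wait(timeout=2.0)     # simulated stall: the main thread stops kicking
print(healthy, stalled.is_set())
```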

Server down or unreachable

  • Use 2 servers, one primary and one backup.
  • Each server runs identical processes scheduling the same activities. Both processes will attempt to start their activity at the same time; however, the ActivitySchedule table in the DB will allow only one process to actually start the activity.
  • If one server goes down, all of its processes will simply attempt to reschedule their activities upon restart.
  • If one server goes down while an activity is running, the activity will still appear as running in the ActivitySchedule. Upon restart processes will see the activity is marked as running and will schedule a check at the expected completion time.
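One way to get the "only one process actually starts the activity" guarantee is a conditional UPDATE on the running flag. This is a sketch using SQLite so that it runs on its own; the column names, the activity name and the `try_claim` helper are assumptions, and a production database would need the same statement to be atomic under its own isolation rules.

```python
import sqlite3


def create_schedule(conn: sqlite3.Connection) -> None:
    """Create a toy ActivitySchedule table with one activity (illustrative schema)."""
    conn.execute(
        "CREATE TABLE ActivitySchedule ("
        " Name TEXT PRIMARY KEY,"
        " IsRunning INTEGER NOT NULL DEFAULT 0,"
        " StartedBy TEXT)"
    )
    conn.execute("INSERT INTO ActivitySchedule (Name) VALUES ('load_instruments')")
    conn.commit()


def try_claim(conn: sqlite3.Connection, activity: str, server: str) -> bool:
    """Set the running flag only if it is not already set.
    Returns True for the one caller that actually flipped the flag."""
    cursor = conn.execute(
        "UPDATE ActivitySchedule SET IsRunning = 1, StartedBy = ?"
        " WHERE Name = ? AND IsRunning = 0",
        (server, activity),
    )
    conn.commit()
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
create_schedule(conn)

primary_won = try_claim(conn, "load_instruments", "primary")
backup_won = try_claim(conn, "load_instruments", "backup")
print(primary_won, backup_won)  # True False
```

The loser of the race does not need to know who won: it simply sees that zero rows were updated and schedules its check for later.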

Database not available

  • Avoid holding a single DB connection open for too long: that reduces the system's opportunities to reconnect.
  • If you can detect the connection failure and don’t want the complexity of implementing a retry mechanism, at least fail the current activity to take advantage of the activity’s retry policy.
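Both points can be sketched together: open a fresh connection per unit of work, and turn any database error into a failed activity rather than trying to recover in place. SQLite stands in for the real database here, and `run_activity` and the two connection factories are hypothetical names.

```python
import sqlite3
from contextlib import closing
from typing import Callable


def run_activity(connect: Callable[[], sqlite3.Connection], work_sql: str) -> str:
    """Open a short-lived connection for this unit of work and translate any
    database error into a Failed status instead of crashing the process."""
    try:
        with closing(connect()) as conn:
            conn.execute(work_sql)
            conn.commit()
        return "Complete"
    except sqlite3.Error:
        # No reconnect logic here: fail the activity and let its
        # retry policy run it again later.
        return "Failed"


def healthy_db() -> sqlite3.Connection:
    return sqlite3.connect(":memory:")


def broken_db() -> sqlite3.Connection:
    # Simulates the database being unavailable at connection time.
    raise sqlite3.OperationalError("unable to open database file")


print(run_activity(healthy_db, "SELECT 1"))  # Complete
print(run_activity(broken_db, "SELECT 1"))   # Failed
```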

Saturday 3 September 2011

Sink-Proof Software #1: Designing Fault Tolerant Systems

Middleware enterprise apps usually do straightforward things: they pump data from here, crunch it a bit, dump it there. Complexity comes from the variety of data sources, their reliability and, more importantly, how much availability (up-time) the business requires.

Say you write an app that processes a daily list of financial instruments to perform a Present Value calculation. What if the input data is not available, i.e. if that list of instruments is not ready? Do you raise a critical fault and give up? Do you use a stale list of instruments? Do you wait a bit and retry? How long should you retry, should you retry forever until the list is available or is there a point where you should give up? Should the retry period be constant or should it increase exponentially to save resources?

What if one of your processes was waiting for a notification from another system and that notification was never received?

What do you do if a process crashes because of an unhandled exception, an out-of-memory exception or a third-party library bug? Is there a system in place to restart the process automatically? Does it take over its task where it left off or does it restart from scratch?

What do you do if a process stalls, because of a blocked DB call for instance, or some synchronous API call that never comes back?

What if that process was handling client requests? Do you have a backup process to handle the requests while the primary process recovers? Can the system tolerate some downtime?

What do you do if the entire application server goes down, whether because an engineer tripped over the power cable, the server room was flooded by Hurricane Irene, a plane crashed on your main datacentre, the datacentre's power supply failed or there is no more network connectivity to the datacentre?

What do you do if the database is not available for some time? Do you store data in memory temporarily? Do you detect it and try and reconnect or do you just fail on all subsequent DB calls? Do you reroute the DB logging to log files so that logs remain accessible?