Saturday, 3 September 2011

Sink-Proof Software #1: Designing Fault Tolerant Systems

Middleware enterprise apps usually do straightforward things: they pump data from here, crunch it a bit, dump it there. Complexity comes from the variety of data sources, their reliability and, more importantly, how much availability (uptime) the business requires.

Say you write an app that processes a daily list of financial instruments to perform a Present Value calculation. What if the input data is not available, i.e. if that list of instruments is not ready? Do you raise a critical fault and give up? Do you use a stale list of instruments? Do you wait a bit and retry? How long should you retry: forever, until the list is available, or is there a point where you should give up? Should the retry period be constant or should it increase exponentially to save resources?
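One way to make those choices explicit is a retry loop with exponential backoff, a cap on the number of attempts and an optional fallback to stale data. The sketch below is only an illustration: `InputNotReady`, `fetch_with_backoff` and all the parameter names and defaults are hypothetical, and the right values depend on what the business can tolerate.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InputNotReady(Exception):
    """Hypothetical error raised when the instrument list is not ready yet."""


def fetch_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    fallback: Optional[Callable[[], T]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except InputNotReady:
            if attempt == max_attempts - 1:
                break  # out of attempts, stop waiting
            # Delay grows as base_delay * 2**attempt, capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
    if fallback is not None:
        return fallback()  # e.g. yesterday's (stale) list of instruments
    raise InputNotReady(f"gave up after {max_attempts} attempts")
```

Passing `fallback=None` turns the last failure into a critical fault, while passing a loader for the previous day's list chooses staleness over unavailability. The `sleep` parameter is injectable so the policy can be tested without actually waiting.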

What if one of your processes was waiting for a notification from another system and that notification was never received?
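A common defence is to never wait without a deadline. As a minimal sketch, assuming the notification is surfaced in-process as a `threading.Event` (the function names and the "proceed"/"escalate" outcomes are made up for illustration):

```python
import threading


def wait_for_notification(event: threading.Event, timeout_s: float) -> bool:
    """Return True if the notification arrived, False if the deadline passed."""
    # Event.wait returns True once the flag is set, False on timeout.
    return event.wait(timeout=timeout_s)


def run_step(event: threading.Event, timeout_s: float) -> str:
    if wait_for_notification(event, timeout_s):
        return "proceed"
    # The notification never came: escalate rather than hang forever.
    return "escalate"
```

What "escalate" means (raise an alert, poll the other system, carry on with defaults) is a business decision; the point is only that the missing notification becomes an explicit, handled case.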

What do you do if a process crashes because of an unhandled exception, an out-of-memory exception or a third-party library bug? Is there a system in place to restart the process automatically? Does it take over its task where it left off or does it restart from scratch?
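Resuming where the process left off usually requires some form of checkpointing. The following sketch assumes the work is a list of items processed in order and that a small JSON file on local disk is an acceptable place for the checkpoint; the file layout and function names are hypothetical.

```python
import json
import os
from typing import Callable, Sequence


def load_checkpoint(path: str) -> int:
    """Return the index of the next item to process (0 if no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    # Write to a temp file, then rename, so a crash cannot leave a
    # half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def process_all(
    items: Sequence[str],
    handle: Callable[[str], None],
    checkpoint_path: str,
) -> None:
    """Process items in order, resuming after the last checkpointed item."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        handle(items[i])
        save_checkpoint(checkpoint_path, i + 1)
```

Note that a crash between `handle` and `save_checkpoint` makes the item run again after restart, so this gives at-least-once processing and `handle` should be idempotent. Restarting the process itself is left to whatever supervisor is in place.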

What do you do if a process stalls, for instance because of a blocked DB call or a synchronous API call that never returns?
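One partial answer is to run the blocking call on a worker thread and stop waiting after a deadline. This is a sketch using Python's standard `concurrent.futures`; `call_with_timeout` is a hypothetical helper name.

```python
import concurrent.futures
import time
from typing import Any, Callable


def call_with_timeout(fn: Callable[..., Any], timeout_s: float, *args: Any) -> Any:
    """Run fn(*args) on a worker thread; raise TimeoutError if it takes too long."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        # Raises concurrent.futures.TimeoutError if no result arrives in time.
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a stalled worker. The thread itself is NOT killed:
        # it keeps running until the underlying call returns.
        executor.shutdown(wait=False)
```

The caller regains control, but the stalled call still holds its thread and whatever resources it acquired. Where the driver or client library offers its own timeout setting, that is generally the cleaner option.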

What if that process was handling client requests? Do you have a backup process to handle the requests while the primary process recovers? Can the system tolerate downtime?

What do you do if the entire application server goes down, whether because an engineer tripped over the power cable, the server room was flooded by Hurricane Irene, a plane crashed into your main datacentre, the datacentre's power supply failed or the datacentre lost its network connectivity?

What do you do if the database is not available for some time? Do you store data in memory temporarily? Do you detect it and try to reconnect, or do you just fail on all subsequent DB calls? Do you reroute the DB logging to log files so that logs remain accessible?
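Rerouting DB logging to a file can be sketched as a thin wrapper around the DB writer. The class below is illustrative only: `FallbackLogger` and `db_write` are hypothetical names, and it assumes the DB layer signals an outage by raising `ConnectionError`.

```python
from typing import Callable


class FallbackLogger:
    """Log to the database, falling back to a local file when it is down."""

    def __init__(self, db_write: Callable[[str], None], fallback_path: str) -> None:
        self._db_write = db_write
        self._fallback_path = fallback_path

    def log(self, message: str) -> str:
        """Write the message and return where it ended up: 'db' or 'file'."""
        try:
            self._db_write(message)
            return "db"
        except ConnectionError:
            # DB unreachable: append to a local file so the entry is not lost.
            with open(self._fallback_path, "a") as f:
                f.write(message + "\n")
            return "file"
```

Each call still tries the database first, so logging returns to the DB as soon as it is reachable again. Replaying the file entries into the DB afterwards would be a separate step that this sketch does not cover.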
