High Availability
Fault Tolerance
Processing should continue if a node or instance has died. This includes servers, VMs, web containers, and databases.
Processing should continue if an external system is down. For example, an insurance system is not critical to a booking system, so the user should still be able to complete a booking when insurance is unavailable.
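One way to picture the insurance example is a booking call that treats the insurance lookup as optional. This is only a sketch: `create_booking`, `quote_insurance_or_skip`, and the `fetch_quote` callable are hypothetical names, and it assumes the insurance client signals an outage by raising `ConnectionError` or `TimeoutError`.

```python
from typing import Callable, Optional


def quote_insurance_or_skip(fetch_quote: Callable[[], float]) -> Optional[float]:
    """Return an insurance quote, or None if the insurance system is down."""
    try:
        return fetch_quote()
    except (ConnectionError, TimeoutError):
        # Non-critical dependency: degrade instead of failing the booking.
        return None


def create_booking(flight: str, fetch_quote: Callable[[], float]) -> dict:
    """Create a booking whether or not an insurance quote could be fetched."""
    quote = quote_insurance_or_skip(fetch_quote)
    return {
        "flight": flight,
        "insurance_quote": quote,
        "insurance_available": quote is not None,
    }
```

The booking carries an `insurance_available` flag so the user interface can hide or disable the insurance option rather than show an error.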
The load balancer should monitor and check that instances are healthy.
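A load balancer usually decides whether an instance is healthy by polling an endpoint on it. The sketch below shows one possible shape for the logic behind such an endpoint. The function name `health_response` and the check names are assumptions, and the 200/503 status codes are a common convention rather than a requirement of any particular load balancer.

```python
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency check and return (HTTP status, per-check result)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception:
            # A check that raises counts as a failed check.
            results[name] = "fail"
    healthy = all(value == "ok" for value in results.values())
    return (200 if healthy else 503), results
```

Only critical dependencies should be included in these checks. Otherwise an outage of a non-critical external system would take every instance out of rotation.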
Logging
The writing of logs should be asynchronous (non-blocking).
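As an illustration of non-blocking logging, Python's standard library provides `QueueHandler` and `QueueListener`: the application thread only puts the record on a queue, and a background thread does the slow write. The helper name `build_async_logger` is hypothetical, and the unbounded queue is an assumption that trades memory for never blocking the caller.

```python
import logging
import logging.handlers
import queue
from typing import Tuple


def build_async_logger(
    name: str, target: logging.Handler
) -> Tuple[logging.Logger, logging.handlers.QueueListener]:
    """Return a logger whose records are written by a background thread."""
    log_queue: queue.Queue = queue.Queue(-1)  # -1 means unbounded

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # The caller's thread only enqueues the record.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))

    # The listener thread drains the queue and calls the real handler.
    listener = logging.handlers.QueueListener(log_queue, target)
    listener.start()
    return logger, listener
```

Call `listener.stop()` during shutdown so that records still in the queue are written before the process exits.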
Logs should be automatically archived / rotated / purged. Check how long it will take before disk space runs out if the archiving job fails.
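The sketch below shows size-based rotation with the standard library's `RotatingFileHandler`, plus the arithmetic for the "how long until the disk is full" check. The helper names and default sizes are assumptions chosen for illustration.

```python
import logging
import logging.handlers


def build_rotating_handler(
    path: str, max_bytes: int = 10 * 1024 * 1024, backups: int = 5
) -> logging.Handler:
    """Rotate the file at roughly max_bytes and keep `backups` old files."""
    return logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )


def max_log_footprint(max_bytes: int, backups: int) -> int:
    """Approximate upper bound on disk used: the live file plus its backups."""
    return max_bytes * (backups + 1)


def hours_until_disk_full(free_bytes: int, log_bytes_per_hour: float) -> float:
    """Estimate how long the disk lasts if rotation or purging stops working."""
    if log_bytes_per_hour <= 0:
        return float("inf")
    return free_bytes / log_bytes_per_hour
```

For example, with 10 GiB free and logs growing at 512 MiB per hour, `hours_until_disk_full(10 * 1024**3, 512 * 1024**2)` gives 20 hours, which is the window in which a failed archiving job must be noticed and fixed.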
The application should still function if the disk is full.
Consider centralised logging, especially if application instances are disposable.
Throttling
Check whether the application can limit incoming requests to protect itself, and what happens when it is flooded with requests. It could drop excess requests or return a busy response.
Check if there is an alert system in place in case throttling is required.
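A token bucket is one common way to limit requests. The sketch below is an illustration, not a recommendation of a specific algorithm: the class name `TokenBucket`, the `handle_request` function, and the choice of HTTP 429 as the busy response are assumptions. The clock is injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate_per_sec`."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_request(bucket: TokenBucket) -> int:
    """Return 200 when the request is accepted, 429 (busy) when throttled."""
    return 200 if bucket.allow() else 429
```

This version is not thread-safe. A multi-threaded server would need a lock around `allow`, and the number of rejected requests is a natural metric to alert on.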
Timeout
A timeout should be set:
- between the application and the database
- for incoming transactions into the application
- for any interaction between tiers
- for interactions between the application and external systems.
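As a minimal sketch of such a timeout, the function below applies one to a raw TCP exchange using the standard `socket` module. The function name `fetch_with_timeout`, the 2-second default, and the choice to return `None` on failure are assumptions. Database drivers and HTTP clients normally expose their own timeout settings, which should be configured explicitly rather than left at their defaults.

```python
import socket
from typing import Optional


def fetch_with_timeout(
    host: str, port: int, payload: bytes, timeout_s: float = 2.0
) -> Optional[bytes]:
    """Send payload and wait at most timeout_s for connect, send, and receive."""
    try:
        # The timeout applies to the connect and to each later socket operation.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(payload)
            return conn.recv(4096)
    except (socket.timeout, ConnectionError):
        # The caller decides whether to retry, degrade, or fail.
        return None
```

Without the timeout, a downstream system that accepts the connection but never answers would hold the calling thread indefinitely.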
Exponential backoff should also be considered if a downstream system is unavailable.
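Exponential backoff can be sketched as a retry loop whose wait doubles after each failure. The function name `call_with_backoff` and its defaults are assumptions, as is treating `ConnectionError` and `TimeoutError` as the retryable failures. The random jitter is an optional addition that keeps many clients from retrying at the same instant.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry operation, doubling the maximum wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # Upper bound grows 0.5s, 1s, 2s, ... and is capped at max_delay_s.
            upper = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, upper))  # jitter: wait a random share of it
    raise ValueError("max_attempts must be at least 1")
```

Capping both the number of attempts and the delay keeps a long outage from tying up requests indefinitely.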