Serious projects must keep working without interruption even when individual subsystems fail. There are many causes of disruption: server hardware failures, software failures, and outages at the data center level. All of these risks can be avoided, or their consequences minimized.
Fault tolerance is a system's ability to continue working properly when individual components fail: servers, communication channels, individual system modules, and so on.
Keep in mind that building and maintaining a fault-tolerant system is more complex and expensive than developing and maintaining an ordinary one. Each specific solution should be designed with economic feasibility in mind. To make the decision criteria objective, you need indicators that let you measure and compare different options.
Fault tolerance is difficult to measure directly, but service availability, expressed as a percentage of uptime, can be measured. For analysis, it is best to measure uptime over long intervals: at least a year, and preferably longer. Uptime in the range of 99.8-99.9% is typical for ordinary projects on shared hosting or a VPS; it corresponds to roughly 45 to 90 minutes of downtime per month, or about 9 to 18 hours of unavailability per year. A figure of about 99.95%, the equivalent of roughly 4 hours of unavailability per year, is already good for single-server installations and for software not originally designed for high fault tolerance. If the required uptime is 99.99% or higher (under an hour of downtime per year), it usually requires both building the appropriate server infrastructure and modifying the project's code base to work in a highly fault-tolerant mode.
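The relationship between an uptime percentage and the downtime it allows is simple arithmetic, and it can be handy to compute it when comparing options. The following is a minimal sketch; the function name `downtime_hours` and the 365-day year are assumptions made for illustration, not part of any standard.

```python
# Assumption: a year is 365 days (8760 hours); leap years are ignored.
HOURS_PER_YEAR = 365 * 24


def downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime that a given availability allows
    over a period (one year by default)."""
    return period_hours * (1 - availability_percent / 100)


# Print the yearly and average monthly downtime for the levels discussed above.
for pct in (99.8, 99.9, 99.95, 99.99):
    per_year = downtime_hours(pct)
    per_month_minutes = per_year / 12 * 60
    print(f"{pct}%: {per_year:.2f} h/year, {per_month_minutes:.0f} min/month")
```

Measuring over a longer period only changes `period_hours`; the percentage-to-downtime conversion stays the same.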
A normal level of availability does not require a fault-tolerant system. Well-written application code and adequate maintenance processes are enough. It is also recommended to use professional hosting companies, which provide redundant communication channels, power, and cooling equipment, and to run single-server installations on reliable dedicated servers, preferably physical rather than virtual.
Achieving a high level of availability requires the techniques used to build fault-tolerant systems, in particular redundancy of all critical subsystems, which allows the application to keep functioning even if one of the components fails. There are two main approaches: horizontal scaling, or duplicating all servers and configuring automatic hot swapping between them. In both cases all critical system components are duplicated; the only differences are in the normal mode of operation and in how a failure is handled.
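To get a rough feel for why duplication helps, one can estimate the combined availability of several redundant copies of a component. The sketch below rests on two strong assumptions that are not claims from this article: failures of the copies are independent, and switching to a healthy copy is instantaneous and always succeeds. Under those assumptions the service is down only when every copy is down at once. The function name `redundant_availability` is hypothetical.

```python
def redundant_availability(single_availability_percent, copies):
    """Estimate the availability (in percent) of `copies` redundant
    components, each with the given individual availability.

    Assumptions: failures are independent and failover is instant and
    perfect, so the service is unavailable only when all copies fail.
    """
    unavailability = 1 - single_availability_percent / 100
    return 100 * (1 - unavailability ** copies)


# Compare a single server with two and three redundant copies.
for copies in (1, 2, 3):
    estimate = redundant_availability(99.9, copies)
    print(f"{copies} x 99.9% -> {estimate:.7f}%")
```

In practice the result is an optimistic upper bound: correlated failures (shared power, network, or software bugs) and the time needed to detect a failure and switch over all reduce the real figure.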