2.3.1 Reliability and unreliability
2.3.2 Basic reliability formulas
2.3.3 Mean time to failure (MTTF)
2.3.4 Mean time between failures (MTBF)
2.3.5 Availability and mean time to repair (MTTR)
2.3.1 Reliability and unreliability
British Standard BS 4778, Section 3.1 (1991) and Section 3.2, defines reliability as "the characteristic of an item expressed by the probability that it performs a required function under required conditions for a stated period of time".
When assessing the reliability of a product, four important elements must be considered:
(a) Function
(b) Conditions of use
(c) Time interval
(d) Probability
These elements are defined in BS 5760 Part 2 (1994), 'Guide to the Assessment of Reliability', Section 4.
Reliability is difficult and time consuming to measure. The accurate assessment of the reliability of a product in use requires a long time or a large number of samples in order to gain statistical confidence in the assessment.
During the design phase, when the product is largely conceptual, or during development, when only prototypes exist, reliability can be more difficult to assess. At this stage, reliability estimates are often made based on experience with similar products and earlier generations. Reliability data accumulates during development and the early phases of introduction to use.
Some failures may be obvious, such as the blowing of a light bulb, but in other cases it may be much harder to determine that a failure has occurred. For example, a complex electronic circuit that must meet a detailed specification has failed if any of its parameters have moved outside their specified limits, although this may not be apparent to the user. An electronic interface circuit may be required to have a certain immunity to noise voltage; failure to maintain this immunity would not be noticed under noise-free conditions, but the symptoms would become apparent under more extreme conditions. In such cases the cause is likely to be difficult to localise.
Failure may occur as a catastrophic failure, i.e. it is complete and sudden, like a light bulb failing, or as degradation, i.e. it is gradual or partial, like an electronic unit moving outside specification. In the case of an electrical supply, a complete loss of power would be catastrophic failure, while voltage or frequency deviation would be regarded as degradation. A failure is primary if it is not caused by failure in another part of the system, and is secondary if it is the result of the failure of another part of the system. Reliability data can only give information about primary failure.
2.3.2 Basic reliability formulas
If the number of components tested is $N_0$, the number of components which have failed by time t is $N_f(t)$ and the number which survive is $N_s(t)$, then

$$R(t) = \frac{N_s(t)}{N_0} \qquad \text{(Equation 1)}$$

where R(t) is known as the reliability function, and

$$Q(t) = \frac{N_f(t)}{N_0} = 1 - R(t) \qquad \text{(Equation 2)}$$

where $N_f(t)$ is the number of failures up to time t. Q(t) is also called the failure probability.
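As a minimal sketch of these two ratios, the following Python estimates R(t) as the surviving fraction and Q(t) as the failed fraction of a test population. The function names and the test figures (1000 components, 50 failures) are hypothetical and are not taken from the text.

```python
def reliability(n_tested: int, n_failed: int) -> float:
    """Estimate R(t): fraction of tested components still working at time t."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    if not 0 <= n_failed <= n_tested:
        raise ValueError("n_failed must lie between 0 and n_tested")
    n_survived = n_tested - n_failed
    return n_survived / n_tested


def unreliability(n_tested: int, n_failed: int) -> float:
    """Estimate Q(t): fraction of tested components that have failed by time t."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    if not 0 <= n_failed <= n_tested:
        raise ValueError("n_failed must lie between 0 and n_tested")
    return n_failed / n_tested


# Hypothetical test: 1000 components on test, 50 failed by time t.
r = reliability(1000, 50)
q = unreliability(1000, 50)
print(f"R(t) = {r:.3f}, Q(t) = {q:.3f}")
```

Because every component has either survived or failed, the two estimates always sum to one.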
Failure Rate
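The probability density f(t) is probably the most fundamental function in reliability theory. It is the rate at which the failure probability grows with time:

$$f(t) = \frac{dQ(t)}{dt} = -\frac{dR(t)}{dt} \qquad \text{(Equation 3)}$$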
To obtain the failure rate, the number of components that fail on a particular day must be related to the number of components that were exposed to failure on that day, not to the initial number of components. The result is a conditional probability: the probability of failure on a given day, subject to the condition that the component has survived to that day. It is known as the failure rate, or hazard rate, λ(t), defined as:

$$\lambda(t) = \frac{f(t)}{R(t)} = \frac{1}{N_s(t)}\,\frac{dN_f(t)}{dt} \qquad \text{(Equation 4)}$$
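The difference between dividing by the initial population and dividing by the survivors can be illustrated with daily failure counts. The sketch below is an assumption-laden illustration: the function name and the failure data are hypothetical, and time is treated in whole days so that derivatives become simple daily differences.

```python
def failure_density_and_hazard(n_tested, failures_per_day):
    """Return (density, hazard), one entry per day.

    density[i]: failures on day i divided by the initial population.
    hazard[i]:  failures on day i divided by the survivors at the start of day i.
    """
    density, hazard = [], []
    survivors = n_tested
    for failed in failures_per_day:
        density.append(failed / n_tested)
        # If nothing is left to fail, the hazard rate is undefined.
        hazard.append(failed / survivors if survivors > 0 else float("nan"))
        survivors -= failed
    return density, hazard


# Hypothetical test: 1000 components, failures recorded on five successive days.
daily_failures = [100, 50, 25, 25, 25]
density, hazard = failure_density_and_hazard(1000, daily_failures)
for day, (f_t, lam_t) in enumerate(zip(density, hazard), start=1):
    print(f"day {day}: f = {f_t:.4f} per day, hazard = {lam_t:.4f} per day")
```

On the first day the two values coincide; afterwards the hazard rate is always the larger of the two, because the same number of failures is shared among fewer surviving components.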
Failure rate curves
Figure 4 shows the failure rate λ(t) plotted against time measured in days. It is often referred to as a bathtub curve and is typical of that obtained for many electronic components. It has three distinct sections:
Early life or burn-in
This is the period (up to day 4 in this illustration) when λ(t) is decreasing as weak or substandard components fail. It is known as the early life, early failure, infant mortality or burn-in period. The reasons for the first three terms should be obvious. Burn-in is a process sometimes used in the final stages of manufacture of components in order to weed out early failures. It involves running the components under normal conditions (or under controlled conditions somewhat more severe than normal, to accelerate the process) for long enough to get through the early life period.
Useful life - normal operating period
This is a period of effectively constant, relatively low λ(t) (from day 5 to day 31 in Figure 4), known as the useful life or normal operating period. During this time, the failure rate is independent of the time for which the component has been run. In other words, a component that is still working has the same probability of failing in any given interval throughout this period.
Figure 4 - Failure rate against time

Wear-out - old age
This is the period when the failure rate increases steeply with time (beyond day 31 in this illustration), known as the wear-out or old age period.
The bathtub curve describes the behaviour that might well be expected for many types of component, or even for complex systems such as a UPS. Taking the familiar example of light bulbs, failures during early life might be due to filaments that have been badly attached to their supports, that have been locally drawn too thin or that have nicks in them. Another cause of early failure could be leaky glass envelopes, leading to filament oxidisation. During useful life the filaments gradually evaporate and become thinner until they break, usually under the thermal stress induced by current surges at switch on. If all bulbs were identical and were operated under identical conditions, they would all fail at the same time. However, since no manufacturing process will produce identical components, some will have serious faults leading to failure during early life, and the failure rate decreases as the weaker components are weeded out. Similarly, it is reasonable to expect a range of failure times during the wear-out period. In the case of light bulbs, filament composition, thickness, length and shape will vary slightly from bulb to bulb, thus leading to a spread in the time at which they finally break. As the bulbs get older, their probability of failure increases, giving a steep rise on the right-hand side of the bathtub curve.
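The shape of the bathtub curve can be imitated with a simple three-term model: a decaying early-failure term, a constant useful-life term and a growing wear-out term. The sketch below is purely illustrative; the functional form and every parameter value are assumptions chosen to resemble the day ranges quoted above, not data from Figure 4.

```python
import math


def bathtub_failure_rate(day, early_amplitude=0.2, early_decay_days=1.5,
                         useful_life_rate=0.02, wearout_start_day=31.0,
                         wearout_growth_days=3.0):
    """Illustrative failure rate (per day) with a bathtub shape.

    All parameters are hypothetical. The early term decays exponentially,
    the useful-life term is constant, and the wear-out term grows
    exponentially, matching the constant term at wearout_start_day.
    """
    early = early_amplitude * math.exp(-day / early_decay_days)
    wearout = useful_life_rate * math.exp(
        (day - wearout_start_day) / wearout_growth_days)
    return early + useful_life_rate + wearout


rates = {day: bathtub_failure_rate(day) for day in (0, 2, 15, 31, 40)}
for day, rate in rates.items():
    print(f"day {day:2d}: failure rate = {rate:.4f} per day")
```

The rate falls during the first few days, stays close to the constant term in the middle of the range and rises steeply towards the end, which is the qualitative behaviour the text describes.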
Memory-less process or catastrophic failure
It is perhaps harder to see why there should be any failures at all during the normal operating period, once early failures have been weeded out and there has not yet been any significant wear. A detailed analysis of the numerous ways in which a light bulb could fail would be extremely complex because of the variety of mechanical, thermal and chemical processes that can take place. The wide range of possibilities, each with its own time dependence, averages out to produce a failure probability that is effectively independent of time. The process is referred to as memory-less, because the probability of failure is independent of previous history. These failures are also sometimes called catastrophic because they are unexpected.
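The memory-less property can be checked numerically. The sketch below assumes the standard exponential survival function R(t) = exp(-λt) that corresponds to a constant failure rate; the rate of 0.01 per hour and the function names are hypothetical.

```python
import math


def survival(hours, rate_per_hour):
    """R(t) for a constant failure rate (exponential model, an assumption here)."""
    return math.exp(-rate_per_hour * hours)


def conditional_survival(age_hours, extra_hours, rate_per_hour):
    """Probability of surviving extra_hours more, given survival to age_hours."""
    return (survival(age_hours + extra_hours, rate_per_hour)
            / survival(age_hours, rate_per_hour))


rate = 0.01  # hypothetical constant failure rate, per hour
new_part = conditional_survival(0, 10, rate)
old_part = conditional_survival(500, 10, rate)
print(f"new part: {new_part:.6f}, part already run 500 h: {old_part:.6f}")
```

Both probabilities are the same to within rounding: under a constant failure rate, a part that has already run for 500 hours is no more and no less likely to survive the next 10 hours than a new one.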
Figure 5 shows a sketch of a typical bathtub curve. Useful life extends from Te to Tw, i.e. after early life failures have occurred and before wear-out starts to be significant.
Failure rate during useful life
The manufacture of components and assemblies often involves a burn-in process so that those subject to early life failure can be removed from the supply chain. In many cases, electronic components do not reach their wear-out period during operational life - they may have useful lives that are much longer than the operational life of the system in which they are used. Routine maintenance procedures may be designed to ensure that components are replaced well before the onset of wear-out. Because of this, it is often possible to assume that components are run only during their period of useful life and that they have a constant failure rate, λu.
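The consequence of a constant failure rate can be made explicit. As a sketch, assuming the usual definition of the hazard rate as the failure probability density divided by the reliability function, with R(0) = 1, a constant rate (written λ here) leads to an exponential reliability function:

```latex
\lambda = \frac{f(t)}{R(t)} = -\frac{1}{R(t)}\,\frac{dR(t)}{dt}
\quad\Longrightarrow\quad
R(t) = e^{-\lambda t}, \qquad
Q(t) = 1 - e^{-\lambda t}, \qquad
f(t) = \lambda\, e^{-\lambda t}
```

This is why the useful-life period is usually described by the exponential distribution.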
Figure 5 - Useful lifetime

2.3.3 Mean time to failure (MTTF)
The mean time to failure is a term applied to non-repairable parts such as light bulbs. It is a measure of the average time to failure of a large number of similar parts operating under specified conditions. The conditions of test are important; for example, an increase in operating temperature will reduce the MTTF of most components. MTTF may be calculated from the equation:
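$$\text{MTTF} = \frac{1}{N_0}\sum_{i=1}^{N_0} t_i$$

where $t_i$ is the time to failure of the i-th of the $N_0$ components tested.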
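In practice, the MTTF is often calculated from data taken over a period of time in which not all the components fail. In this case:

$$\text{MTTF} = \frac{\text{total operating time of all components}}{\text{number of failures in that time}} \qquad \text{(Equation 5)}$$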
The relationship between MTTF and the failure rate λ, MTTF = 1/λ, holds only for the exponential distribution, i.e. when the failure rate is constant. The MTTF can be estimated from the results of reliability tests or from statistics obtained by keeping records of component failures. Record keeping of this kind can form part of the servicing and maintenance procedures for equipment.
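For reference, the link between MTTF and a constant failure rate can be derived in two lines. This is a sketch under the assumption that the reliability function is exponential, R(t) = exp(-λt), and that the mean of the failure time distribution exists:

```latex
\text{MTTF} = \int_0^{\infty} t\, f(t)\, dt
            = \int_0^{\infty} R(t)\, dt
            = \int_0^{\infty} e^{-\lambda t}\, dt
            = \frac{1}{\lambda}
```

The second equality follows from integration by parts. For any other distribution the integral of R(t) still gives the MTTF, but it no longer reduces to the reciprocal of a single rate.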
Example
The maintenance records of a large organisation show that, during a period in which a particular type of equipment accumulated 1,000,000 operating hours, 80 units failed and had to be replaced.
If 80 units fail in 1,000,000 hours, the MTTF of one piece of equipment is

$$\text{MTTF} = \frac{1{,}000{,}000}{80} = 12{,}500 \text{ hours}$$
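The same calculation can be sketched in Python. The function name is hypothetical, and the conversion to a failure rate and a one-year survival probability assumes a constant failure rate (exponential distribution), which the maintenance records alone do not prove.

```python
import math


def mean_time_to_failure(total_operating_hours, failures):
    """MTTF estimated as accumulated operating time divided by failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTTF")
    return total_operating_hours / failures


mttf_hours = mean_time_to_failure(1_000_000, 80)

# Assumption: constant failure rate, so the rate is the reciprocal of MTTF.
failure_rate_per_hour = 1.0 / mttf_hours

# Probability that one unit runs continuously for a year (8760 h) without failing,
# again under the constant failure rate assumption.
one_year_survival = math.exp(-failure_rate_per_hour * 8760)

print(f"MTTF = {mttf_hours:.0f} h")
print(f"failure rate = {failure_rate_per_hour:.2e} per hour")
print(f"one-year survival probability = {one_year_survival:.3f}")
```

Note that an MTTF of 12,500 hours does not mean that every unit lasts 12,500 hours; under the exponential assumption roughly half of the units would fail within the first year of continuous operation.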
2.3.4 Mean time between failures (MTBF)
In the case of components or system elements that can be repaired, failure rates are often expressed in terms of the mean time between failures (MTBF) rather than mean time to failure (MTTF). This is a measure of the average time that a piece of equipment performs its function without requiring repair (although it may require routine scheduled maintenance).
Many components cannot be repaired; they can fail only once in their lifetime after which they have to be replaced. However, in more complex systems, elements can usually be repaired: faulty components on circuit boards or faulty boards in equipment racks can be replaced and breaks in cables can be repaired.
Example
The sets rented out by a television rental company are operated for an average of 900 hours per year. The company's records show that during one year 26,300 of its 210,000 rented televisions had to be repaired.
Total number of operating hours = 210,000 × 900 = 189 × 10^6 hours.

$$\text{MTBF} = \frac{189 \times 10^{6}}{26{,}300} \approx 7{,}186 \text{ hours}$$
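The fleet calculation can be sketched as follows. The function name is hypothetical; the figures are those of the example, and the estimate treats every repair as one failure.

```python
def fleet_mtbf(units, hours_per_unit, repairs):
    """MTBF estimated from pooled operating hours across a fleet of units."""
    if repairs <= 0:
        raise ValueError("at least one repair is needed to estimate MTBF")
    total_operating_hours = units * hours_per_unit
    return total_operating_hours / repairs


mtbf_hours = fleet_mtbf(units=210_000, hours_per_unit=900, repairs=26_300)

# Expressed in years of typical use (900 operating hours per year).
mtbf_years_of_use = mtbf_hours / 900

print(f"MTBF = {mtbf_hours:.0f} operating hours")
print(f"     = {mtbf_years_of_use:.1f} years at 900 hours per year")
```

Pooling the hours of many units is what makes the estimate practical: no single set needs to be observed for anything like the MTBF itself.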
Because MTBF is a statistical quantity, a large number of faults must be recorded in order to establish confidence in the result. Testing one piece of equipment for a very long time is impracticable. It is usual to test a large number of samples simultaneously for a shorter period, and to determine the total number of faults in the total operating time for all of the equipment. This method assumes that burn-in and wear-out failure modes are not involved.
2.3.5 Availability and mean time to repair (MTTR)
The average time needed for repairs is known as the mean time to repair (MTTR). It must be taken into account when calculating availability, and whenever MTBF figures are used to estimate the effective reliability of a system.
In practice the MTTR can depend on a whole range of factors including:
· Time needed to learn about the fault.
· Time needed to locate the fault.
· Time needed to isolate the fault.
· Time needed to gain access to the fault.
· Access to the service engineer and time needed to reach site.
· Availability of (and delivery time of) spare parts.
· Time needed to repair the fault and to make any necessary adjustments and perform tests.
The availability of a system or of a component is the proportion of time for which it is operating correctly. It is the ratio of operational time to total time, which is the sum of the operational and repair times.
$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \qquad \text{(Equation 6)}$$
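A minimal sketch of the availability ratio follows, taking MTBF as the average operational time and MTTR as the average repair time. The function name and the figures (an MTBF of 7,000 hours and an MTTR of 7 hours) are hypothetical.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Proportion of total time for which the system is operating correctly."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: fails every 7000 h on average, takes 7 h to repair.
a = availability(7000.0, 7.0)
expected_downtime_hours = (1.0 - a) * HOURS_PER_YEAR

print(f"availability = {a:.5f}")
print(f"expected downtime = {expected_downtime_hours:.2f} hours per year")
```

Converting the result into expected downtime per year often makes it easier to judge than the availability figure itself, which tends to be a long string of nines.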
System users are sometimes more concerned with the availability than with the reliability of such systems. It is usually important to maximise the proportion of time a system is available, and this can involve trade-offs between component reliability and repair time. For instance, hard-wired components are usually much more reliable than plug-in ones, because of the relatively high failure rates of connections. On the other hand, the repair times of plug-in components may be much shorter than those of hard-wired ones because they can simply be replaced. The use of plug-in components can therefore result in higher availability, despite a higher failure rate. The optimum balance depends on the absolute values of MTBF and MTTR.
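The trade-off can be illustrated with two hypothetical designs. In the sketch below, every MTBF value and every repair phase duration is an invented figure chosen only to show the effect; the repair phases loosely follow the factors listed earlier for MTTR.

```python
def availability(mtbf_hours, mttr_hours):
    """Ratio of operational time to the sum of operational and repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical repair phases, in hours, for each design.
hard_wired_phases = {
    "learn about fault": 0.5, "locate": 1.0, "isolate and access": 1.0,
    "engineer travel": 2.0, "spare parts": 1.5, "repair and test": 2.0,
}
plug_in_phases = {
    "learn about fault": 0.5, "locate": 0.25, "isolate and access": 0.25,
    "engineer travel": 0.0, "spare parts": 0.0, "repair and test": 0.5,
}

hard_wired_mtbf = 50_000.0  # hypothetical: fewer connections, fewer failures
plug_in_mtbf = 20_000.0     # hypothetical: connectors add failures

hard_wired_mttr = sum(hard_wired_phases.values())
plug_in_mttr = sum(plug_in_phases.values())

hard_wired_availability = availability(hard_wired_mtbf, hard_wired_mttr)
plug_in_availability = availability(plug_in_mtbf, plug_in_mttr)

print(f"hard-wired: MTTR {hard_wired_mttr} h, "
      f"availability {hard_wired_availability:.6f}")
print(f"plug-in:    MTTR {plug_in_mttr} h, "
      f"availability {plug_in_availability:.6f}")
```

With these invented figures the plug-in design fails two and a half times as often but is still available for a larger proportion of the time, because each repair is so much shorter. Different figures could easily reverse the conclusion, which is why the absolute values matter.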