Evaluating Availability under Heavy-tailed Repair Time

The time to recover from failures strongly affects the availability of Information Technology (IT) systems. We find that the repair times of two IT systems, an in-house system hosted by IBM and a high-performance computing system at the Los Alamos National Laboratory, follow heavy-tailed power-law distributions with scaling exponents close to one.
This means that the repair times of these systems have infinite variance and may also have infinite mean.
As a result, classical metrics based on the mean time to repair are not suitable for evaluating the availability of these systems.
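The claim about infinite moments can be made concrete as follows. Read the scaling exponent as the tail index $\alpha$ of the complementary distribution function, $P(X > x) \propto x^{-\alpha}$. The abstract does not say which exponent is meant, so this reading is an assumption. Under it, the variance is finite only for $\alpha > 2$ and the mean only for $\alpha > 1$. The minimal Python sketch below estimates $\alpha$ from repair-time samples with the Hill estimator. Hill is a standard tail-index estimator and not necessarily the method the authors used. The number of order statistics `k` and the synthetic data are hypothetical choices.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index alpha, assuming P(X > x) ~ C * x**(-alpha)
    for large x. Uses the k largest observations (k is a tuning choice)."""
    if not 0 < k < len(samples):
        raise ValueError("k must satisfy 0 < k < len(samples)")
    xs = sorted(samples, reverse=True)
    threshold = xs[k]  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("samples must be positive")
    # Mean log-excess of the k largest values over the threshold estimates 1/alpha.
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess


def finite_moments(alpha):
    """Which moments are finite for a power-law tail with index alpha."""
    return {"mean": alpha > 1, "variance": alpha > 2}


if __name__ == "__main__":
    # Synthetic stand-in for repair times: Pareto with alpha = 1.1 (hypothetical).
    random.seed(0)
    repair_times = [random.paretovariate(1.1) for _ in range(100_000)]
    alpha_hat = hill_tail_index(repair_times, k=2_000)
    print(f"estimated alpha = {alpha_hat:.2f}", finite_moments(alpha_hat))
```

On the synthetic data the estimate lands near 1.1. For that value, `finite_moments` reports a finite mean but an infinite variance, so the sample mean of repair times converges only slowly and erratically. On real data the Hill estimate is sensitive to `k`, so it is usually inspected over a range of `k` values.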
We propose a new metric, the $T$-year return value, for evaluating the reliability of IT systems.
The $T$-year return value is the level that the mean repair time exceeds, on average, once every $T$ years; it is estimated using extreme value theory. We evaluate the $T$-year return values of the two IT systems and find that they represent the system availability well.
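The abstract does not spell out the estimation procedure. One standard extreme-value route, assumed here, fits a generalized extreme value (GEV) distribution $G$ with location $\mu$, scale $\sigma$ and shape $\xi$ to block maxima. An example block maximum is the largest mean repair time observed in each year. The $T$-year return value is then the level $z_T$ with $G(z_T) = 1 - 1/T$. For $\xi \neq 0$ this gives $z_T = \mu - \frac{\sigma}{\xi}\left[1 - \left(-\log(1 - 1/T)\right)^{-\xi}\right]$, and for $\xi = 0$ (Gumbel) it gives $z_T = \mu - \sigma \log\left(-\log(1 - 1/T)\right)$. The sketch below only evaluates these formulas for given parameters and omits the fitting step (e.g. maximum likelihood). The parameter values in the example are hypothetical, not results from the paper.

```python
import math


def gev_return_level(mu, sigma, xi, T):
    """Return level z_T with G(z_T) = 1 - 1/T for a GEV(mu, sigma, xi) fitted to
    block maxima (e.g. yearly maxima), i.e. the level exceeded on average once
    every T blocks."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if T <= 1:
        raise ValueError("T must be greater than 1")
    y = -math.log(1.0 - 1.0 / T)  # -log of the non-exceedance probability
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)  # Gumbel (xi -> 0) limit
    return mu - (sigma / xi) * (1.0 - y ** (-xi))


def gev_cdf(z, mu, sigma, xi):
    """GEV distribution function G(z), used here to check the return level."""
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-(z - mu) / sigma))
    t = 1.0 + xi * (z - mu) / sigma
    if t <= 0:
        # Outside the support: below the lower end point (xi > 0)
        # or above the upper end point (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-(t ** (-1.0 / xi)))


if __name__ == "__main__":
    # Hypothetical GEV parameters for yearly maxima of the mean repair time (hours).
    mu, sigma, xi = 10.0, 2.0, 0.5
    for T in (2, 10, 100):
        z = gev_return_level(mu, sigma, xi, T)
        print(f"T = {T:>3} years: return value {z:.2f} h, G(z) = {gev_cdf(z, mu, sigma, xi):.4f}")
```

A positive $\xi$ (a Fréchet-type tail) matches the heavy-tailed behaviour described above. For large $T$ the return value then grows roughly like $\frac{\sigma}{\xi} T^{\xi}$, so longer horizons are dominated by rare, very long repairs.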

By: Sei Kato and Takayuki Osogami

Published in 2007


