Self-Adapting Reliability in Distributed Software Systems
by Yuriy Brun, Jae young Bang, George Edwards, Nenad Medvidovic
Abstract:

Developing modern distributed software systems is difficult in part because they have little control over the environments in which they execute. For example, hardware and software resources on which these systems rely may fail or become compromised and malicious. Redundancy can help manage such failures and compromises, but when faced with dynamic, unpredictable resources and attackers, the system reliability can still fluctuate greatly. Empowering the system with self-adaptive and self-managing reliability facilities can significantly improve the quality of the software system and reduce reliance on the developer predicting all possible failure conditions.

We present iterative redundancy, a novel approach to improving software system reliability by automatically injecting redundancy into the system's deployment. Iterative redundancy self-adapts in three ways: (1) by automatically detecting when the resource reliability drops, (2) by identifying unlucky parts of the computation that happen to deploy on disproportionately many compromised resources, and (3) by not relying on a priori estimates of resource reliability. Further, iterative redundancy is theoretically optimal in its resource use: Given a set of resources, iterative redundancy guarantees to use those resources to produce the most reliable version of that software system possible; likewise, given a desired increase in the system's reliability, iterative redundancy guarantees achieving that reliability using the least resources possible. Iterative redundancy handles even the Byzantine threat model, in which compromised resources collude to attack the system.

We evaluate iterative redundancy in three ways. First, we formally prove its self-adaptation, efficiency, and optimality properties. Second, we simulate it at scale using discrete event simulation. Finally, we modify the existing, open-source, volunteer-computing BOINC software system and deploy it on the globally-distributed PlanetLab testbed network to empirically evaluate that iterative redundancy is self-adaptive and more efficient than existing techniques.

Citation:
Yuriy Brun, Jae young Bang, George Edwards, and Nenad Medvidovic, Self-Adapting Reliability in Distributed Software Systems, IEEE Transactions on Software Engineering (TSE), vol. 41, no. 8, August 2015, pp. 764–780.
Related:
Extended and revised version of "Smart redundancy for distributed computation" in ICDCS 2011.
Bibtex:
@article{Brun15tse,
  author = {Yuriy Brun and Jae young Bang and George Edwards and Nenad Medvidovic},
  title =
  {\href{http://people.cs.umass.edu/brun/pubs/pubs/Brun15tse.pdf}{Self-Adapting
  Reliability in Distributed Software Systems}},
  journal = {IEEE Transactions on Software Engineering (TSE)},
  venue = {TSE},
  year = {2015},
  doi = {10.1109/TSE.2015.2412134},
  volume = {41},
  number = {8},
  month = {August},
  pages = {764--780},
  issn = {0098-5589},
  note = {Extended and revised version of~\cite{Brun11icdcs}.
  \href{http://dx.doi.org/10.1109/TSE.2015.2412134}{DOI:
  10.1109/TSE.2015.2412134}},
	
  previous = {Extended and revised version of "Smart redundancy for
  distributed computation" in ICDCS 2011.},

  abstract = {<p>Developing modern distributed software systems is difficult
  in part because they have little control over the environments in which
  they execute. For example, hardware and software resources on which these
  systems rely may fail or become compromised and malicious. Redundancy can
  help manage such failures and compromises, but when faced with dynamic,
  unpredictable resources and attackers, the system reliability can still
  fluctuate greatly. Empowering the system with self-adaptive and
  self-managing reliability facilities can significantly improve the quality
  of the software system and reduce reliance on the developer predicting all
  possible failure conditions.</p>

<p>We present iterative redundancy, a novel approach to improving software
  system reliability by automatically injecting redundancy into the system's
  deployment. Iterative redundancy self-adapts in three ways: (1) by
  automatically detecting when the resource reliability drops, (2) by
  identifying unlucky parts of the computation that happen to deploy on
  disproportionately many compromised resources, and (3) by not relying on a
  priori estimates of resource reliability. Further, iterative redundancy is
  theoretically optimal in its resource use: Given a set of resources,
  iterative redundancy guarantees to use those resources to produce the most
  reliable version of that software system possible; likewise, given a
  desired increase in the system's reliability, iterative redundancy
  guarantees achieving that reliability using the least resources possible.
  Iterative redundancy handles even the Byzantine threat model, in which
  compromised resources collude to attack the system.</p>

<p>We evaluate iterative redundancy in three ways. First, we formally prove
  its self-adaptation, efficiency, and optimality properties. Second, we
  simulate it at scale using discrete event simulation. Finally, we modify
  the existing, open-source, volunteer-computing BOINC software system and
  deploy it on the globally-distributed PlanetLab testbed network to
  empirically evaluate that iterative redundancy is self-adaptive and more
  efficient than existing techniques.</p>},

  fundedBy = {DARPA N66001-11-C-4021, IARPA N66001-13-1-2006,
  NSF CCF-1117593, NSF CCF-1218115, NSF CCF-1321141},
}