Replication of data and services is a fundamental building block in the design of distributed systems. Though replication has several benefits for large-scale systems distributed across wide-area networks, it also introduces considerable complexity over a centralized unreplicated system. This thesis contributes mechanisms and algorithms that enable the design of simpler and more efficient large-scale replicated systems. On the mechanism front, we present aggressive speculative replication (ASR) as a basic primitive in building large-scale replicated systems. ASR refers to aggressively trading-off costs of hardware resources for making replicas of content and updates in a speculative manner in return for savings in human time and improved service quality. Today, replicated systems are unable to avail of the benefits of ASR as current architectures rely on manually-tuned parameters to manage the complexity of large-scale replicated systems. Consequently, such systems are hard to build and maintain, inefficient in utilizing available resources, and prone to the risk of overload. As a solution, we present an architecture, Mars, that performs ASR in a self-tuning manner, i.e. without the need for manual tuning. To enable realization of Mars' architecture in practical systems, we build TCP Nice, an end-to-end transport protocol for background transfers. We demonstrate the benefits of Mars through a case study of a Web prefetching system, NPS, and show that the Mars approach simplifies the design, efficiently uses available resources to give performance benefits, is robust to the risk of system overload, and is easily deployable without changing existing infrastructure by using only simple end-to-end mechanisms. On the algorithmic front, we make three contributions. First, we present a speculative replication algorithm, namely, long-term prefetching, to minimize average response times at a large cache constrained by bandwidth and validate its effectiveness through simulations using Web traces. Next, we develop a speculative replication algorithm for minimizing response times in a set of hierarchically-organized distributed cooperating caches constrained by bandwidth, and show that it is constant-competitive. Finally, we study the theoretically intriguing problem of minimizing average response times in a set of hierarchically-organized distributed cooperating caches constrained by storage space and show a nonconstant lower bound on the competitive ratio of any online algorithm that is allowed at most a constant-factor space advantage.
@phdthesis{phd-thesis,
author = {Arun Venkataramani},
institution = {Computer Sciences, University of Texas at Austin},
month = {December},
title = {Mechanisms and Algorithms for Large-Scale Replication Systems},
url = {http://www.cs.umass.edu/~arun/papers/arun_phdthesis.pdf},
year = {2004}
}