TranScale - Towards a principled approach to highly available and scalable systems
People
(Responsible)
Heidaripour Lakhani V.
(Collaborator)
Abstract
Many distributed applications have stringent availability and scalability requirements. A highly available application remains operational despite failures; a scalable application can gracefully handle growing amounts of work (e.g., client requests). Designing and implementing systems that both tolerate failures and scale performance is difficult but of utmost importance.State machine replication (SMR) is a well-established approach to designing highly available systems. SMR dictates how client commands must be propagated to and executed by the replicas in order for clients to observe ``single-server behavior" (i.e., strong consistency). In essence, each non-faulty replica must deterministically execute every command in the same order. SMR provides replication transparency for both clients and application programmers. Clients experience intuitive application behavior and programmers can focus on the application's inherent complexity and disregard the complications that replication introduces (e.g., how many replicas must be involved in the execution of a command).SMR provides configurable availability, determined by the number of replicas, but offers limited scalability. Since every replica in the system must execute all commands, performance does not increase with additional replicas. Scalable performance can be achieved in distributed systems with sharding, that is, partitioning and distributing the application state. Just like replication, sharding raises fundamental questions about transparency: Should clients and application programmers be aware that the application state is partitioned? Should clients expect non-intuitive application behavior due to sharding? Should programmers develop applications that account for sharded state?The goal of this project is to develop a principled approach to highly available and scalable distributed systems. The project will investigate techniques that provide replication and partitioning transparency. In the spirit of state machine replication, we posit that clients and application developers should be spared from the complexity introduced by replication and sharding. In order to achieve this goal, the project will tackle four complementary areas, which can be summarized by the following research questions. Although these questions are connected by a unified goal, each question represents a formidable challenge in its own right.- How to execute commands in a sharded and replicated system?- How to order commands in the presence of sharding and replication?- How to partition application state for performance?- How to design applications that are ``partition-friendly"?The rest of this document reviews the state of the art in each of the research topics, describes our contributions towards the project's goals, and detail the problems we intend to tackle and how we plan to reach our goals