Cosmmus 2 - An infrastructure for scalable distributed applications
Many current online services must meet strict availability and performance requirements. High availability implies tolerating server failures, which calls for redundancy of servers and data, that is, replication. Although replication is needed for fault tolerance, having two or more replicas capable of serving client requests introduces additional complexity. For example, how to avoid that two clients simultaneously book the same airline seat, after each one accesses a different replica? Obviously, replicas cannot act alone and must coordinate. But simply having one replica wait for the acknowledge of the other won’t do it since the other may have crashed (accurately detecting the crash of a replica is a problem in its own).
State machine replication is a solution to these problems. The idea is to arrange for replicas to receive and execute client requests in the same order. The execution of each request must be deterministic so that replicas produce the same results after the execution of the same requests. Therefore, clients know that a request has been executed after they receive the first response from a replica. In the example above, even though each client contacts a different replica, their requests will be executed by the two replicas in the same order, and the replicas agree that only the client whose request is executed first should book the seat.
State machine replication is a well-established solution to increasing the availability of services and has been deployed in many production systems. When it comes to performance, however, it offers limited capability. Increasing the number of replicas will not translate into more requests execute per time unit since each replica must process all requests. This project investigates extensions to the state machine replication that will make it configurable with respect to both availability and performance.