Online Data Center Modeling
People
Dang H. T. (Responsible)
External participants
Roscoe Timothy (Third-party collaborator)
Abstract
This project will create a common data model and representation for the state of an operational data center, driven by real-world use-cases and deployments, which can serve as a solid foundation for cross-layer management of entire distributed hardware and software stacks.

Modern data centers are the crucial infrastructure for storing, processing, and distributing information for applications that touch all walks of life, including health care, finance, communications, and other industries. However, data centers are also complex, dynamic, highly networked systems and, as such, their capacity, performance, behavior, and failure modes are difficult to predict, understand, and plan for. A recent survey of federal agencies in the United States reports that 94 percent of federal data centers experience downtime as a result of data center complexity. Reports of data center outages have become commonplace. Notably, outages involving Amazon's EC2 cloud computing facilities in turn took down many important websites, of which Reddit, GitHub, Foursquare, and Airbnb were only the most high-profile. A recent outage at Visa left customers unable to use their credit cards. Less visibly, our commercial collaborators in this project confirm a truism in the industry: ensuring that the applications in a data center meet their performance and availability targets in the face of changes in offered load, reconfigurations, upgrades, and equipment and software failures is an extremely difficult problem that costs companies dearly in expense, energy, and employee time.

A major reason for this complexity is that the many conceptual layers involved in an enterprise data center (physical network connectivity, physical and virtual machines, link layers, VLANs, routing, service-oriented architectures, application deployment, etc.) are managed today by tools and techniques that focus on only one or a few layers. Worse, these layers are typically operated by different divisions within the organization, creating a strong barrier to the development of commercial management solutions. The layers themselves are also becoming more complex.

A key example of this general challenge is the rise of Software-Defined Networking (SDN) techniques, which promise considerably more efficient and flexible networking at the link and IP layers through a centralized "controller" that dynamically creates forwarding table entries in switches. However, this power comes at the cost of increased complexity. SDN control software is itself a complicated distributed system that must maintain distributed state gathered from a variety of heterogeneous devices using asynchronous communication. Anecdotally, network administrators have been hesitant to deploy SDNs because the increased automation makes it harder for them to track down problems and to understand anomalous behavior in the system. As a result of this reluctance, many networks deploy SDN only partially until operators gain familiarity and confidence, leading to configurations in which SDN and legacy network software co-exist in the same network, with all the difficulties that entails. More importantly, today's SDN controllers (primarily based on the OpenFlow standard) operate at the level of IP flows rather than maintaining a global view of the data center state that can be reasoned about online. Moreover, they expose no internal representation that would allow the network layers they manage to be coupled to other layers of the stack (such as the several application levels, or the physical infrastructure).
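To make the contrast concrete, the sketch below is purely illustrative (the entity names and fields are our assumptions, not the API of any existing controller): it juxtaposes the flow-level rule an OpenFlow-style controller installs with the kind of cross-layer record that a whole-data-center model would relate it to.

    # Purely illustrative sketch (hypothetical names and fields, not any
    # real controller API): the contrast between what a flow-level SDN
    # controller sees and the cross-layer state this project would represent.

    # An OpenFlow-style rule describes a single IP flow: match fields plus
    # an action, installed in one switch's forwarding table.
    flow_rule = {
        "switch":   "sw-042",
        "match":    {"ip_src": "10.0.1.7", "ip_dst": "10.0.2.9", "tcp_dst": 443},
        "action":   {"output_port": 3},
        "priority": 100,
    }

    # The same traffic, seen as cross-layer state: which VM generated it,
    # which physical host and VLAN are involved, and which application-level
    # service it belongs to. None of this is visible at the flow level.
    cross_layer_view = {
        "flow":     flow_rule["match"],
        "src_vm":   "vm-web-17",
        "src_host": "host-r3-12",      # physical machine, rack 3, slot 12
        "vlan":     204,
        "service":  "checkout-frontend",
    }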
This project will take a radically different approach. Rather than focusing on mechanisms to control and manage subsets of a data center, we will create a data model and representation of the state of a data center which can be populated and driven by logs, traces, and configuration information; queried by operators to determine global properties of the system (such as traffic matrices); and used to drive online, workload-driven simulations that explore the effects of configuration changes. Such a data model and schema for a modern data center does not at present exist. Beyond these immediate applications, we argue that without such a foundational substrate, any attempt to manage an entire data center will result in a management system which is ad-hoc, brittle in the presence of changes, and highly specific to a single context: in short, it will resemble the "home-grown" systems in use today.

We are in a highly advantageous position to carry out this work. Our collaborators in industry operate data centers and networks which are highly instrumented, and they have agreed to share this instrumentation data with us. The work is also timely from a technological perspective: modern server technology can process the large volumes of trace data generated by a data center (about 2 TB/day in the case of Amadeus, for example) in a compact space, and parallel data processing systems are appearing in the research community which can integrate graph processing, data streams, stored relational data, and continuous, incremental online queries within a single framework - exactly the workload that data center modeling presents.

The primary goal of the proposed project is not to focus on the systems issues involved in building such a system. Instead, we plan to leverage existing research systems such as Naiad, together with ongoing work at ETH on parallel data processing and information management, as much as possible in the short term. As the project matures, we expect that the process of developing the data center model will yield insights into system design that will lay a foundation for future systems work. Nor will the project address the "actuation" part of data center management: it will concentrate solely on building and maintaining a representation rather than taking any action based on it. There are several reasons for this. First, we feel strongly that without a clear logical foundation in representation, control policies will be too ad-hoc, and it is essential to get this representation correct first. Second, it is easier to deploy a prototype system operationally in a real data center (something we plan to do in this project) if it both delivers value to operators, by providing information views not otherwise available to them, and poses no threat to the infrastructure, merely ingesting trace data and providing a query interface over system state.

Rather, the work in this proposal will perform the vital task of creating, refining, deploying, and validating the abstract representation of the state of a data center. In doing so, we will build upon and extend recent work applying ideas from both knowledge representation and programming language semantics to understanding the operation of networks, including existing work by the PIs themselves. If successful, we expect the results of this project to have a broad impact far beyond providing useful tools for data center operators.
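As a minimal sketch of what such a representation might look like (the schema below is our own illustration under assumed entity and field names, not a design from the project), consider typed entities spanning layers, populated from traces, and a query that derives a global property such as a host-to-host traffic matrix:

    # Minimal illustrative sketch of a cross-layer data-center model; all
    # entity and field names are assumptions, not the project's schema.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class VM:
        name: str
        host: str          # physical machine the VM runs on
        service: str       # application-level service it belongs to

    @dataclass(frozen=True)
    class FlowRecord:
        src_vm: str
        dst_vm: str
        bytes: int         # bytes observed for this flow in the traces

    def traffic_matrix(vms: dict[str, VM], flows: list[FlowRecord]) -> dict:
        """Derive a global property (host-to-host traffic matrix) by joining
        flow-level trace data against the VM-to-host placement layer."""
        matrix: dict = defaultdict(int)
        for f in flows:
            matrix[(vms[f.src_vm].host, vms[f.dst_vm].host)] += f.bytes
        return matrix

    # The model is populated from logs, traces, and configuration data...
    vms = {
        "vm-web-1": VM("vm-web-1", host="host-a", service="frontend"),
        "vm-db-1":  VM("vm-db-1",  host="host-b", service="storage"),
    }
    flows = [FlowRecord("vm-web-1", "vm-db-1", bytes=48_000)]

    # ...and queried online for aggregate traffic between physical hosts.
    print(traffic_matrix(vms, flows))   # {('host-a', 'host-b'): 48000}

Because placement, VLAN membership, and service topology would live in one model, the same join pattern could answer questions that today require correlating separate tools across organizational silos.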
Our goal is to provide a shared substrate for diverse data center management functionality, analogous to the way that the relational model provided a common substrate for tabular data in databases. Such an extensible model can act as a disruptive incentive for the management software industry and serve as a basis for interoperability, comparative benchmarking, and verifiable SDN controllers. In the longer term, we hope to greatly further our understanding of design principles for the control planes of networks and distributed systems.
Additional information
Start date
01.12.2015
End date
30.11.2018
Duration
36 Months
Funding sources
SNSF
External partners
In collaboration with Prof. Roscoe, ETHZ
Status
Ended
Category
Swiss National Science Foundation / Project Funding / Mathematics, Natural and Engineering Sciences (Division II)
Publications
- Dang H. T., Bressana P. G., Wang H., Lee K. S., Zilberman N., Weatherspoon H., Canini M., Pedone F., Soulé R. (2019) Partitioned Paxos via the Network Data Plane
- Rogora D., Diwan A., Smolka S., Carzaniga A., Soulé R. (2017) Performance Annotations for Cloud Computing. 9th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '17)