ML2 - Multilevel and Domain Decomposition Methods for Machine Learning
Learning takes time. The dominant method for training deep neural networks, the stochastic gradient method, requires a substantial amount of steps until acceptable weights have been identified. As a consequence, a huge amount of computational resources, time, and energy has to be spent when a network is trained. Various approaches towards finding more efficient training process have been developed, e.g. accelerated gradient methods or methods based on the approximation of second order information. Alternative approaches, such as using less accurate floating point representations for storing the weights can also help to reduce the computational impact of training. Still, the development of efficient training methods remains an open challenge. Two major obstacles can be identified on the way to more efficient training methods. The first obstacle is the non-convexity of the loss functionals, which is related to a quite rich energy landscape with distinctive features. A good set of weights is usually thought to be characterised by a "stable" stationary point in the sense that a smaller variation of the weights will not influence the loss functional significantly. In other words, the properties of the training method directly influence the quality and stability of the resulting approximation provided by the network. The second obstacle is the shear size of the problems - big data induces large scale Hessians, which prevents the straight forward application of many classical techniques from optimization. In fact, whenever classical Newton-like techniques are applied, linear systems have to be set up and possibly solved, which might simply be too large for the available (GPU) memory. Within this project "ML2 - Multilevel and Domain Decomposition for Machine Learning" we will address both of the difficulties explained above by transferring ideas from multilevel methods, in particular multigrid (MG), and domain decomposition (DD) to machine learning. Originally developed for the solution of symmetric positive definite linear systems arising from the discretization of elliptic PDEs, MG and DD are nowadays used as massively parallel solution methods with (quasi-)optimal complexity for various types of linear and non-linear large-scale systems. During the last two decades, the idea of multi-level decompositions has also be applied to convex and non-convex minimization problems, leading to multi-level minimizations methods such as RMTR, Monotone Multigrid, or MG/OPT. These methods show significant to drastic improvements in terms of robustness and speed for large-scale minimization problems, when compared to a simple gradient method. Additionally, they allow for massive parallelism. Within ML2, we will develop parallel multi-level minimization methods for training neural networks, which will be designed to deal with the large scale non-convex minimization problems at hand in an efficient way. ML2 will exploit the optimality of advanced parallel iterative solution methods for machine learning. The binding component between both worlds will be the multilevel decomposition, which can be found in both, neural networks and multigrid methods, as a hierarchy of problem dependent approximation spaces.