Statistical learning and inference on big data with probabilistic graphical models
We propose a principled framework for learning and inference on big data with probabilistic graphical models. The main topics of the project are big-data hypothesis testing; big-data structural learning; big-data inference. Massive data sets involve growth both in the number of data points (instances) and in the number of variables (features). This implies searching over large classes of hypotheses, obtained by combining multiple variables. The number of hypotheses in this way can grow exponentially fast with the number of variables. Moreover often one is interested in long-tail events, which can well be rare despite the availability of massive data, in particular when the number of variables is large. This calls for statistically sound approaches for controlling the error rates when analyzing big data. We address this problem devising non-parametric Bayesian hypothesis tests for big data. We motivate this choice discussing the advantages of Bayesian hypothesis testing over frequentist ones and the advantages of non-parametric over parametric tests. To devise the tests, we will use the Dirichlet process as a prior. In a first approach, we will consider the limiting posterior distribution of the Dirichlet process called Bayesian bootstrap. The bootstrap (both Bayesian and frequentist) is however not feasible on big data, because of its memory requirements. However the recently introduced big-data bootstrap allows computational savings compared to the traditional bootstrap. The big-data bootstrap has been used so far in a frequentist way, namely to assess the uncertainty of the estimates w.r.t. many replication of the data set, sampled from the same population. Our idea is to borrow the computational idea of the big-data bootstrap and to accommodate it within the Bayesian setting. In particular, we will exploit it to sample from the posterior distribution of the Dirichlet process. This would yield the first computational framework for Bayesian non-parametric hypothesis testing on big data. In our second approach, we will consider the extension of the Dirichlet process to robustness called imprecise Dirichlet process; this will help us implement an alternative strategy to develop new robust tests for big data. These will have the great potential of being much faster than the Bayesian bootstrap to realize and also being more robust to long-tail events. Based on these ideas, we will develop different tests, including tests of independence. The tests of independence will be used for structural learning of Bayesian networks. The traditional PC-algorithm identifies the structure of a Bayesian network by performing repeated tests of independence. The independence test are performed in a frequentist way. The PC algorithm has been successfully adopted in the p >> n case (where p denotes the number of features and n the number of data points). Here we plan to make it scalable also to data set where n is huge. We will plug our big-data Bayesian independence test within the PC algorithm. The Bayesian tests will allow us to minimize a wide variety of loss functions while learning the structure. Different loss functions imply a different trade-off between true negative and the false positive errors. Optimising such losses is not possible with the frequentist tests which are tied to control the Type I error or the false-discovery error. Implementing the PC algorithm in parallel seems quite challenging, since it is naturally a sequential task. However we will attempt also this direction. Parallelization plays instead a key role in inference. We will implement large scale distributed inference algorithm for Bayesian network using belief propagation (BP). The running time of BP is linear with the number of edges in the graph. However, a distributed implementation is necessary when dealing with very big graphs. The proposed implementation will be based on Hadoop, an open source version of the MapReduce framework. We propose different ideas. The goal is to decrease both running time and storage of BP, compared to previous approaches dealing with BP on big data. If these ideas prove successful, they will be brought also to probabilistic graphical models other than Bayesian networks, such as conditional random fields. Indeed BP is suitable for inference on different types of probabilistic graphical models. We plan to validate the framework, constituted by learning and inference algorithms, on different big data case studies. Some case studies will be taken from the Kaggle platform, which hosts many competitions on big data.