Bayesian Nonparametric Structural Learning
People
(Principal investigator)
Abstract
With the wide availability of complex data today, more statistical techniques aimed at understanding the underlying dependence structures are needed. Data complexity refers to two separate issues: (i) complex interactions among the random variables under study, for instance in large networks of protein-protein interactions; (ii) complex data types, for instance Twitter text posts and gene regulatory networks. The vast statistical literature on graphical models has dealt extensively with the first kind of complexity, interaction complexity, by assuming multivariate Gaussian data with graph-driven dependencies. The ever-growing set of statistical models for relational data, high-dimensional data, and infinite-dimensional parameters has attacked the second kind of complexity, type complexity, through random network models, sparsity-based inferential methods and random probability measures. In particular, the Bayesian nonparametric literature has proposed several well-motivated prior distributions on random measures, with the purpose of moving away from over-simplified base models that are often justified by mathematical convenience.
Data with complex dependence structure are, with few exceptions, assumed to follow a well-behaved Gaussian distribution, which renders the related statistical models of little use for data with type complexity. On the other hand, statistical models with random distributions tend to be constructed for univariate data or for multivariate data with unstructured dependence, therefore ignoring complex dependencies. In the current project we aim to construct a class of models for statistical inference on data that are complex in both of the senses outlined above: data of complex type with complex interactions. We therefore plan to combine graphical models with Bayesian nonparametric tools. In contrast to the literature, the observations will not be conditionally Gaussian, but will follow some unknown random distribution, to be inferred.
The complex unknown dependence structure will be placed on the atoms of the random distribution, in a hierarchical statistical model. In this way, we propose a joint inferential framework, which we name Bayesian Nonparametric Structural Learning (BNP-SL), for both the random distribution and the dependence structure; it allows our model to depart from the simplistic Gaussian assumption without abandoning its mathematical advantages. A computational algorithm will exploit the decompositions of the random measure and of the graph to provide a practical tool for model selection and estimation. Beyond the general framework, we plan to develop three relevant examples: (i) an undirected decomposable graphical model with a Dirichlet process; (ii) a directed acyclic graph model with a completely random measure; (iii) an essential graph model with a Pitman-Yor process. The methodology will be tested on simulated data, in comparison with established alternatives, and on a gene regulatory network, where gene expression levels are known to be far from multivariate Gaussian and to interact intricately, following the nature of the biological processes.
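To make the idea of placing the dependence structure on the atoms of a random distribution more concrete, the following is a minimal simulation sketch. It is one possible reading, not the project's actual model. It rests on several assumptions that do not come from the abstract: a fixed chain graph (which is decomposable), a truncated stick-breaking approximation of the Dirichlet process, a Gaussian base measure whose precision matrix has zeros wherever the graph has no edge, and a Gaussian kernel used only for convenience. All function names and parameter values are hypothetical.

```python
import numpy as np

def chain_precision(p, rho=0.4):
    """Tridiagonal precision matrix; zeros off the first off-diagonals
    encode the missing edges of a chain graph 1 - 2 - ... - p (assumed graph)."""
    K = np.eye(p)
    for j in range(p - 1):
        K[j, j + 1] = K[j + 1, j] = -rho
    return K

def stick_breaking(alpha, H, rng):
    """Weights of a Dirichlet process truncated at H atoms."""
    v = rng.beta(1.0, alpha, size=H)
    v[-1] = 1.0  # close the stick so the weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    w = v * remaining
    return w / w.sum()  # guard against floating-point drift

def sample_bnp_graph_mixture(n, p, alpha=1.0, H=50, seed=0):
    """Draw n observations from a truncated DP mixture whose atoms are
    Markov with respect to the chain graph (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    K = chain_precision(p)
    cov = np.linalg.inv(K)
    weights = stick_breaking(alpha, H, rng)
    # Atoms: mean vectors from a base measure with precision K / 4,
    # which has the same zero pattern as K.
    atoms = rng.multivariate_normal(np.zeros(p), 4.0 * cov, size=H)
    labels = rng.choice(H, size=n, p=weights)
    # Gaussian kernel around each atom, chosen here only for simplicity.
    noise = rng.multivariate_normal(np.zeros(p), cov, size=n)
    return atoms[labels] + noise, labels, weights, K

data, labels, weights, K = sample_bnp_graph_mixture(n=200, p=4)
print(data.shape, len(np.unique(labels)))
```

Marginally over the atoms and labels, the simulated data follow a mixture rather than a single multivariate Gaussian, while the graph still constrains the dependence among coordinates of each atom. In a full structural learning setting the graph would itself be unknown and given a prior; here it is fixed to keep the sketch short.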