Machine Learning for Scientific Discovery
The growth of computational resources in the natural sciences motivates the use of machine learning for automated scientific discovery. However, unstructured empirical datasets are often high dimensional, unlabeled, and imbalanced. Discarding irrelevant (i.e., noisy and information-poor) features is therefore essential for the automated discovery of governing parameters in scientific environments. To address this challenge, I will present Gaussian Stochastic Gates (STG), which rely on a probabilistic relaxation of the L0 norm, i.e., the number of selected features. By applying the stochastic gates to a neural network's input layer, I will derive a flexible, fully differentiable model that simultaneously identifies the most relevant features and learns complex nonlinear models. The STG neural network outperforms state-of-the-art feature selection methods, both in predictive power and in its ability to identify the correct subset of informative features. The model has been applied successfully to important biological tasks, such as survival analysis with the Cox proportional hazards model and differential expression analysis for HIV and melanoma patients. Next, using a linear model, I will provide a theoretical basis for optimizing the STG objective with small batches (i.e., SGD). In particular, I will present an approximation bound for estimating an unknown signal from noisy observations. Finally, I will show an extension of the STG model to unsupervised feature selection. The new model is trained to select features that are highly correlated with the leading eigenvectors of a gated graph Laplacian. The gating mechanism allows us to re-evaluate the Laplacian for different subsets of features and to unmask informative structures buried by nuisance features. I will demonstrate that the proposed approach outperforms several unsupervised feature selection baselines.
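
To make the gating idea concrete, here is a minimal sketch of one common formulation of Gaussian stochastic gates. It is an illustration under stated assumptions, not the speaker's implementation: each gate is taken to be z_d = clip(mu_d + eps_d, 0, 1) with eps_d ~ N(0, sigma^2), and the relaxed L0 penalty is taken to be the expected number of open gates, sum_d Phi(mu_d / sigma), where Phi is the standard normal CDF. The function names, the value sigma = 0.5, and the toy inputs are all hypothetical.

```python
import math
import random


def sample_gates(mu, sigma=0.5, rng=random):
    """Sample one gate per feature: z_d = clip(mu_d + eps_d, 0, 1), eps_d ~ N(0, sigma^2).

    The Gaussian form and the default sigma are assumptions made for this sketch.
    """
    return [min(1.0, max(0.0, m + rng.gauss(0.0, sigma))) for m in mu]


def expected_open_gates(mu, sigma=0.5):
    """Relaxed L0 penalty: sum_d P(z_d > 0) = sum_d Phi(mu_d / sigma)."""
    return sum(0.5 * (1.0 + math.erf(m / (sigma * math.sqrt(2.0)))) for m in mu)


def gate_inputs(x, z):
    """Multiply each input feature by its gate before it reaches the network."""
    return [xi * zi for xi, zi in zip(x, z)]


# Toy example with three features (all values are hypothetical).
rng = random.Random(0)
mu = [2.0, 0.5, -2.0]   # gate parameters; in training these would be learned
x = [1.0, -3.0, 4.0]    # one input sample
z = sample_gates(mu, rng=rng)
gated = gate_inputs(x, z)
penalty = expected_open_gates(mu)
```

In a full model, the gated inputs would feed a neural network, and the training loss would add the penalty (scaled by a regularization weight) to the prediction loss, so that gradient descent pushes the gate parameters of uninformative features toward closed gates.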
Last updated: 08/12/2020