Overview and prospect of graph models in molecular informatics
-
Graphical Abstract
-
Abstract
With the rapid growth of molecular data and advances in deep learning, artificial intelligence (AI) has made significant strides in molecular informatics. Molecular informatics is an emerging field that integrates chemistry, computational science, and AI and employs data-driven methods to decode relationships between molecular structures and their properties, thereby supporting drug design and material discovery. Molecular representation learning (MRL), the core research topic of molecular informatics, aims to encode molecular structures and properties into numerical vectors that serve as efficient representations for downstream tasks. High-quality molecular representations are critical for accurate property prediction, optimization, and generation. Traditional rule-based MRL methods rely on hand-crafted features, which are time-consuming to design and depend on expert knowledge. Sequence-based MRL methods, such as those built on Simplified Molecular Input Line Entry System (SMILES) strings, often place bonded atoms at distant positions in the sequence, leading to suboptimal representations that fail to fully capture spatial and topological information. Because molecules naturally form graph structures with atoms as nodes and bonds as edges, graph-based models can operate on these molecular graphs directly. Recently, graph models have shown strong performance in representing complex structures, learning cross-scale features, and performing constrained optimization. Consequently, graph-based MRL methods have achieved significant advances in molecular property prediction and generation. In this review, we first introduce the evolution of molecular representation methods, focusing especially on 2D and 3D molecular graph representations. We then classify graph models into discriminative and generative categories and discuss their concepts and applications.
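The contrast between sequence and graph representations can be made concrete with a small sketch. The parser below is an illustrative assumption, not a method from this review: it handles only a restricted SMILES subset (single-letter atoms, parenthesized branches, single-digit ring closures) and ignores bond order. The function name `smiles_to_graph` is hypothetical. It builds the atoms-as-nodes, bonds-as-edges view and reports how far apart two bonded atoms can sit in the string.

```python
# Single-letter atom symbols accepted by this sketch (an assumed subset).
ATOMS = set("BCNOPSFIcnops")


def smiles_to_graph(smiles):
    """Parse a restricted SMILES subset into (atoms, bonds).

    atoms: list of atom symbols in order of appearance in the string.
    bonds: list of (i, j) index pairs into `atoms`.
    """
    atoms = []
    bonds = []
    prev = None          # index of the atom the next atom will bond to
    branch_stack = []    # atoms to return to when a branch closes
    ring_open = {}       # ring-closure digit -> atom index that opened it
    for ch in smiles:
        if ch in ATOMS:
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in ring_open:
                # Second occurrence of the digit closes the ring.
                bonds.append((ring_open.pop(ch), prev))
            else:
                ring_open[ch] = prev
        elif ch in "-=#":
            continue  # bond order is ignored in this sketch
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


# Cyclohexane: six carbons in a ring.
atoms, bonds = smiles_to_graph("C1CCCCC1")
# Largest gap, in atom positions, between two atoms that share a bond.
max_gap = max(j - i for i, j in bonds)
```

For cyclohexane, the ring-closing bond joins atoms 0 and 5, so two directly bonded atoms are five positions apart in the sequence, while in the graph they are simply neighbors. A production pipeline would use a full cheminformatics toolkit rather than this simplified parser.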
Graph discriminative models encode topological structures and node/edge features to capture nonlinear structure-property relationships for classification and regression tasks. Graph generative models learn molecular distributions in order to optimize existing structures or design novel compounds with desired properties. Next, we review the commonly used datasets, evaluation metrics, and research progress related to molecular property prediction and molecular generation. The goal of molecular property prediction is to predict physical and chemical properties by analyzing internal molecular information such as atomic numbers and coordinates, helping researchers rapidly identify suitable candidates from large pools of potential compounds. We group property prediction methods into three categories: 2D graph-based, 3D graph-based, and domain knowledge-integrated approaches, and we introduce recent representative methods for each category. For molecular generation, the goal is to learn the latent distribution from limited datasets and to generate novel structures that meet specific chemical functions through sampling and decoding. We introduce four widely used frameworks for molecular generation: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Normalizing Flows, and Diffusion Models. VAEs use latent variables to model the data distribution, encoding complex molecular information into compact embeddings. GANs enhance the authenticity and diversity of generated molecules through an adversarial process between generators and discriminators. Normalizing Flow models use reversible mappings to achieve efficient sampling and exact evaluation of the probability density, enabling accurate modeling of the distribution of generated molecules. Diffusion models generate molecules through forward noise addition and reverse denoising, which can optimize properties while preserving chemical validity.
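The forward noise-addition half of a diffusion model can be sketched in a few lines. The example below assumes the standard DDPM closed form, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, applied to 3D atomic coordinates with a linear beta schedule. The schedule values, the function names, and the three-atom geometry are illustrative assumptions, and the learned reverse denoising network is not shown.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(coords, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [
        tuple(signal * x + noise * rng.gauss(0.0, 1.0) for x in atom)
        for atom in coords
    ]


# Hypothetical 3-atom geometry in angstroms (assumed, for illustration only).
x0 = [(0.00, 0.00, 0.00), (0.96, 0.00, 0.00), (-0.24, 0.93, 0.00)]
alpha_bars = linear_alpha_bars(num_steps=1000)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
x_early = add_noise(x0, alpha_bars[0], rng)   # first step: nearly unchanged
x_late = add_noise(x0, alpha_bars[-1], rng)   # last step: almost pure noise
```

Because `alpha_bars` decreases monotonically from just below 1 toward 0, early steps keep most of the original geometry and late steps are dominated by Gaussian noise. Generation then amounts to learning to run this process in reverse.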
Finally, we discuss future research directions for graph models in molecular informatics from the perspectives of large-scale pre-training, explainable AI, and multimodal learning strategies. This review aims to help molecular informatics researchers quickly identify cutting-edge studies and applicable methods, while clarifying technical pathways for AI researchers to promote more efficient algorithm design and implementation.
-
-