Overview and prospects of graph models in molecular informatics

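  • Abstract: Molecular informatics is a frontier field at the intersection of chemistry and artificial intelligence. It uses data-driven methods to uncover the relationships between molecular structure and properties, providing technical support for key tasks such as drug design and functional materials development. Molecular representation learning, its core foundation, encodes molecular structures as numerical vectors that preserve their topological and physicochemical properties, supplying efficient feature representations for downstream tasks. Molecules have a natural graph structure, with atoms as nodes and chemical bonds as edges, so a molecular graph describes their topology directly. With their strong capacity for modeling topology, graph models have become a key approach to molecular representation learning. By task, they fall into two classes: graph discriminative models, which capture the nonlinear mapping between molecular structure and properties, and graph generative models, which perform inverse design of novel molecular structures that satisfy specified functions. In recent years, graph models have shown clear advantages in tasks such as molecular property prediction and molecular generation, with important advances in structural representation, cross-scale feature learning, and constraint-optimized generation. This paper first introduces molecular representation methods and the basic concepts of graph models. Next, around the two core tasks of molecular property prediction and molecular generation, it summarizes commonly used datasets and evaluation metrics and reviews, by category, the characteristics and research progress of different graph models. Finally, in light of emerging trends such as large-scale pre-training, explainability methods, and multimodal learning, it discusses the application potential of graph models in molecular informatics and outlines future research directions.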

     

    Abstract: With the rapid growth of molecular data and advances in deep learning, artificial intelligence (AI) has made significant strides in molecular informatics. Molecular informatics is an emerging field that integrates chemistry, computational science, and AI and employs data-driven methods to decode relationships between molecular structures and their properties, thereby supporting drug design and material discovery. Molecular representation learning (MRL), the core research topic of molecular informatics, aims to encode molecular structures and properties into numerical vectors that provide efficient representations for downstream tasks. High-quality molecular representations are critical for accurate property prediction, optimization, and generation. Traditional rule-based MRL methods rely on hand-crafted features, which are time-consuming to design and expert-dependent. Sequence-based MRL methods, such as those built on Simplified Molecular Input Line Entry System (SMILES) strings, often place bonded atoms at distant positions in the sequence, leading to suboptimal representations that fail to fully capture spatial and topological information. Because molecules naturally form graph structures with atoms as nodes and bonds as edges, graph-based models can operate on these molecular graphs directly. Recently, graph models have shown exceptional performance in representing complex structures, learning cross-scale features, and performing constrained optimization. Consequently, graph-based MRL methods have achieved significant advances in molecular property prediction and generation. In this review, we first introduce the evolution of molecular representation methods, focusing on 2D and 3D molecular graph representations. We then classify graph models into discriminative and generative categories and discuss their concepts and applications.
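As a rough illustration of why sequence representations can separate bonded atoms, the sketch below parses a deliberately tiny SMILES subset and compares string positions with bond connectivity. The parser, the function names `smiles_bonds` and `max_string_gap`, and the supported subset (single-letter atoms, parentheses for branches, single-digit ring closures, no explicit bond symbols) are assumptions made for this example and are not taken from the review; a real toolkit should be used for actual SMILES handling.

```python
def smiles_bonds(smiles):
    """Parse a tiny SMILES subset (illustrative only, not a full parser).

    Supported: single-letter atoms, '(' and ')' branches, single-digit
    ring closures. Returns (atoms, bonds), where atoms is a list of
    (symbol, string_position) and bonds is a list of atom-index pairs.
    """
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when a branch closes
    ring_open = {}      # ring-closure digit -> atom index that opened it
    prev = None         # index of the atom the next atom bonds to
    for pos, ch in enumerate(smiles):
        if ch.isalpha():
            atoms.append((ch, pos))
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in ring_open:
                # Second occurrence of the digit closes the ring.
                bonds.append((ring_open.pop(ch), prev))
            else:
                ring_open[ch] = prev
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def max_string_gap(smiles):
    """Largest distance in the string between two directly bonded atoms."""
    atoms, bonds = smiles_bonds(smiles)
    return max(abs(atoms[a][1] - atoms[b][1]) for a, b in bonds)


# Cyclohexane: the first and last carbons are bonded through the ring
# closure, yet they sit at opposite ends of the string.
atoms, bonds = smiles_bonds("C1CCCCC1")
print(bonds)
print(max_string_gap("C1CCCCC1"))
```

In the graph view every bond is a single edge, whereas in the string the ring-closing bond spans the whole sequence; this is the gap that graph-based representations avoid.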
Graph discriminative models encode topological structures and node/edge features to capture nonlinear structure-property relationships for classification and regression tasks. Graph generative models learn molecular distributions in order to optimize existing structures or design novel compounds with desired properties. Next, we review the commonly used datasets, evaluation metrics, and research progress for molecular property prediction and molecular generation. Molecular property prediction aims to predict physical and chemical properties from intrinsic molecular information such as atomic numbers and coordinates, helping researchers rapidly identify suitable candidates from large pools of potential compounds. We group property prediction methods into three categories (2D graph-based, 3D graph-based, and domain knowledge-integrated approaches) and introduce recent representative methods for each. Molecular generation aims to learn the latent distribution from limited datasets and to generate, through sampling and decoding, novel structures that fulfill specific chemical functions. We introduce widely used frameworks for molecular generation, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Normalizing Flows, and Diffusion Models. VAEs use latent variables to model the data distribution, encoding complex molecular information into compact embeddings. GANs improve the realism and diversity of generated molecules through an adversarial process between a generator and a discriminator. Normalizing flow models use invertible mappings to achieve efficient sampling and precise control of the probability density, enabling accurate modeling of the distribution of generated molecules. Diffusion models generate molecules through forward noise addition and reverse denoising, which can optimize properties while preserving chemical validity.
Finally, we discuss future research directions for graph models in molecular informatics from the perspectives of large-scale pre-training, explainable AI, and multimodal learning. This review aims to help molecular informatics researchers quickly identify cutting-edge studies and applicable methods, and to clarify technical pathways for AI researchers so as to promote more efficient algorithm design and implementation.
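To make the idea of encoding a molecular graph into a numerical vector concrete, here is a minimal sketch of one round of neighbor aggregation followed by a sum readout, using ethanol's heavy atoms. It is an illustration under simplifying assumptions rather than any specific model from the review: the one-hot atom features, the `VOCAB` list, and the function names are hypothetical, and practical graph neural networks add learned weights, nonlinearities, and edge features.

```python
# Ethanol heavy-atom graph (CH3-CH2-OH): atoms are nodes, bonds are edges.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

# Hypothetical feature scheme: one-hot encoding over a tiny element vocabulary.
VOCAB = ["C", "O"]


def one_hot(symbol):
    return [1.0 if symbol == v else 0.0 for v in VOCAB]


def build_adjacency(n_atoms, bonds):
    """Undirected adjacency lists from a list of bonded atom-index pairs."""
    adj = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def message_passing_step(features, adj):
    """New feature of each atom = its own feature + sum of neighbor features.

    This is unweighted sum aggregation; no learned parameters are involved.
    """
    updated = []
    for i, own in enumerate(features):
        agg = list(own)
        for j in adj[i]:
            agg = [x + y for x, y in zip(agg, features[j])]
        updated.append(agg)
    return updated


def readout(features):
    """Sum atom features into one fixed-size vector for the whole molecule."""
    return [sum(column) for column in zip(*features)]


features = [one_hot(s) for s in atoms]
adj = build_adjacency(len(atoms), bonds)
h1 = message_passing_step(features, adj)
graph_vector = readout(h1)
print(h1)
print(graph_vector)
```

After one step, each atom's feature already reflects its bonded neighbors, and the readout does not depend on atom ordering. A discriminative model would feed such a vector to a classifier or regressor.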

     
