Biomolecule Generation with Antioxidant and Anti-inflammatory Properties

Deep Learning for Computational Biology

Project Details

This project, authored by Quentin Velard and Salma Bouaouda under Guenael Cabanes at École des Mines de Nancy, uses deep learning to generate biomolecules with anti-inflammatory (AI) and antioxidant (AO) properties. The work supports the "Biomolecules 4 Bioeconomy" framework, with potential applications in pharmaceuticals, agrochemicals, and cosmetics.

Objectives

The project focuses on two main goals:

  1. State-of-the-Art Review: Analyze existing deep learning methods for molecular generation and identify pretrained models to adapt.
  2. Model Training: Develop neural networks using transfer learning to create biomolecules with AI and AO properties.

The space of possible molecules is estimated at approximately \(10^{60}\), and open-source data is limited, so traditional discovery is slow. Deep learning aims to speed up the search.

🗝️ Key Concepts Explained

  • Anti-inflammatory (AI) Biomolecules: Reduce inflammation by targeting mediators such as cytokines. Examples include curcumin and NSAIDs.
  • Antioxidant (AO) Biomolecules: Neutralize free radicals to prevent damage (e.g., vitamins C and E, BHT).
  • SMILES Notation: Represents molecular structures as text (e.g., "CCO" for ethanol) for computational use.
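To make the SMILES idea concrete, here is a minimal sketch of how a SMILES string can be split into chemically meaningful tokens before it is fed to a language model. The regular expression is a simplified pattern chosen for illustration, and the function name `tokenize_smiles` is hypothetical. It is not necessarily how the project's pretrained models tokenize their input.

```python
import re

# Simplified pattern (an assumption, not the project's tokenizer): bracket atoms,
# two-letter halogens, organic-subset atoms, aromatic atoms, two-digit ring
# closures, single-digit ring closures, then bonds, branches, and other symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens

print(tokenize_smiles("CCO"))                    # ethanol -> ['C', 'C', 'O']
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Matching `Br` and `Cl` before the single-letter atoms keeps bromine and chlorine from being split into two tokens.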

🔧 Methodology

The approach combines Large Language Models (LLMs) and Generative Adversarial Networks (GANs) across two pipelines: Data Generation and Data Prediction.

📊 Data Collection

The team sourced 1,883 AI molecules from PubChem, noting properties like molecular weight and hydrophobicity. Limited data availability posed a challenge, addressed through augmentation.

Architectures

  1. LLM Fine-Tuning: ChemBERTa, pretrained on 77 million SMILES from PubChem, was fine-tuned to predict AI/AO properties.
  2. GAN + LLM Hybrid: GPT-2_zinc_87m (pretrained on 480 million SMILES from ZINC) generates molecules, while a discriminator ensures validity, a fresh take on molecular design.

Pipelines

  1. Data Generation Pipeline:
    • Data Augmentation Layer (DAL): Expands the dataset to 11,503 molecules via tautomers and permutations.
    • GAN + LLM: Creates and filters new molecules.
    • Linear Discriminant Analysis (LDA): Keeps the most relevant features.
  2. Data Prediction Pipeline: Fine-tunes ChemBERTa to predict AI/AO properties, enhancing the dataset further.
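The generate-then-filter idea behind the GAN + LLM step can be sketched as follows. This is a minimal illustration, not the project's implementation: `generate_and_filter`, the `generator` and `discriminator` callables, the toy stand-ins, and the 0.5 threshold are all hypothetical. In the project, the generator role is played by GPT-2_zinc_87m and the filter by a trained discriminator.

```python
from typing import Callable, Iterable

def generate_and_filter(
    generator: Callable[[], Iterable[str]],
    discriminator: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the source
) -> list[str]:
    """Keep unique candidates that the discriminator scores at or above threshold."""
    kept: list[str] = []
    seen: set[str] = set()
    for smiles in generator():
        if smiles in seen:
            continue  # skip duplicates so each candidate is scored once
        seen.add(smiles)
        if discriminator(smiles) >= threshold:
            kept.append(smiles)
    return kept

# Toy stand-ins (hypothetical): a fixed candidate list and a scorer that only
# checks balanced parentheses, which is far weaker than a trained discriminator.
def toy_generator() -> list[str]:
    return ["CCO", "CC(=O", "CCO", "c1ccccc1"]

def toy_discriminator(smiles: str) -> float:
    return 1.0 if smiles.count("(") == smiles.count(")") else 0.0

print(generate_and_filter(toy_generator, toy_discriminator))  # ['CCO', 'c1ccccc1']
```

The sketch covers only filtering at generation time. In an actual GAN, the discriminator's score also serves as a training signal for the generator.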

📈 Experimental Results

The key findings are summarized in the two tables below.

Data Augmentation Layer (DAL)

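| Metric              | Initial Dataset | Tautomers Only | Permutations Only | Tautomers + Permutations |
|---------------------|-----------------|----------------|-------------------|--------------------------|
| Number of Molecules | 1,883           | 3,359          | 10,015            | 11,503                   |
| Validity            | 100%            | 100%           | 100%              | 100%                     |
| Originality         | 0%              | 44%            | 81%               | 84%                      |
| Diversity           | 0.91            | 0.90           | 0.88              | 0.89                     |
| Drug-likeness       | 79%             | 74%            | 72%               | 72%                      |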

Note: The DAL grew the dataset roughly sixfold (from 1,883 to 11,503 molecules) while keeping validity at 100%; drug-likeness dipped from 79% to 72%.
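The originality figures appear consistent with a simple definition: the share of molecules in each augmented set that were not in the initial 1,883. That definition is an assumption inferred from the reported numbers, not one stated in the source, and it further assumes that every initial molecule is retained after augmentation. The sketch below checks it against the table.

```python
INITIAL = 1883  # molecules in the initial dataset, as reported

# Dataset sizes reported in the DAL table.
sizes = {
    "Initial Dataset": 1883,
    "Tautomers Only": 3359,
    "Permutations Only": 10015,
    "Tautomers + Permutations": 11503,
}

def originality(total: int, initial: int = INITIAL) -> float:
    """Fraction of molecules that are new relative to the initial set (assumed definition)."""
    return (total - initial) / total

for name, total in sizes.items():
    print(f"{name}: {originality(total):.0%}")
# Initial Dataset: 0%
# Tautomers Only: 44%
# Permutations Only: 81%
# Tautomers + Permutations: 84%
```

Under this reading, originality rises mechanically with dataset size, so it is best read alongside the diversity and drug-likeness rows.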

Model Performance

| Metric        | LLM Alone | GAN + LLM          |
|---------------|-----------|--------------------|
| Validity      | 16%       | 100%               |
| Originality   | 100%      | 100%               |
| Diversity     | 0.56      | 0.85               |
| Drug-likeness | 16%       | 95%                |
| Accuracy      | 9%        | 62% (76% with DAL) |

Note: The GAN + LLM hybrid reaches 100% validity and 95% drug-likeness, compared with 16% for each with the LLM alone, and its accuracy rises from 62% to 76% with DAL support.
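The source does not say how diversity is computed. One common choice, assumed here, is one minus the mean pairwise Tanimoto similarity between molecular fingerprints. The sketch below uses character bigrams of the SMILES string as a crude stand-in for a real chemical fingerprint, so its numbers are not comparable to those in the tables. The function names are illustrative.

```python
from itertools import combinations

def bigram_fingerprint(smiles: str) -> frozenset[str]:
    """Crude stand-in fingerprint (an assumption): the set of adjacent character pairs."""
    return frozenset(smiles[i:i + 2] for i in range(len(smiles) - 1))

def tanimoto(a: frozenset[str], b: frozenset[str]) -> float:
    """Tanimoto (Jaccard) similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def diversity(smiles_list: list[str]) -> float:
    """One minus the mean pairwise Tanimoto similarity (assumed definition)."""
    fps = [bigram_fingerprint(s) for s in smiles_list]
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 0.0  # fewer than two molecules: no pairwise diversity to measure
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity

print(diversity(["CCO", "CCO"]))  # identical strings -> 0.0
print(diversity(["CCO", "NNS"]))  # no shared bigrams -> 1.0
```

With this definition, values near 1 indicate a set whose members share few features, and values near 0 indicate near-duplicates.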

🌟 Why It Matters

This project aims to speed up biomolecule discovery for the pharmaceutical, agrochemical, and cosmetics industries, offering a scalable framework built on its GAN + LLM hybrid.

🚧 Challenges and Future Directions

Data scarcity, a prediction accuracy that peaks at 76%, and the complexity of biological interactions remain hurdles that call for further validation. Next steps could involve larger datasets, reinforcement learning, or even quantum computing for better efficiency.

