Project 1 - Quentin Velard

Project Details

This project, authored by Quentin Velard and Salma Bouaouda under Guenael Cabanes at École des Mines de Nancy, leverages deep learning to generate biomolecules with anti-inflammatory (AI) and antioxidant (AO) properties. This work supports the "Biomolecules 4 Bioeconomy" framework, with potential applications in pharmaceuticals, agrochemicals, and cosmetics.

Objectives

The project focuses on two main goals:

State-of-the-Art Review: Analyze existing deep learning methods for molecular generation and identify pretrained models to adapt.
Model Training: Develop neural networks using transfer learning to create biomolecules with AI and AO properties.

With a molecular space of approximately \(10^{60}\) possibilities and limited open-source data, traditional discovery is slow—deep learning aims to change that!

🗝️ Key Concepts Explained

Anti-inflammatory (AI) Biomolecules: Reduce inflammation by targeting mediators such as cytokines. Examples include curcumin and NSAIDs.
Antioxidant (AO) Biomolecules: Neutralize free radicals to prevent damage (e.g., vitamins C and E, BHT).
SMILES Notation: Represents molecular structures as text (e.g., "CCO" for ethanol) for computational use.
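
To make the SMILES idea concrete, here is a minimal sketch of how such a string can be split into tokens before being fed to a language model. The regular expression is a simplified, illustrative pattern and not the tokenizer used by ChemBERTa or GPT-2_zinc_87m; the function name `tokenize_smiles` is hypothetical.

```python
import re

# Simplified SMILES token pattern (illustrative only, not a full grammar):
# bracket atoms, two-letter halogens, single-letter aliphatic and aromatic
# atoms, two-digit ring closures (%nn), single ring digits, bonds and branches.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-+\\/().:~@*$]"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any character is left over."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


print(tokenize_smiles("CCO"))                    # ethanol
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, an NSAID
```

Lowercase letters mark aromatic atoms, digits open and close rings, and parentheses mark branches, which is why a plain text model can learn molecular structure from these strings.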

🔧 Methodology

The approach combines Large Language Models (LLMs) and Generative Adversarial Networks (GANs) across two pipelines: Data Generation and Data Prediction.

📊 Data Collection

The team sourced 1,883 AI molecules from PubChem, noting properties like molecular weight and hydrophobicity. Limited data availability posed a challenge, addressed through augmentation.

Architectures

LLM Fine-Tuning: ChemBERTa, pretrained on 77 million SMILES from PubChem, was fine-tuned to predict AI/AO properties.
GAN + LLM Hybrid: GPT-2_zinc_87m (pretrained on 480 million SMILES from ZINC) generates molecules, while a discriminator ensures validity—a fresh take on molecular design!

Pipelines

Data Generation Pipeline:
- Data Augmentation Layer (DAL): Expands the dataset to 11,503 molecules via tautomers and permutations.
- GAN + LLM: Creates and filters new molecules.
- Linear Discriminant Analysis (LDA): Keeps the most relevant features.
Data Prediction Pipeline: Fine-tunes ChemBERTa to predict AI/AO properties, enhancing the dataset further.
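
The "create and filter" step of the generation pipeline can be sketched as a simple loop. This is only an illustration under stated assumptions: the sample list stands in for generator output, and `looks_like_smiles` is a hypothetical, purely syntactic check that replaces the trained discriminator described above. It does not test chemical validity.

```python
from typing import Callable, Iterable


def looks_like_smiles(smiles: str) -> bool:
    """Crude syntactic check: balanced parentheses/brackets, paired ring digits.

    Assumption: this stands in for the project's discriminator. A real
    pipeline would rely on a cheminformatics toolkit or a trained model.
    """
    depth = 0
    in_bracket = False
    open_rings: set[str] = set()
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            continue  # digits inside [...] are isotopes or charges, not rings
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens a ring, second closes it
    return bool(smiles) and depth == 0 and not in_bracket and not open_rings


def generate_and_filter(
    candidates: Iterable[str],
    known: set[str],
    accept: Callable[[str], bool],
) -> list[str]:
    """Keep candidates that pass `accept`, are not already known, and are unique."""
    kept: list[str] = []
    seen = set(known)
    for smi in candidates:
        if smi not in seen and accept(smi):
            kept.append(smi)
            seen.add(smi)
    return kept


# Hypothetical generator samples standing in for GPT-2_zinc_87m output.
samples = ["CCO", "c1ccccc1", "CC(=O", "C1CC", "CCO", "CC(=O)O"]
training = {"CCO"}
print(generate_and_filter(samples, training, looks_like_smiles))
```

Note that novelty is checked on raw strings here, so a different spelling of a known molecule (for example "OCC" for ethanol) would count as new. Comparing canonical SMILES would avoid that.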

📈 Experimental Results

Check out the key findings below in two handy tables!

Data Augmentation Layer (DAL)

Metric	Initial Dataset	Tautomers Only	Permutations Only	Tautomers + Permutations
Number of Molecules	1,883	3,359	10,015	11,503
Validity	100%	100%	100%	100%
Originality	0%	44%	81%	84%
Diversity	0.91	0.90	0.88	0.89
Drug-likeness	79%	74%	72%	72%

Note: The DAL grew the dataset roughly sixfold (1,883 to 11,503 molecules) while keeping validity at 100%. Diversity stayed close to 0.9, though drug-likeness dipped from 79% to 72%.
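
The originality figures in the table can be reproduced from the molecule counts alone, assuming originality means the share of molecules in the augmented set that are not in the initial dataset, and that every augmented set still contains all 1,883 initial molecules. The source does not state this definition; it is inferred here because it matches the reported 44%, 81%, and 84%.

```python
INITIAL_SIZE = 1883  # molecules in the initial dataset (from the table)

AUGMENTED_SIZES = {
    "Tautomers Only": 3359,
    "Permutations Only": 10015,
    "Tautomers + Permutations": 11503,
}


def originality(total: int, initial: int = INITIAL_SIZE) -> float:
    """Share of an augmented set that is not in the initial set (assumed definition)."""
    return (total - initial) / total


for name, total in AUGMENTED_SIZES.items():
    growth = total / INITIAL_SIZE
    print(f"{name}: {originality(total):.0%} original, {growth:.1f}x size")
```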

Model Performance

Metric	LLM Alone	GAN + LLM
Validity	16%	100%
Originality	100%	100%
Diversity	0.56	0.85
Drug-likeness	16%	95%
Accuracy	9%	62% (76% with DAL)

Note: The GAN + LLM hybrid shines with 100% validity and 95% drug-likeness, hitting 76% accuracy with DAL support.
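
The tables report diversity without defining it. One common choice for generated molecule sets is internal diversity: one minus the mean pairwise Tanimoto similarity between fingerprints. The sketch below assumes that definition and uses character n-grams as a toy stand-in for real molecular fingerprints, so its numbers are not comparable to the values reported above. All function names are hypothetical.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def toy_fingerprint(smiles: str, n: int = 2) -> set:
    """Character n-grams; a stand-in for a real molecular fingerprint."""
    return {smiles[i:i + n] for i in range(max(len(smiles) - n + 1, 1))}


def internal_diversity(smiles_list: list[str]) -> float:
    """1 - mean pairwise similarity; 0.0 when fewer than two molecules."""
    fps = [toy_fingerprint(s) for s in smiles_list]
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 0.0
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


print(internal_diversity(["CCO", "c1ccccc1", "CC(=O)O"]))
```

A score near 1 means the molecules share few features, while a score near 0 means the set is made of near-duplicates.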

🌟 Why It Matters

This project speeds up biomolecule discovery for critical industries, offering a scalable, innovative framework with its GAN+LLM hybrid.

🚧 Challenges and Future Directions

Data scarcity, a prediction accuracy that tops out at 76%, and complex biological interactions remain hurdles, and the results need further validation. Next steps could involve bigger datasets, reinforcement learning, or even quantum computing for better efficiency.


Biomolecule Generation with Antioxidant and Anti-inflammatory properties

Deep Learning for Computational Biology