Project Details
This project, by Quentin Velard and Salma Bouaouda under the supervision of Guenael Cabanes at École des Mines de Nancy, uses deep learning to generate biomolecules with anti-inflammatory (AI) and antioxidant (AO) properties. The work supports the "Biomolecules 4 Bioeconomy" framework, with potential applications in pharmaceuticals, agrochemicals, and cosmetics.
Objectives
The project focuses on two main goals:
- State-of-the-Art Review: Analyze existing deep learning methods for molecular generation and identify pretrained models to adapt.
- Model Training: Develop neural networks using transfer learning to create biomolecules with AI and AO properties.
With a molecular space of approximately \(10^{60}\) possibilities and limited open-source data, traditional discovery is slow; deep learning aims to change that!
🗝️ Key Concepts Explained
- Anti-inflammatory (AI) Biomolecules: Reduce inflammation by targeting mediators such as cytokines. Examples include curcumin and NSAIDs.
- Antioxidant (AO) Biomolecules: Neutralize free radicals to prevent damage (e.g., vitamins C and E, BHT).
- SMILES Notation: Represents molecular structures as text (e.g., "CCO" for ethanol) for computational use.
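To make the "molecules as text" idea concrete, here is a minimal sketch of how a SMILES string might be split into tokens before a language model sees it. The regular expression is a simplified, commonly used pattern and the `tokenize` helper is hypothetical; it is not necessarily the tokenizer used by ChemBERTa or by this project.

```python
import re

# Simplified SMILES token pattern (an assumption for illustration):
# bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# bond/branch symbols, two-digit ring closures (%NN), single-digit ring closures.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br?|Cl?|[NOSPFI]|[bcnosp]|[()=#\-+\\/:~@?>*$.]|%[0-9]{2}|[0-9]"
)

def tokenize(smiles):
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens

ethanol_tokens = tokenize("CCO")            # ethanol, as in the example above
acetyl_chloride_tokens = tokenize("CC(=O)Cl")
```

Note that "Cl" is kept as one token rather than split into "C" and "l", which is the main reason a character-level split is not enough for SMILES.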
🔧 Methodology
The approach combines Large Language Models (LLMs) and Generative Adversarial Networks (GANs) across two pipelines: Data Generation and Data Prediction.
📊 Data Collection
The team sourced 1,883 AI molecules from PubChem, noting properties like molecular weight and hydrophobicity. Limited data availability posed a challenge, addressed through augmentation.
Architectures
- LLM Fine-Tuning: ChemBERTa, pretrained on 77 million SMILES from PubChem, was fine-tuned to predict AI/AO properties.
- GAN + LLM Hybrid: GPT-2_zinc_87m (pretrained on 480 million SMILES from ZINC) generates molecules, while a discriminator ensures validity—a fresh take on molecular design!
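The generate-then-filter idea behind the hybrid can be pictured with a small sketch. Everything here is an assumption for illustration: `filter_candidates` and `toy_score` are hypothetical names, and `toy_score` is only a crude syntactic stand-in (balanced parentheses) for a trained discriminator, which the project would replace with its own model.

```python
from typing import Callable, Iterable, List

def filter_candidates(
    candidates: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep unique candidates whose score reaches the threshold, in input order."""
    kept, seen = [], set()
    for smiles in candidates:
        if smiles in seen:
            continue  # drop exact duplicates produced by the generator
        seen.add(smiles)
        if score(smiles) >= threshold:
            kept.append(smiles)
    return kept

def toy_score(smiles: str) -> float:
    """Hypothetical discriminator stand-in: 1.0 if non-empty with balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return 0.0  # closing parenthesis before any opening one
    return 1.0 if smiles and depth == 0 else 0.0

# Pretend these strings came from the generator (made-up examples).
candidates = ["CCO", "CC(=O)O", "CC(=O", "CCO", "C)C("]
kept = filter_candidates(candidates, toy_score)
```

Passing the scorer in as a callable keeps the filtering loop independent of whichever discriminator is actually trained.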
Pipelines
- Data Generation Pipeline:
  - Data Augmentation Layer (DAL): Expands the dataset to 11,503 molecules via tautomers and permutations.
  - GAN + LLM: Creates and filters new molecules.
  - Linear Discriminant Analysis (LDA): Keeps the most relevant features.
- Data Prediction Pipeline: Fine-tunes ChemBERTa to predict AI/AO properties, enhancing the dataset further.
📈 Experimental Results
Check out the key findings below in two handy tables!
Data Augmentation Layer (DAL)
Metric | Initial Dataset | Tautomers Only | Permutations Only | Tautomers + Permutations |
---|---|---|---|---|
Number of Molecules | 1,883 | 3,359 | 10,015 | 11,503 |
Validity | 100% | 100% | 100% | 100% |
Originality | 0% | 44% | 81% | 84% |
Diversity | 0.91 | 0.90 | 0.88 | 0.89 |
Drug-likeness | 79% | 74% | 72% | 72% |
Note: The DAL grew the dataset roughly sixfold (from 1,883 to 11,503 molecules) while keeping validity at 100%, though drug-likeness dipped from 79% to 72%.
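The originality row is consistent with one simple reading: the share of molecules in each augmented set that were not in the initial dataset. The sketch below checks that arithmetic against the table. It assumes that each augmented set keeps all 1,883 initial molecules and contains no duplicates; the page does not state the exact definition used, so treat this as a plausibility check rather than the project's formula.

```python
N_INITIAL = 1883  # size of the initial dataset, from the table

def originality(n_initial, n_total):
    """Fraction of the augmented set that is new relative to the initial set."""
    return (n_total - n_initial) / n_total

# Dataset sizes taken from the "Number of Molecules" row.
sizes = {
    "Initial Dataset": 1883,
    "Tautomers Only": 3359,
    "Permutations Only": 10015,
    "Tautomers + Permutations": 11503,
}

# Originality as a whole-number percentage for each column.
originality_pct = {
    name: round(100 * originality(N_INITIAL, n)) for name, n in sizes.items()
}
```

Under these assumptions the computed percentages come out to 0, 44, 81, and 84, matching the originality row.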
Model Performance
Metric | LLM Alone | GAN + LLM |
---|---|---|
Validity | 16% | 100% |
Originality | 100% | 100% |
Diversity | 0.56 | 0.85 |
Drug-likeness | 16% | 95% |
Accuracy | 9% | 62% (76% with DAL) |
Note: The GAN + LLM hybrid shines with 100% validity and 95% drug-likeness, hitting 76% accuracy with DAL support.
🌟 Why It Matters
This project aims to speed up biomolecule discovery for the pharmaceutical, agrochemical, and cosmetics industries, offering a scalable framework built on its GAN + LLM hybrid.
🚧 Challenges and Future Directions
Data scarcity, a prediction accuracy that tops out at 76%, and the complexity of biological interactions remain hurdles, and the results need further validation. Next steps could involve bigger datasets, reinforcement learning, or even quantum computing for better efficiency.