Explainable AI for Pharmacophore-Based Drug Activity Prediction
Joanna Ceklarz
Abstract
Pharmacophore representations are commonly used by medicinal chemists to identify and visualize structures necessary for biological function. Using them as representations of molecules for Graph Neural Network (GNN) training has as yet untapped potential in deep learning. Deep learning architectures, to which GNNs belong, despite their excellent performance in many areas, come with one critical disadvantage: little or no transparency about how they reach their results. Many GNN-specific methods have been developed to answer questions about both feature and structural importance. When combined with the proposed pharmacophore representations, these methods could provide valuable insights to model users. Their application to chemical data, however, remains largely unexplored. We have developed two GNN models, a Graph Convolutional Network and a Graph Isomorphism Network, trained on 2D pharmacophore representations of small molecules, for drug activity prediction. We compare the results against shallow models and against GNNs trained on traditionally used atomic representations of molecules. Using a selection of techniques, we aim to explain the results of such models. We plan to obtain both local (molecule-level) and global (model-level) explanations, allowing us to analyse individual predictions as well as overarching model behaviour, to help identify the sources of errors and refine our models accordingly.
Respirium
poster
Wed-Thu
9:00 - 18:00
Generative AI Pipeline for de novo Design of Pseudo-Natural-Products
Son Ha
Abstract
Natural products (NPs) have long served as rich sources of biologically active compounds, with many successful drugs originating from them. Pseudo-Natural-Products (PNPs) are novel NP-inspired compound classes which combine the biological relevance of NPs with the efficient exploration of chemical space by fragment-based compound design. In this talk, we describe our progress and results in building a Generative AI modelling pipeline for de novo design of PNPs.
Respirium
poster
Wed-Thu
9:00 - 18:00
pyPept: a python library to support the modelling of peptides
Thomas Fox
Abstract
We have developed a Python library, pyPept, that supports the modeling of peptides in our drug discovery projects. It relies heavily on RDKit to convert even complex peptide sequences (which may contain non-natural amino acids, non-amino acid residues, cycles and branching) into an atomistic representation and 3D conformers. It is currently the basis of many tools that we employ in the context of peptide design.
Respirium
poster
Wed-Thu
9:00 - 18:00
Practical Molecular Property Prediction and Explainability with MolPipeline
Christian Feldmann
Abstract
MolPipeline extends scikit-learn's machine learning capabilities for chemical compound tasks by leveraging RDKit. We integrated XAI methods from the SHAP library with MolPipeline to enable easy and effective interpretability of property prediction models. The integration includes automatic extraction of chemical information from a model pipeline and communicates explanations by visualizing important contributions on the molecular structure alongside other explanatory information.
Respirium
poster
Wed-Thu
9:00 - 18:00
De Novo Drug Design: Combining Pharmacophore Modeling, 3D Shape Matching, and Generative AI
Jozef Fulop
Abstract
We introduce an open-source pipeline for de novo drug design, integrating pharmacophore modeling, 3D shape matching, and generative AI to enable scaffold hopping and lead discovery. The workflow aligns ligands by shape and pharmacophoric features, filters them for Tanimoto similarity and synthetic accessibility, and employs DrugEx to generate novel chemotypes. We demonstrate its utility with a case study on the chemokine receptor CCR2. A comparison of ROCS/OMEGA and RDKit/CDPKit explores trade-offs in computational speed and optimization.
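The Tanimoto filtering step mentioned above can be sketched in plain Python. This is a minimal illustration, not the pipeline's code: fingerprints are modelled as sets of on-bit indices, and the `low`/`high` similarity window, the function names, and the toy bit sets are all assumptions made for the example.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_by_similarity(reference, candidates, low=0.2, high=0.7):
    """Keep candidates whose similarity to the reference lies in [low, high].

    The window is a hypothetical choice: similar enough to stay relevant,
    different enough to count as a new chemotype.
    """
    kept = []
    for name, bits in candidates.items():
        similarity = tanimoto(reference, bits)
        if low <= similarity <= high:
            kept.append((name, round(similarity, 3)))
    return kept


# Toy fingerprints (made-up bit indices, not real molecules).
reference = frozenset({1, 4, 7, 9, 12})
candidates = {
    "near_duplicate": frozenset({1, 4, 7, 9, 12}),
    "scaffold_hop": frozenset({1, 4, 20, 21, 22}),
    "unrelated": frozenset({30, 31, 32}),
}
selected = filter_by_similarity(reference, candidates)
print(selected)
```

In practice the bit sets would come from a fingerprinting toolkit, and the thresholds would be tuned to the project.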
Respirium
poster
Wed-Thu
9:00 - 18:00
Balancing Data Quantity and Quality: Evaluating Curation Strategies for Bioactivity Prediction Models
Carl Schiebroek
Abstract
Building good models to predict the bioactivity of novel chemical matter remains a challenging task. Accurate models require a training set with a large number of diverse samples and a low level of noise. When extracting data from public databases such as ChEMBL, different levels of curation rigor may be applied, resulting in training sets of varying size, diversity, and, presumably, noise levels. It is not possible to know a priori whether increasing the size of the dataset at the cost of adding more noise improves model generalization. To assess this trade-off, we compare three data curation and modeling approaches: (1) models trained on data for a single target, (2) models trained on target-specific data further restricted to a single set of assay conditions, and (3) multitask models where each assay condition is treated as a separate task. We evaluate these models using leave-assay-out, a surrogate for leave-chemical-matter-out, with graph neural networks (GNNs) and random forests (RFs). We find no meaningful differences between these curation strategies and model types, suggesting that adding more data at the expense of increased variability does not improve generalizability. Additionally, we show that GNN models can display a high seed-dependent variability, highlighting the need for evaluating them using multiple seeds.
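A leave-assay-out split and a multi-seed summary, as described above, might look like the following sketch. The record layout, the assay identifiers, and `toy_evaluate` (a stand-in for training and scoring a model) are hypothetical; only the splitting logic and the mean/spread summary are the point.

```python
import random
import statistics
from collections import defaultdict


def leave_assay_out(records):
    """Yield (held_out_assay, train, test), holding out one assay per fold."""
    by_assay = defaultdict(list)
    for rec in records:
        by_assay[rec["assay"]].append(rec)
    for assay in sorted(by_assay):
        test = by_assay[assay]
        train = [r for r in records if r["assay"] != assay]
        yield assay, train, test


def seed_spread(evaluate, seeds):
    """Run the same evaluation under several seeds; return (mean, std dev)."""
    scores = [evaluate(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


def toy_evaluate(seed):
    """Placeholder for 'train a model with this seed and return its score'."""
    return random.Random(seed).gauss(0.6, 0.05)


# Made-up records: each measurement carries the assay it came from.
records = [
    {"smiles": "CCO", "assay": "A1", "pIC50": 5.1},
    {"smiles": "CCN", "assay": "A1", "pIC50": 5.4},
    {"smiles": "c1ccccc1", "assay": "A2", "pIC50": 6.0},
    {"smiles": "CC(=O)O", "assay": "A3", "pIC50": 4.8},
    {"smiles": "CCCl", "assay": "A3", "pIC50": 5.0},
]
folds = list(leave_assay_out(records))
mean_score, score_sd = seed_spread(toy_evaluate, range(5))
```

Reporting the spread across seeds next to the mean makes it easier to judge whether a difference between two curation strategies exceeds seed noise.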
Respirium
poster
Wed-Thu
9:00 - 18:00
European Chemical Biology Database
Milan Voršilák
Abstract
The European Chemical Biology Database (ECBD, https://ecbd.eu) serves as the central repository for data generated by the EU-OPENSCREEN (EU-OS) research infrastructure consortium.
Respirium
poster
Wed-Thu
9:00 - 18:00
Expanding the Double Cubic Lattice Method for Polar Surface Area and Other Potentially Cool 3D Descriptors
Rachael Pirie
Abstract
The Double Cubic Lattice Method (DCLV) for calculating molecular volume and surface area was introduced by Eisenhaber et al. as a quicker, lower memory alternative to approaches available at the time. For a molecule represented by a dot surface, the DCLV is computed as the name suggests by using two cubic lattices: one to group nearby atoms and the other to group nearby dots to identify where overlap occurs between atoms to prevent double-counting in the volume calculation. At last year’s UGM we presented our RDKit implementation (2024.03 release) of the algorithm to return volume, surface area and Van der Waals volume for small molecules, proteins and protein-ligand complexes along with benchmarking results comparing these to the methods already available in the RDKit. This poster expands on this work, presenting some efficiency improvements to the existing algorithm, as well as investigating the extension of the method to include polar surface area and other potentially useful 3D descriptors derived from the dot surface.
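The dot-surface idea can be illustrated with a deliberately simplified sketch. It is not the RDKit implementation and not the full double-lattice algorithm: it uses a single cubic lattice to find nearby atoms, places dots with a Fibonacci spiral (an assumption of this example), and counts the dots that are not buried inside a neighbouring atom. Function names, the dot count, and the test geometry are illustrative.

```python
import math
from collections import defaultdict


def sphere_dots(n):
    """Roughly uniform points on the unit sphere (Fibonacci spiral)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    dots = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        dots.append((r * math.cos(golden * i), r * math.sin(golden * i), z))
    return dots


def dot_surface_area(atoms, n_dots=500):
    """atoms: list of (x, y, z, radius). Returns the exposed surface area."""
    # Cell edge >= largest possible contact distance, so every overlapping
    # atom sits in the same or an adjacent lattice cell.
    cell = 2.0 * max(a[3] for a in atoms)
    grid = defaultdict(list)
    for idx, (x, y, z, _) in enumerate(atoms):
        key = (math.floor(x / cell), math.floor(y / cell), math.floor(z / cell))
        grid[key].append(idx)

    unit = sphere_dots(n_dots)
    total = 0.0
    for i, (x, y, z, r) in enumerate(atoms):
        cx, cy, cz = (math.floor(v / cell) for v in (x, y, z))
        neighbours = [
            j
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
            for j in grid.get((cx + dx, cy + dy, cz + dz), ())
            if j != i
        ]
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = x + r * ux, y + r * uy, z + r * uz
            buried = any(
                (px - atoms[j][0]) ** 2 + (py - atoms[j][1]) ** 2
                + (pz - atoms[j][2]) ** 2 < atoms[j][3] ** 2
                for j in neighbours
            )
            if not buried:
                exposed += 1
        # Each dot represents an equal share of the atom's sphere area.
        total += 4.0 * math.pi * r * r * exposed / n_dots
    return total


single = dot_surface_area([(0.0, 0.0, 0.0, 1.5)])
overlapping = dot_surface_area([(0.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 1.0)])
```

A polar surface area variant would, under the same scheme, sum only the dots belonging to atoms flagged as polar; how the poster defines that flag is not stated here.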
Respirium
poster
Wed-Thu
9:00 - 18:00
BLINCS: Breadth-first Line Notation for Chemical Structures
Wim Dehaen
Abstract
The usage of line notations in chemical language models (CLMs) has been highly successful. Nonetheless, atoms that are topologically close in the chemical graph are not necessarily close in the SMILES string. A breadth-first traversal of the chemical graph would ensure this distance correspondence better. To investigate this effect, we propose a new line notation with a SMILES-like syntax but a breadth-first graph traversal. This notation is based on the insight that a degree sequence plus ring closure operations can encode any simple graph concisely. Thus, with the addition of atom symbols and bond symbols any chemical graph can be encoded. We compare the properties of this notation with other line notations, and apply it to CLMs. We provide an RDKit-based open-source implementation of a BLINCS parser that can convert molecular graphs from and to BLINCS.
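The motivation can be illustrated on a toy graph by comparing how far apart bonded atoms end up in a depth-first versus a breadth-first atom ordering. The sketch below does not implement BLINCS syntax or its parser; the graph, the function names, and the "maximum positional gap across a bond" measure are assumptions chosen for illustration.

```python
from collections import deque


def dfs_order(graph, start):
    """Depth-first atom order (the traversal style SMILES writers use)."""
    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(reversed(graph[node]))  # visit neighbours in listed order
    return order


def bfs_order(graph, start):
    """Breadth-first atom order."""
    order, seen, queue = [start], {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


def max_bond_gap(graph, order):
    """Largest distance in the ordering between two directly bonded atoms."""
    position = {atom: i for i, atom in enumerate(order)}
    return max(abs(position[a] - position[b]) for a in graph for b in graph[a])


# Toy graph: a central atom (0) carrying three 3-atom chains.
graph = {
    0: [1, 4, 7],
    1: [0, 2], 2: [1, 3], 3: [2],
    4: [0, 5], 5: [4, 6], 6: [5],
    7: [0, 8], 8: [7, 9], 9: [8],
}
dfs = dfs_order(graph, 0)
bfs = bfs_order(graph, 0)
dfs_gap = max_bond_gap(graph, dfs)
bfs_gap = max_bond_gap(graph, bfs)
```

In this example the depth-first order places the last branch far from the central atom it is bonded to, while the breadth-first order keeps the worst-case gap smaller. Other measures, such as the average gap, can behave differently, so this shows only one aspect of the distance correspondence.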
Respirium
poster
Wed-Thu
9:00 - 18:00
ChemPatentizer: Transforming Chemical Patents into Actionable Scientific Data
Riccardo Fusco
Abstract
Chemical patents contain valuable drug discovery data, but extracting it is a nightmare. These documents are often hundreds of pages of scanned images with inconsistent formats and poor-quality chemical structures. We developed ChemPatentizer, a semi-automated tool that finally makes this data usable.
Instead of trying to fully automate everything (which is extremely complicated due to the heterogeneity of patents), our approach smartly combines human expertise with AI-powered structure recognition. Chemists guide the initial steps, then the pipeline automatically converts messy patent data into clean, analyzable tables. This means researchers can now efficiently extract structure-activity relationships from patents and use them for drug design.
We've successfully tested ChemPatentizer on GLP-1 receptor patents, proving it can transform previously inaccessible patent data into a practical resource for drug discovery.
Respirium
poster
Wed-Thu
9:00 - 18:00
OpenMMDL: Building, Simulating, and Analyzing Protein–Ligand Systems in OpenMM
Valerij Talagayev
Abstract
This presentation covers OpenMMDL, a workflow with three parts. OpenMMDL Setup is a web-based GUI that makes preparing OpenMM protein-ligand simulation files easy for beginners: a new user can prepare the files with either PDBFixer or AmberTools, and a step-by-step interface guides the choice of simulation settings, with defaults already in place. OpenMMDL Simulation is the backend that runs the simulation and postprocesses it via MDTraj and MDAnalysis, delivering output that is directly ready for OpenMMDL Analysis. OpenMMDL Analysis tracks stable waters in the simulation, creates protein-ligand interaction fingerprints, and uses them to generate binding modes, which are combinations of interactions, thus showing the most common combinations of interactions between the ligand and the protein.
The lightning talk will give a quick introduction to each of the three parts of the workflow. It will also highlight the implementation of ProLIF for protein-ligand interaction analysis as an additional option alongside the existing PLIP package in the newest release of OpenMMDL, and briefly cover the new additions to ProLIF: water-bridge-mediated interactions, improved visualization, and H-bond interactions from implicit hydrogens. The latter two additions were made as part of Google Summer of Code 2025. Both OpenMMDL and ProLIF use RDKit in various parts of the packages.
https://summerofcode.withgoogle.com/programs/2025/projects/XWsglxQM
https://summerofcode.withgoogle.com/programs/2025/projects/5Otkx8vp
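The notion of a binding mode as a combination of interactions can be sketched as follows. This is not the OpenMMDL Analysis code: the per-frame interaction labels are invented, and the function simply counts how often each distinct combination occurs across frames.

```python
from collections import Counter


def binding_modes(frames):
    """Count each distinct combination of interactions across frames."""
    return Counter(frozenset(frame) for frame in frames)


# Hypothetical per-frame interaction fingerprints (labels are made up).
frames = [
    {"HBond:ASP86", "PiStack:PHE80"},
    {"HBond:ASP86", "PiStack:PHE80"},
    {"HBond:ASP86"},
    {"HBond:ASP86", "PiStack:PHE80"},
    {"WaterBridge:LYS33", "HBond:ASP86"},
]
modes = binding_modes(frames)
top_mode, top_count = modes.most_common(1)[0]
print(sorted(top_mode), top_count / len(frames))
```

Using `frozenset` makes the combination independent of the order in which interactions were recorded within a frame.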
Respirium
poster
Wed-Thu
9:00 - 18:00
Considerations and Challenges in Chemistry Database Search: Addressing Diverse Requirements and Complex Cases
Susan Leung
Abstract
Different organizations operate within distinct chemical landscapes, resulting in varying requirements for chemistry search solutions provided to scientists. Here, we highlight key considerations and challenges identified by comparing chemistry search approaches. Drawing on practical examples—including tautomers, stereoisomers, biopolymers, ambiguous representations, and other edge cases—we present findings on both the impact of search method selection and speed benchmarks measured under realistic load conditions.
Respirium
poster
Wed-Thu
9:00 - 18:00
Advancing Evaluation of Molecular Generators Using Scaffold-Based Metrics
Valeriia Fil
Abstract
Molecular generators are widely used to explore chemical space and propose novel compounds with desired properties. However, evaluating their performance remains challenging due to the structural diversity and scale of the generated molecules. Standard benchmarks do not fully capture the primary objective of molecular generation: the discovery of new biologically active compounds. To address this gap, we introduce scaffold-based metrics that measure a generator’s ability to recover biologically relevant scaffolds absent from the input data. We applied these metrics to several molecular generators, including Molpher and DrugEx. The DrugEx Graph Transformer showed the highest scaffold recall and strong scaffold hopping capabilities. These results demonstrate that scaffold-based metrics offer a more biologically meaningful perspective for evaluating molecular generators and support the development of more effective virtual libraries for drug discovery.
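One plausible reading of scaffold recall is sketched below: the fraction of held-out active scaffolds, absent from the input data, that appear among the generated scaffolds. The exact definition used in the poster may differ. Scaffolds are represented as pre-computed canonical strings, and the example sets are toy data.

```python
def scaffold_recall(generated, held_out_active, training):
    """Share of held-out active scaffolds (not in training) that were generated."""
    targets = set(held_out_active) - set(training)
    if not targets:
        return 0.0
    return len(targets & set(generated)) / len(targets)


# Toy scaffold sets; the strings stand in for canonical scaffold SMILES.
training = {"c1ccccc1"}
held_out_active = {"c1ccncc1", "c1ccc2[nH]ccc2c1", "c1ccccc1"}
generated = {"c1ccncc1", "C1CCCCC1"}

recall = scaffold_recall(generated, held_out_active, training)
print(recall)
```

Removing training scaffolds from the target set is what ties the metric to novelty: a generator gets no credit for reproducing scaffolds it has already seen.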
Respirium
poster
Wed-Thu
9:00 - 18:00
Introduction to OpenBind
Ed Griffen
Abstract
Poster showing what the OpenBind Consortium is hoping to deliver and how. https://www.gov.uk/government/news/uk-to-become-world-leader-in-drug-discovery-as-technology-secretary-heads-for-london-tech-week you might be able to access the data.
Respirium
poster
Wed-Thu
9:00 - 18:00
EasyDock 1.0: customizable and scalable docking tool
Guzel Minibaeva
Abstract
EasyDock 1.0 is an open-source and scalable Python-based tool for fully automated molecular docking. The current version supports popular docking programs, namely AutoDock Vina, gnina, and smina. The tool automatically prepares ligands by removing salts, generating initial conformers and stereoisomers with RDKit, and performing protonation with the open-source program MolGpKa. Ring sampling is implemented to improve docking of molecules containing saturated ring systems. All input data, settings, and results are stored in an SQLite database, enabling interrupted jobs to be resumed. EasyDock integrates Dask for distributed computation across multiple machines. A built-in model predicts docking times to optimize task scheduling and reduce total runtime. Special cases, such as boron-containing molecules, are handled by temporarily substituting boron with carbon during the docking process. The ProLIF package is integrated to calculate protein-ligand interactions. The current version is composed entirely of open-source modules.
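The resume-from-database and time-aware scheduling ideas can be sketched with the standard `sqlite3` module. The table layout, function names, the use of SMILES length as a stand-in for the predicted docking time, and `fake_dock` are all assumptions of this example and are not EasyDock's schema or API.

```python
import sqlite3


def init_db(conn, smiles_list):
    """Create the table and register molecules; re-running adds no duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mols ("
        "id INTEGER PRIMARY KEY, smiles TEXT UNIQUE, "
        "predicted_time REAL, score REAL)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO mols (smiles, predicted_time) VALUES (?, ?)",
        # Hypothetical time model: longer SMILES = slower docking.
        [(s, float(len(s))) for s in smiles_list],
    )
    conn.commit()


def pending(conn):
    """Molecules without a score yet, slowest predicted first."""
    query = (
        "SELECT id, smiles FROM mols WHERE score IS NULL "
        "ORDER BY predicted_time DESC, id"
    )
    return conn.execute(query).fetchall()


def run(conn, dock, limit=None):
    """Dock pending molecules; commit after each so a crash loses no results."""
    done = 0
    for mol_id, smiles in pending(conn):
        if limit is not None and done >= limit:
            break
        conn.execute("UPDATE mols SET score = ? WHERE id = ?", (dock(smiles), mol_id))
        conn.commit()
        done += 1
    return done


def fake_dock(smiles):
    """Placeholder for a real docking call."""
    return -0.1 * len(smiles)


conn = sqlite3.connect(":memory:")
init_db(conn, ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"])
first_pass = run(conn, fake_dock, limit=1)   # simulate an interrupted job
left_after_interrupt = pending(conn)
second_pass = run(conn, fake_dock)           # resume: only unfinished work runs
left_at_end = pending(conn)
```

Scheduling the slowest molecules first is one common way to keep workers evenly loaded near the end of a run; whether EasyDock orders tasks this way is not stated in the abstract.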
Respirium
poster
Wed-Thu
9:00 - 18:00
Advancing Drug-Likeness Prediction by Integrating PC and ADMET Parameters
Levon Kharatyan
Abstract
Drug discovery is expensive and time-consuming, so computational tools are used to help prioritize promising molecules. One such tool is drug-likeness, which evaluates whether a compound has features typical of successful drugs. Early rules like Lipinski’s Rule of 5 and later additions (e.g., Veber’s, Ghose’s, and the Golden Triangle) offered basic filters. In 2012, QED introduced a scoring system based on eight chemical properties, but it sometimes fails to clearly separate good drug candidates from poor ones.
To address this, we present HADES (Holistic ADMET-based Drug-likeness Estimation Score), a new model that combines chemical and ADMET properties of molecules to provide a more accurate measure of drug-likeness using 231 features. HADES is a stacking ensemble model composed of five algorithms: Random Forest, CatBoost, LightGBM, HistGradientBoosting, and XGBoost. The final score is the simple average of the predictions from these models.
HADES outperforms existing methods across multiple benchmarks. It assigns progressively higher scores to molecules across clinical trial phases (from preclinical to Phase 4), aligning with real-world drug development. In structure-activity relationship (SAR) studies, HADES tends to favor improved analogs, reflecting medicinal chemistry intuition.
We also tested HADES on newly approved oral drugs, where it assigned consistently high scores. In screening libraries for three specific drug targets, it achieved up to a 41% enrichment factor. Stress tests confirmed its reliability: HADES consistently gave lower scores to known small-ring-containing molecules that are clearly non-drug-like, toxic compounds, and chemically ‘odd’ structures, including benzene and its analogs, as well as their derivatives. It also performed well on acute oral toxicity datasets in mice.
Overall, HADES offers a practical, reliable tool for early drug screening, helping researchers focus on compounds with the best chance of success.
Respirium
poster
Wed-Thu
9:00 - 18:00
scikit-fingerprints = RDKit + scikit-learn for accelerated ML in chemoinformatics
Jakub Adamczyk
Abstract
scikit-fingerprints (https://github.com/scikit-fingerprints/scikit-fingerprints) is a scikit-learn compatible library for computing molecular fingerprints, molecular filters, distance measures, applicability domain algorithms, and more, on top of RDKit. It accelerates machine learning (ML) workflows in chemoinformatics by integrating those two software ecosystems, offering a unified, Pythonic interface over RDKit functionalities. It is the most mature project of this type, featuring the widest functionality range, distribution via PyPI, and comprehensive documentation. Furthermore, it has been applied in multiple projects, including peptide property prediction, agrochemistry, and integration into the BayBE experiment design framework.
Respirium
poster
Wed-Thu
9:00 - 18:00
Targeted and Technology-driven profiling of the Molport Chemical Space
Andrea Altieri
Abstract
The design and development of compound libraries have undergone a remarkable evolution, paralleling advances in medicinal chemistry and high-throughput screening technologies. From the early days of simple aggregations of available chemicals (sourced from academia or historical compound collections) to the era of combinatorial chemistry, the field has steadily shifted toward more purpose-driven collections. Modern libraries now emphasize drug-like properties, target-oriented design, natural product-inspired scaffolds, and structural diversity to increase the likelihood of identifying biologically relevant hits with better potential to progress into clinical candidates.
This talk will trace this progression, highlighting how the philosophy behind library design has matured—from quantity-driven approaches to quality-focused strategies. As screening demands have grown, so too has the need to aggregate vast chemical inventories from multiple suppliers, ensuring both compound availability and consistency in data. This necessity gave rise to MolPort, a publicly available centralized platform built to unify global compound availability. With over 5 million unique small molecules sourced from numerous providers, MolPort represents a comprehensive, real-world compound database specifically geared for screening applications.
We will explore how such large-scale aggregation not only facilitates efficient compound sourcing but also enables in-depth cheminformatics analyses. Using the MolPort database as a case study, we will examine its drug-like property distributions, structural diversity metrics, and chemical space coverage. These analyses demonstrate how aggregated libraries can be both broad and balanced, bridging synthetic accessibility, commercial availability, and medicinal chemistry relevance.
Ultimately, this talk will underscore the strategic importance of well-curated compound libraries in modern drug discovery—and how platforms like MolPort are transforming the way screening campaigns are conceived, designed, and executed.
We have developed a novel retrosynthesis approach that constructs the synthesis tree in reverse—not from the target molecule to buyable building blocks, but starting from the building blocks and working forward. This is achieved by enumerating the space of possible synthons and identifying combinations that can be assembled into the target molecule. By focusing solely on complete retrosynthetic routes, our method avoids spending computational resources on incomplete or unproductive pathways.
On the USPTO-50K benchmark, our approach achieves a top-1 accuracy of 75.6%. It is important to note that the method implicitly restricts the search space to a predefined database of building blocks, rather than exploring all theoretically possible precursors.
Respirium
poster
Wed-Thu
9:00 - 18:00
De novo generation based on CReM framework and 3D pharmacophores
Dinesh Kumar Sriramulu
Abstract
We will present a tool for de novo generation guided by 3D pharmacophores, in which starting fragments are grown using the CReM framework to match all pharmacophore features of a query. Unlike other tools, our approach explicitly generates conformers that match a pharmacophore model. The generated structures have high docking scores and favorable drug-like properties, which can be controlled by the user. Synthetic accessibility of the generated structures is flexibly controlled by CReM settings.