The earliest stage of drug discovery is governed by a simple constraint: there are far more possible drug-like molecules than any pharmaceutical laboratory could ever test. A new deep learning system, reported in the International Journal of Reasoning-based Intelligent Systems, offers a way to speed up research and could unblock industry bottlenecks.
Bringing a new pharmaceutical to market can take more than a decade and cost billions of dollars in research and development, testing, regulatory compliance, and marketing. A large share of that investment is spent on identifying compounds that bind to biological targets. These targets are commonly proteins involved in disease, whether found in a pathogen or in our own bodies. Virtual screening, an in silico approach, has for decades used computer models to predict which molecules from a library of candidates might be suitable for testing in vitro (in the laboratory) and ultimately in vivo (in animals, then humans).
Established methods fall into two categories. The first comprises receptor-based approaches, such as molecular docking, which simulate how a molecule fits into a protein’s three-dimensional binding site and estimate the strength of the binding between the two. The accuracy of this approach depends on high-quality protein structures and relies on simplified scoring formulae. The second, the ligand-based approach, instead looks for compounds that resemble known active molecules, using predefined chemical features, or descriptors.
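The ligand-based idea can be illustrated with a minimal sketch. It assumes each molecule has already been reduced to a set of descriptors and ranks candidates by Tanimoto similarity to a known active. The descriptor names and the three candidates are hypothetical, invented for illustration; real systems typically use much larger, computed fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of descriptors."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def rank_by_similarity(query, library):
    """Return (name, similarity) pairs, most similar to the query first."""
    scores = {name: tanimoto(query, feats) for name, feats in library.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical descriptor sets, for illustration only.
known_active = {"aromatic_ring", "hydroxyl", "amide", "halogen"}
library = {
    "candidate_A": {"aromatic_ring", "hydroxyl", "amide"},
    "candidate_B": {"aromatic_ring", "halogen", "nitro", "ester"},
    "candidate_C": {"alkyl_chain", "ester"},
}

ranking = rank_by_similarity(known_active, library)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The sketch also shows the limitation discussed in this article: the ranking can only reflect whichever descriptors a person chose to include.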
These techniques can be computationally efficient and have successfully led to many of the pharmaceuticals on the market today. However, they rely heavily on prior knowledge and expert assumptions. In both cases, human-designed rules limit how much chemical complexity can be captured. The advent of deep learning is opening up a new approach.
Deep learning is a form of machine learning that uses multi-layered neural networks to detect patterns directly from raw data. Instead of relying on manual feature selection, it can treat drug candidate molecules as graphs, with atoms as nodes and chemical bonds as edges. A graph neural network updates each atom’s representation based on its neighbours, allowing the model to learn subtle structural relationships.
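The neighbour-update idea can be sketched in a few lines. The example below is a toy, not the architecture described in the paper: it uses the heavy atoms of ethanol (C-C-O) as the graph, a hypothetical one-hot encoding of element type as the starting features, and a plain average in place of the learned weights and non-linearities a real graph neural network would use.

```python
# Toy molecular graph: heavy atoms of ethanol, C-C-O.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # pairs of atom indices

# Hypothetical starting features: one-hot over the element types present.
ELEMENTS = ["C", "O"]


def one_hot(symbol):
    return [1.0 if symbol == e else 0.0 for e in ELEMENTS]


def neighbours(n_atoms, bonds):
    """Build an adjacency list from an undirected bond list."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj


def message_passing_step(features, adj):
    """Replace each atom's vector with the mean of its own and its neighbours' vectors."""
    updated = []
    for i, own in enumerate(features):
        group = [own] + [features[j] for j in adj[i]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated


features = [one_hot(a) for a in atoms]
adj = neighbours(len(atoms), bonds)
after_one = message_passing_step(features, adj)
after_two = message_passing_step(after_one, adj)

print(after_one)
print(after_two)
```

After one step the first carbon still knows nothing about the oxygen, which is two bonds away. After two steps its vector carries an oxygen component, showing how repeated updates let information travel across the molecule.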
Crucially, the new approach uses a second information channel in addition to the graph: the drug candidate’s SMILES string, a compact text-based representation of a molecule’s chemical structure. By using the structural and sequential representations together, the researchers could improve performance significantly. In tests on standard public benchmarks, the model achieved a score of 0.889. The score measures how well the system distinguishes between active and inactive drug candidates: 1 represents perfect prediction, whereas 0.5 is no better than a 50:50 guess. The system could also screen one million molecules in a quarter of an hour, which is 80 per cent faster than conventional approaches.
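A score where 1 is perfect and 0.5 is a coin toss matches the behaviour of the area under the ROC curve, although this article does not name the metric, so treating it as such is an assumption. Under that assumption, the sketch below computes the score as the fraction of active/inactive pairs in which the active molecule receives the higher prediction. The labels and predictions are invented toy data and are unrelated to the reported result.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for active, 0 for inactive.
    scores: model predictions, higher meaning more likely active.
    """
    actives = [s for l, s in zip(labels, scores) if l == 1]
    inactives = [s for l, s in zip(labels, scores) if l == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive molecule")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0   # pair ranked correctly
            elif a == i:
                wins += 0.5   # tie counts as half
    return wins / (len(actives) * len(inactives))


# Invented toy data: three actives, two inactives.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(f"{auc:.3f}")
```

Here five of the six active/inactive pairs are ordered correctly. A model that gave every molecule the same prediction would score exactly 0.5 under this definition.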
Zhang, C. (2026) ‘Deep learning-based virtual screening system for drug molecules’, Int. J. Reasoning-based Intelligent Systems, Vol. 18, No. 8, pp.44–55.
