Harnessing Machine Learning for the Virtual Screening of Natural Compounds as Both EGFR and HER2 Inhibitors in Colorectal Cancer: A Novel Therapeutic Approach.
1/5 보강
Colorectal cancer (CRC) is a type of cancer that affects the colon and rectum, with overexpression of epidermal growth factor receptor (EGFR) and human epidermal growth factor receptor 2 (HER2) observ
APA
Oku DNT, Babatunde DD, et al. (2025). Harnessing Machine Learning for the Virtual Screening of Natural Compounds as Both EGFR and HER2 Inhibitors in Colorectal Cancer: A Novel Therapeutic Approach.. ACS omega, 10(47), 57365-57384. https://doi.org/10.1021/acsomega.5c07683
MLA
Oku DNT, et al.. "Harnessing Machine Learning for the Virtual Screening of Natural Compounds as Both EGFR and HER2 Inhibitors in Colorectal Cancer: A Novel Therapeutic Approach.." ACS omega, vol. 10, no. 47, 2025, pp. 57365-57384.
PMID
41358077 ↗
Abstract 한글 요약
Colorectal cancer (CRC) is a type of cancer that affects the colon and rectum, with overexpression of epidermal growth factor receptor (EGFR) and human epidermal growth factor receptor 2 (HER2) observed in up to 85% of colorectal cancer cases. Although CRC treatment has progressed with the introduction of targeted drugs, current approaches have primarily focused on a single blockage of EGFR or HER2 to combat colon cancer. However, monotherapies that target either the EGFR or HER2 receptor frequently have low efficacy due to mutations in downstream effectors such as Kirsten Rat Sarcoma 2 Viral Oncogene Homologue (KRAS) and the activation of compensatory signaling pathways that support tumor survival and proliferation. Hence, the discovery and development of a therapy with the capability to combat CRC by simultaneously inhibiting both EGFR and HER2 remain avenues for further exploration. This study introduces a novel machine learning (ML)-based stacking ensemble framework for rapidly and accurately identifying dual EGFR and HER2 inhibitors using SMILES notation. A benchmark data set comprising active and inactive compounds against EGFR and HER2 was curated from the ChEMBL database. Based on this data set, 40 baseline models were developed and optimized using a comprehensive set of well-known molecular descriptors and ML algorithms (Figure 1). These models generated probabilistic features integrated via a logistic regression model as a final estimate to construct the final stacking ensemble model. The model was applied to bioactive compounds identified through LC-MS/MS profiling of extract as well as to a subset of natural products from the LOTUS database for virtual screening. The cytotoxic potential of was experimentally validated using the MTT assay against HCT116 colorectal cancer cells and noncancerous Vero cells, where the extract exhibited an IC value of 13.32 ± 1.09 μg/mL against HCT116 cells, indicating its significant anticancer potential. Additionally, molecular docking and ADMET studies were conducted on the top three compounds from both the LOTUS database and the predicted candidate from the LC-MS/MS data set identified by the stacking model, alongside four FDA-approved anticancer drugs for comparative analysis. Among these, LTS0131923 demonstrated the highest binding affinity against HER2 (PDB ID: 7MN5), with a binding energy of -11.2 kcal/mol and an inhibition constant of 0.00626 μM, outperforming Tucatinib, a standard CRC treatment. This study reveals the potential of ML-driven approaches to accelerate the discovery of dual-target inhibitors for CRC therapy and highlights as a promising source of bioactive compounds for cancer treatment.
📖 전문 본문 읽기 PMC JATS · ~129 KB · 영문
Introduction
1
Introduction
Colorectal cancer (CRC)
was reported in 2024 by the Global Cancer
Observatory (GLOBOCAN) as one of the leading causes of cancer-related
deaths globally. The recent advancement
of molecular-targeted drugs in CRC treatment, including anti-EGFR
antibodies, anti-VEGF antibodies, and HER2-targeting antibody drugs,
has improved the outcomes of the treatment; nonetheless, just a tiny
percentage of patients respond to these techniques. The epidermal growth factor receptor family ErbB (EGFR,
HER2/ErbB2, ErbB3, and ErbB4), which consists of receptor tyrosine
kinases (RTKs), is a well-known oncogenic driver and a therapeutic
target in several types of human cancer. They are widely expressed in epithelial tissues, where they play
significant roles in the regulation of cell survival, proliferation,
and differentiation through interlinked signal transduction involving
the activation of the PI3K/Akt and MAPK/ERK1/2 (mitogen-activated
protein (MAP) kinase/extracellular signal-regulated kinase (ERK)1/2)
pathways.
,
They function by forming dimeric signaling
units, which activate various oncogenic signaling pathways in cancer
cells. Over 80% of primary and metastatic colorectal cancers (CRCs)
are EGFR-positive.
Colorectal cancer
treatment has progressed with the introduction
of targeted drugs, particularly those that inhibit epidermal growth
factor receptor (EGFR) and human epidermal growth factor receptor
2 (HER2). Abnormal expression levels of HER2 and overexpression of
EGFR have been found in 2%–5% and 25%–77% of colorectal
cancers, respectively.
,
EGFR inhibitors, such as cetuximab
and panitumumab, hinder EGFR-mediated signaling, which is crucial
for tumor growth. However, innate or acquired resistance mechanisms,
such as mutations in downstream effectors like the Kirsten Rat Sarcoma
2 Viral Oncogene Homologue (KRAS) mutation or the activation of alternative
signaling pathways, often compromise their effectiveness.
In the same way, HER2 is another validated
target in colon cancer
therapy, especially when anti-EGFR treatments
are ineffective. It has been discovered
that HER2 amplification, a negative predictive biomarker, is known
to reduce the efficacy of therapies that target EGFR inhibitors.
,
Particularly in HER2-amplified cases, HER2-targeted therapies like
trastuzumab and lapatinib have demonstrated promise, and their dual
blockade has demonstrated efficacy in treating patients with KRAS
wild-type, HER2-positive metastatic colorectal cancer who are not
responding to standard treatments.
These treatments, however, have significant limitations. In colorectal
cancer, monotherapies that target EGFR or HER2 frequently have low
efficacy and only produce temporary improvements because of mutations
in downstream effectors like KRAS or the activation of compensatory
signaling pathways that support tumor survival and proliferation. Here, cancers also develop resistance to single-agent
therapy for a variety of reasons, which reduces the efficacy of long-term
treatment. The effectiveness of EGFR-
or HER2-targeted therapy is further limited by the diverse forms and
nature of CRC and the heterogeneity in receptor expression, which
makes it difficult to identify which individuals will benefit from
these treatments the most.
Addressing
these challenges requires new and innovative strategies
that can target multiple pathways simultaneously. In this context,
advances in computational technologies, particularly machine learning
methods, are widely applied in almost all stages of the discovery
and development of new drugs. Compared with conventional experimental
methods, these ML technologies provide more effective tools for screening
large libraries of compounds, predicting biological activity, and
enhancing therapeutic features. Machine learning has the ability to
learn from complex data sets and can easily discover patterns, which
makes it ideal for analyzing, predicting, and navigating the chemical
space of natural compounds, which are frequently overlooked in pharmaceutical
research despite their vast therapeutic value.
In this study, these advanced computational techniques
were utilized
to address the need for dual-target therapy in colorectal cancer by
focusing on both EGFR and HER2, with the goal of disrupting multiple
tumorigenic pathways, in contrast to previous studies that mostly
concentrated on single-target suppression. The intention was to conduct
a virtual screening of natural compounds that could potentially inhibit
both EGFR and HER2. By focusing on these natural compounds, we aimed
to tap into a reservoir of chemical diversity with often favorable
pharmacokinetic and toxicity profiles compared to synthetic molecules.
Additionally, the dual inhibition approach provides a strategic advantage
by potentially addressing the multifactorial nature of tumor growth
and resistance mechanisms, thereby overcoming the limitations of monotargeted
therapy.
The novel component of our study was to combine machine
learning
with an extensive cheminformatics library to project and classify
molecules that could potentially be effective dual inhibitors of EGFR/HER2
by collecting, cleaning, and processing compounds separately for EGFR
and HER2, and then combining them into one data set. A stacking ensemble
model was then built by using the merged data. This way, the model
learns features from both targets at the same time, helping us to
find compounds with the potential to inhibit both EGFR and HER2. This
strategy holds great potential for creating more potent and less harmful
colorectal cancer therapies in the future. Furthermore, it supports
the trend toward personalized treatment by providing information that
may result in more individualized and accurate cancer treatments,
depending on each patient’s unique genetic profile and the
features of their tumor. Our research attempts to establish a new
path in the discovery and development of chemotherapy by combining
powerful computational techniques with the abundant potential of natural
substances, ultimately leading to more efficient and long-term treatment
solutions.
Introduction
Colorectal cancer (CRC)
was reported in 2024 by the Global Cancer
Observatory (GLOBOCAN) as one of the leading causes of cancer-related
deaths globally. The recent advancement
of molecular-targeted drugs in CRC treatment, including anti-EGFR
antibodies, anti-VEGF antibodies, and HER2-targeting antibody drugs,
has improved the outcomes of the treatment; nonetheless, just a tiny
percentage of patients respond to these techniques. The epidermal growth factor receptor family ErbB (EGFR,
HER2/ErbB2, ErbB3, and ErbB4), which consists of receptor tyrosine
kinases (RTKs), is a well-known oncogenic driver and a therapeutic
target in several types of human cancer. They are widely expressed in epithelial tissues, where they play
significant roles in the regulation of cell survival, proliferation,
and differentiation through interlinked signal transduction involving
the activation of the PI3K/Akt and MAPK/ERK1/2 (mitogen-activated
protein (MAP) kinase/extracellular signal-regulated kinase (ERK)1/2)
pathways.
,
They function by forming dimeric signaling
units, which activate various oncogenic signaling pathways in cancer
cells. Over 80% of primary and metastatic colorectal cancers (CRCs)
are EGFR-positive.
Colorectal cancer
treatment has progressed with the introduction
of targeted drugs, particularly those that inhibit epidermal growth
factor receptor (EGFR) and human epidermal growth factor receptor
2 (HER2). Abnormal expression levels of HER2 and overexpression of
EGFR have been found in 2%–5% and 25%–77% of colorectal
cancers, respectively.
,
EGFR inhibitors, such as cetuximab
and panitumumab, hinder EGFR-mediated signaling, which is crucial
for tumor growth. However, innate or acquired resistance mechanisms,
such as mutations in downstream effectors like the Kirsten Rat Sarcoma
2 Viral Oncogene Homologue (KRAS) mutation or the activation of alternative
signaling pathways, often compromise their effectiveness.
In the same way, HER2 is another validated
target in colon cancer
therapy, especially when anti-EGFR treatments
are ineffective. It has been discovered
that HER2 amplification, a negative predictive biomarker, is known
to reduce the efficacy of therapies that target EGFR inhibitors.
,
Particularly in HER2-amplified cases, HER2-targeted therapies like
trastuzumab and lapatinib have demonstrated promise, and their dual
blockade has demonstrated efficacy in treating patients with KRAS
wild-type, HER2-positive metastatic colorectal cancer who are not
responding to standard treatments.
These treatments, however, have significant limitations. In colorectal
cancer, monotherapies that target EGFR or HER2 frequently have low
efficacy and only produce temporary improvements because of mutations
in downstream effectors like KRAS or the activation of compensatory
signaling pathways that support tumor survival and proliferation. Here, cancers also develop resistance to single-agent
therapy for a variety of reasons, which reduces the efficacy of long-term
treatment. The effectiveness of EGFR-
or HER2-targeted therapy is further limited by the diverse forms and
nature of CRC and the heterogeneity in receptor expression, which
makes it difficult to identify which individuals will benefit from
these treatments the most.
Addressing
these challenges requires new and innovative strategies
that can target multiple pathways simultaneously. In this context,
advances in computational technologies, particularly machine learning
methods, are widely applied in almost all stages of the discovery
and development of new drugs. Compared with conventional experimental
methods, these ML technologies provide more effective tools for screening
large libraries of compounds, predicting biological activity, and
enhancing therapeutic features. Machine learning has the ability to
learn from complex data sets and can easily discover patterns, which
makes it ideal for analyzing, predicting, and navigating the chemical
space of natural compounds, which are frequently overlooked in pharmaceutical
research despite their vast therapeutic value.
In this study, these advanced computational techniques
were utilized
to address the need for dual-target therapy in colorectal cancer by
focusing on both EGFR and HER2, with the goal of disrupting multiple
tumorigenic pathways, in contrast to previous studies that mostly
concentrated on single-target suppression. The intention was to conduct
a virtual screening of natural compounds that could potentially inhibit
both EGFR and HER2. By focusing on these natural compounds, we aimed
to tap into a reservoir of chemical diversity with often favorable
pharmacokinetic and toxicity profiles compared to synthetic molecules.
Additionally, the dual inhibition approach provides a strategic advantage
by potentially addressing the multifactorial nature of tumor growth
and resistance mechanisms, thereby overcoming the limitations of monotargeted
therapy.
The novel component of our study was to combine machine
learning
with an extensive cheminformatics library to project and classify
molecules that could potentially be effective dual inhibitors of EGFR/HER2
by collecting, cleaning, and processing compounds separately for EGFR
and HER2, and then combining them into one data set. A stacking ensemble
model was then built by using the merged data. This way, the model
learns features from both targets at the same time, helping us to
find compounds with the potential to inhibit both EGFR and HER2. This
strategy holds great potential for creating more potent and less harmful
colorectal cancer therapies in the future. Furthermore, it supports
the trend toward personalized treatment by providing information that
may result in more individualized and accurate cancer treatments,
depending on each patient’s unique genetic profile and the
features of their tumor. Our research attempts to establish a new
path in the discovery and development of chemotherapy by combining
powerful computational techniques with the abundant potential of natural
substances, ultimately leading to more efficient and long-term treatment
solutions.
Methodology
2
Methodology
2.1
Data Set Compilation
A data set of
21,991 unique compounds tested for activity against HER2 (Target ID:
CHEMBL1871) and EGFR (Target ID: CHEMBL203) was obtained from the
ChEMBL database. Of those, 7,165 molecules were tested for HER2, and
the remaining belonged to the EGFR inhibitors.
2.2
Data Preprocessing
All “standard_value”
(IC50 values) from the data set, which indicate the concentration
needed to inhibit a biological process by 50%, were standardized to
micromoles (μM). Entries that contained ambiguous bioactivity
values, specifically symbols like “<”, “>”,
and “/”, were removed because they typically indicate
approximate or range-bound values that could potentially skew analytical
results and model predictions, while those with the symbol “=”
in their “Standard Value” column were retained. Entries
in the standard_value and canonical_smiles that contained missing
information were eliminated.
2.3
Data Curation
Compounds with a bioactivity
unit of IC50 (half-maximal inhibitory concentration) were
selected to form the final data set, which consisted of 17,287 compounds.
These compounds were classified into two classes: compounds with IC50 ≤ 1 μM were considered active
(positive samples), while compounds with IC50 ≥ 10
μM were considered inactive (negative samples). The intermediate
values between 1 and 10 μM (1 ≤ IC50 ≤
10) were excluded from the data set to ensure a clear binary classification
between active and inactive compounds. To assess how the final trained
stacking model handles intermediate IC50 values (1–10
μM), a virtual screening on the excluded compounds labeled as
“class = 2” was performed, and the predicted probabilities
were analyzed. The pairwise Tanimoto coefficient was calculated for
all the compounds using the Molecular Access System (MACCS) fingerprint
implemented in the RDKit package, and
duplicates of the compounds (Tanimoto coefficient equal to 1) were
removed, and the average IC50 of all the copies was used.
2.4
Chemical Space Analysis
Eight physicochemical
properties that are essential for evaluating the compounds’
molecular complexity and drug-likeness for both the active and inactive
groups of the data set were computed, visualized, and compared. Molecular
weight (MW), the Ghose–Crippen–Viswanadhan octanol–water
partition coefficient (ALog P),
the number of hydrogen-bond acceptors (nHAcc), the
number of hydrogen-bond donors (nHDon), the aromatic
ratio (ARR), the number of rings (nCIC), the number
of rotatable bonds (RBN), and the number of benzene-like rings (nBnz) are among the properties listed in Lipinski’s
Rule of Five (Ro5). This was followed by an assessment of the mean,
median, minimum, and maximum values. The p values
derived from the Mann–Whitney U test (at the threshold of p < 0.001) were used to establish statistical significance.
2.5
Molecular Fingerprint
The machine
learning model was built using descriptors, such as molecular fingerprints,
and the settings for the fingerprints were based on the literature. The SMILES notation was used as a binary form
of molecular structure representation to obtain the structural features
of the compounds. RDKit (https://www.rdkit.org/) was used to implement the SMILES notation and calculate ten molecular
fingerprint descriptors (AP2D, CDK, CDKExtended, CDKGraph, KR, MACCS,
Circle, Estate, Hybrid, and PubChem). Features for virtual screening
were encoded using molecular fingerprints. For example, atom pairs
at various topological distances can be identified using the 2D Atom
Pair (AP2D) fingerprint, generating 780
distinct features. The CDK fingerprint,
on the other hand, creates a 2,048-bit representation based on a search
depth of eight. Building on this, the CDK Extended (CDKExt) version incorporates additional bits to capture
ring-specific characteristics, while CDK Graph Only (CKDGraph) simplifies the representation by focusing fully
on molecular connectivity while disregarding bond order. The Circle
fingerprint is another known technique
that encodes circular molecular features using a 2,048-bit structure.
Additionally, the EState fingerprint has
a distinct focus on atomic properties, capturing electrotopological
states with only 79 features. Hybrid fingerprints generate a 2,048-bit encoding by combining both hybridization
and data from CDK. With 2,048 chemical substructures encoded, the
Klekota–Roth (KR) fingerprint also
provides a detailed representation. While PubChem fingerprints incorporate 881 features generated from substructure
definitions in the PubChem database, MACCS fingerprints use 166 binary features specified by MACCS keys,
which are smaller but no less relevant. Each sort of fingerprint assists
the predictive models in virtual screening and is an essential tool
for accurately representing molecular structures.
2.6
Construction of Training and Independent Test
Data Sets
The final data set consisted of 6325 active and
2179 inactive compounds. Twenty percent of the compounds were randomly
sampled and set aside to create the independent test data set consisting
of 1265 active and 436 inactive compounds (named HER2-IND), and the
remaining 80%, consisting of 5060 active and 1743 inactive compounds,
was used for the construction of the training data set (named HER2-TRN).
In order to address class imbalance, sampling strategies (Random Undersampling,
SMOTE, and combined SMOTE + Undersampling) were integrated into the
model training pipeline. These were applied during cross-validation
and consistently evaluated across the different classifiers.
2.7
Machine Learning Model Development and Evaluation
The prepared database was utilized to build ML models after fingerprint
descriptors were generated with RDKit AllChem software. The baseline
development and stacking ensemble phases were the two primary stages
of this study. In the baseline model development stage, 11 ML algorithms
and 10 molecular fingerprint descriptors were employed to develop
these baseline models. The following popular ML algorithms, including
Random Forest (RF), AdaBoost (ADA), Light Gradient Boosting Machine (LGBM), Multilayer Perceptron (MLP), Decision tree (DT), Extremely
Randomized Trees (ET), Extreme Gradient
Boosting (XGB),
k-Nearest
Neighbor (KNN), Logistic Regression (LR),
and Support Vector Machine (SVM) combined
with linear (SVMLN) and radial basis function (SVMRBF) kernels, were
considered for the construction of baseline models. Among all the
models evaluated, those with the best cross-validation Matthew’s
Correlation Coefficient (MCC) performance were selected for the construction
of the baseline models. Specifically, four (RF, MLP, ET, LR) out of
the 11 ML algorithms were chosen based on their performance. Each
of these four algorithms was trained on each fingerprint type, resulting
in a total of 40 baseline models.
2.8
Stacking Ensemble Method
A stacking
learning technique was employed to create a stacking ensemble in this
phase. The method for the selection of the ML models required to construct
the 40 baselines was based on the top-performing single feature-based
models from the independent test data (HER2-IND) and HER2-TRN. After
obtaining 40 baselines, the 40 baseline models were then used to generate
a feature vector for each compound, which served as input to construct
the meta-model in the stacking ensemble. For any given compound Q,
each of the 40 baseline models outputs a predicted probability (PF),
which ranges from 0 to 1. These individual probabilities are combined
to form a 40-dimensional feature vector, denoted as FV(Q).
This feature vector can be represented as FV(Q)= {PFBM1,1, PFBM1,1, ··· ,
PFBMi, j, ··· , PFBM4,10}.
where PFBM
i, j
is the predicted
probability (PF) from the baseline model (BM) that was trained using
the ith machine learning algorithm and the jth fingerprint descriptor, i∈ {1,
2, 3, 4} represents the four selected ML models (RF, MLP, ET, LR),
and j∈ {1, 2, ··· ,10} corresponds
to the ten molecular fingerprinting methods. To construct the stacking
method, the output from our baseline model development was utilized
as the new training data for the ensemble. The stacking ensemble was
created by combining the predictions of these models using the Logistic
Regression model as a final estimator. The stacking process used 5-fold
cross-validation to prevent overfitting and offer dependable training.
The model was trained on both HER2-TRN and HER2-IND data sets using
parallel processing to speed up. The model generated predictions for
both the training and test sets, including their class labels and
probabilities. Its performance was then evaluated using key metrics
such as accuracy (ACC), F1 score, sensitivity (Sn),
specificity (Sp), Matthew’s Correlation Coefficient (MCC),
and Area Under the Curve (AUC). The stacking ensemble was constructed
using predictions from single feature-based models trained on the
balanced subsets obtained from random undersampling, SMOTE, and combined
sampling, enabling improved detection of active compounds without
introducing excessive variance.
2.9
Dual-Target Data Set Integration and Model
Development
The two different data sets of bioactive compounds,
one targeting EGFR and the other targeting HER2, were collected and
curated separately. After standardizing and preprocessing each data
set independently, they were merged to create a combined data set
for model training. This unified data set was used to train the stacking
ensemble model that learned patterns associated with both EGFR and
HER2 inhibition. By integrating both targets into a single predictive
stacking framework, we aimed to enhance the model’s ability
to identify compounds with potential dual-inhibitory activity. The
stacking ensemble model was validated on the HER2-IND data set. In
order to prevent overfitting and data leakage, the test set (HER2-IND)
was strictly isolated during the model training and feature selection.
All preprocessing steps, including fingerprint generation and scaling,
were applied only to the training data (HER2-TRN), and the same transformations
were then used on the test data. This demonstrated superior predictive
performance across all evaluation metrics on the test data set.
2.10
Model Application
The model was
applied in two main instances. First, it was used to screen a data
set obtained from a natural product database (LOTUS). Second, it was
applied to phytochemical compounds identified from the pod extract
of Ceratonia siliqua
L. Liquid chromatography-tandem mass spectrometry (LC-MS/MS) profiling
was employed to identify the phytochemical compounds present in the
carob pod extract. The identified compounds were then subjected to
virtual screening using the developed stacking ensemble machine learning
model.
2.11
Anticancer Activity of Ceratonia
siliqua L. Extract
2.11.1
Plant Material Collection and Identification
Fresh samples of Ceratonia siliqua
L. pods were collected in November 2024 from the
Kazimingi Nursery Farm in Benoni (S26° 04′ 30.36″
| E28° 19′ 32.99″), located in the Gauteng province
of South Africa. The specimen was identified by Dr. R. Munyai at the
College of Agriculture and Environmental Sciences (CAES) Horticulture
Research Center, University of South Africa. A voucher specimen with
the number UNISA/CAES/2024/SB0054 has been deposited at the UNISA
herbarium for reference purposes. The plant specimens were air-dried
indoors at around 20 °C for 14 days with occasional turning and
then ground into a fine powder.
2.11.2
Extraction
Extraction from the
carob pod sample was conducted using a cold maceration method, where
250 g of powdered pods were macerated in a 1:1 ethanol–water
mixture for 72 h. The mixture was filtered, and the filtrate was concentrated
under vacuum using a rotary evaporator at 60–70 °C. The
resulting extract was air dried to a constant weight and stored in
preweighed glass vials at 4 °C in the dark until use.
2.11.3
MTT Cytotoxicity Assay
The MTT
(3-[4,5-dimethylthiazol-2-yl]-2,5-diphenyl tetrazolium bromide) cell
proliferation assay involves the conversion of MTT into formazan crystals
by living cells. The reduction of MTT to formazan by mitochondrial
succinate dehydrogenase enzyme activity is the hallmark of measuring
cell viability or death. The assay was conducted following a slightly
modified version of the method described by van Meerloo et al. (2011). Human colorectal cells (HCT116) were seeded
into a 96-well plate at 1 × 104 cells/well and incubated
for 24 hours at 37 °C in 5.0% CO2 saturation
to allow attachment prior to treatment. The cells were treated with
various gradients of extract concentrations ranging from 100 to 3.125 μg/mL,
while doxorubicin was used as a positive control with a concentration
range of 100–3.125 μg/mL. The 96-well plates were
incubated for 24 hours, after which MTT (5 mg/mL) was
added, and the plates were incubated for a further hour. Dimethyl
sulfoxide (DMSO) was used as a solubilizing agent to dissolve the
formazan, and the plates were spectrophotometrically measured at a
wavelength of 570 nm using a microplate reader (Varioskan Flash,
Thermo Fisher Scientific). The percentage of cell viability was calculated
using the formula below:where AT and AC represent the absorbance of
the extract-treated cells and untreated control cells, respectively.
2.12
LC-MS/MS Analysis
2.12.1
Instrumentation
The compounds
were extracted using a Thermo Scientific Q Exactive Plus Hybrid Quadrupole-Orbitrap
Mass Spectrometer coupled to a Thermo Scientific Dionex UltiMate 3000
UHPLC system (Thermo Fisher Scientific, Waltham, MA, USA). Detection
was carried out using the Exactive Plus LC-MS/MS equipped with a heated
electrospray ionization (HESI) source in multiple reaction monitoring
(MRM) mode. Source parameters were optimized to suit the HPLC flow
rate, including a capillary temperature of 290 °C, a spray voltage
of 3.5 kV, a sheath gas flow set to 50 arbitrary units, and
an auxiliary gas temperature of 400 °C. Data acquisition was
performed in full MS, single ion monitoring (SIM), and all-ion fragmentation
(AIF) modes, using positive ionization within an m/z range of 80 to 750, maintaining a mass accuracy
within 5 ppm. The instrument operated at a resolution of 70,000 full
width at half-maximum (fwhm), with an automatic gain control (AGC)
target of 1.0 × 106 and a maximum injection time of
100 ms. Mass calibration for accuracy was carried out using Thermo
Scientific Pierce calibration standards for both positive and negative
ion modes. Instrument control, data collection, and processing were
achieved using the Thermo Scientific TraceFinder (version 5.1) software.
2.12.2
Chromatographic Procedure for Identification
of the Compounds
The concentrated extract of Ceratonia siliqua L (1 g) was first dissolved in
100 mL of water and acetonitrile (1:1 v/v) and then filtered using
a 0.45 μm PVDF membrane filter prior to analysis.
The
components of the extracts were separated on an XTerra MS C18 (100
× 4.6 mm, 3.5 μm) analytical column (Waters Corporation,
Milford, MA, USA). The mobile phase consisted of 0.1% (v/v) formic
acid in water as mobile phase A and 0.1% (v/v) formic acid in acetonitrile
as mobile phase B. The optimum mobile phase composition was as follows:
0–3 min, 2%–5% B; 3–10 min, 5%–15% B;
10–16 min, 15%–30% B; 16–22 min, 30%–60%
B; 22–26 min, 60%–90% B; 26–28 min, 90%–5%
B; and 28–32 min, 5%–5% B. The flow rate was set at
0.9 mL/min, the column temperature was set at a constant 25 °C,
and the injection volume was 5 μL.
The acquired LC-MS/MS
data were processed and analyzed using the
Global Natural Products Social Molecular Networking (GNPS 2.0) platform. The network topology and library search parameters
were adjusted to align with the default settings of the natural product
workflow. The workflow included background subtraction using blank
data, feature detection, elemental composition determination, library
matching, and an MS/MS fragmentation search. The identification of
compounds was primarily based on MS/MS spectral matching against the
following libraries: MoNA, MSNLIB-positive, MSNLIB-negative, NEO-MSMS,
Berkeley-Lab, Birmingham-UHPLC-MS-POS, MASSBANK, GNPS-NIST14-MATCHES,
and GNPS-NIH.
2.13
Virtual Screening of LC-MS/MS-Identified
Compounds from Ceratonia siliqua
L.
The data set for virtual screening was a combination
of PubChem IDs and SMILES representations of the identified compounds
obtained from LC-MS/MS analysis. The built stacking ensemble was used
to virtually screen and compute the probability of the activity of
the compounds against both HER2 and EGFR receptors in the treatment
of colorectal cancer. The model was then applied to four standard
FDA-approved drugs (Doxorubicin, Abemaciclib, Gemcitabine, and Tucatinib).
2.14
Virtual Screening of the LOTUS Database
The data set for virtual screening was a combination of ID and
SMILES data, available in the Lotus Natural Products (LOTUS) database.
Out of the 271,000 natural compounds in the LOTUS database, 1400 were
randomly chosen. The built stacking ensemble was used to virtually
screen and compute the probability of the activity of 1400 compounds
against both HER2 and EGFR.
2.15
Molecular Docking
The identification
of potent ligands against both HER2 and EGFR is critical for advancing
targeted cancer therapies. This study employed a developed stacking
ensemble algorithm to select four promising ligands (NCGC00385704–01,
LTS0018034, LTS0018035, and LTS0131924). These ligands and four FDA-approved
drugs (Doxorubicin, Abemaciclib, Gemcitabine, and Tucatinib) obtained
from the PubChem database, underwent energy minimization using the
MM2 force field in Chem3D (PerkinElmer Informatics, 2020) and were
converted to PDB format for molecular docking simulations. The molecular
docking simulations assessed binding affinity and molecular interactions
with HER2 (7MN5, resolution 2.93 Å) and EGFR (7ZYM, resolution 2.50 Å), with their crystal structures retrieved from the RCSB Protein Data
Bank. The downloaded protein crystal
structures were prepared using Discovery Studio Visualizer (v19.1.0.18287)
by removing heteroatoms and water molecules, followed by energy minimization
before docking. AutoDock Vina, incorporated within PyRx, was utilized
for the docking simulation. Grid parameters were centered within the
reported receptors’ binding pockets as described in the literature.
Meanwhile, the docking outputs were examined through RMSD docking
score calculations and were visually analyzed to obtain atomic interactions
using Discovery Studio for 2D interaction maps, while ChimeraX 1.9
was used to obtain 3D representations.
2.16
In Silico ADMET Analysis
The pharmacokinetic properties (physicochemical analysis, absorption,
distribution, metabolism, and excretion) of the top-hit compounds
from virtual screening and selected FDA-approved reference drugs were
evaluated using the Swiss ADME (http://www.swissadme.ch) and ADMET Lab 2.0 (https://admetmesh.scbdd.com) web platforms. The physicochemical parameters depend on factors,
including molecular weight, number of hydrogen bond acceptors, hydrogen
bond donors, number of rotatable bonds, lipophilicity, solubility,
and topological surface area.
Lipinski’s Rule of Five
was used to evaluate the drug-likeness of the selected compounds.
Key absorption parameters include Caco-2 cell permeability and human
intestinal absorption (HIA). Excretion properties were predicted based
on clearance rate and biological half-life. The SMILES representations
of the reference compounds and active compounds from the LC-MS/MS
analysis were retrieved from PubChem (https://pubchem.ncbi.nlm.nih.gov), and those of the top hit compounds were obtained from the LOTUS
database. The pharmacokinetic properties were determined by submitting
the SMILES strings to the Swiss ADME and ADMET Lab 2.0 web servers.
Methodology
2.1
Data Set Compilation
A data set of
21,991 unique compounds tested for activity against HER2 (Target ID:
CHEMBL1871) and EGFR (Target ID: CHEMBL203) was obtained from the
ChEMBL database. Of those, 7,165 molecules were tested for HER2, and
the remaining belonged to the EGFR inhibitors.
2.2
Data Preprocessing
All “standard_value”
(IC50 values) from the data set, which indicate the concentration
needed to inhibit a biological process by 50%, were standardized to
micromoles (μM). Entries that contained ambiguous bioactivity
values, specifically symbols like “<”, “>”,
and “/”, were removed because they typically indicate
approximate or range-bound values that could potentially skew analytical
results and model predictions, while those with the symbol “=”
in their “Standard Value” column were retained. Entries
in the standard_value and canonical_smiles that contained missing
information were eliminated.
2.3
Data Curation
Compounds with a bioactivity
unit of IC50 (half-maximal inhibitory concentration) were
selected to form the final data set, which consisted of 17,287 compounds.
These compounds were classified into two classes: compounds with IC50 ≤ 1 μM were considered active
(positive samples), while compounds with IC50 ≥ 10
μM were considered inactive (negative samples). The intermediate
values between 1 and 10 μM (1 ≤ IC50 ≤
10) were excluded from the data set to ensure a clear binary classification
between active and inactive compounds. To assess how the final trained
stacking model handles intermediate IC50 values (1–10
μM), a virtual screening on the excluded compounds labeled as
“class = 2” was performed, and the predicted probabilities
were analyzed. The pairwise Tanimoto coefficient was calculated for
all the compounds using the Molecular Access System (MACCS) fingerprint
implemented in the RDKit package, and
duplicates of the compounds (Tanimoto coefficient equal to 1) were
removed, and the average IC50 of all the copies was used.
2.4
Chemical Space Analysis
Eight physicochemical
properties that are essential for evaluating the compounds’
molecular complexity and drug-likeness for both the active and inactive
groups of the data set were computed, visualized, and compared. Molecular
weight (MW), the Ghose–Crippen–Viswanadhan octanol–water
partition coefficient (ALog P),
the number of hydrogen-bond acceptors (nHAcc), the
number of hydrogen-bond donors (nHDon), the aromatic
ratio (ARR), the number of rings (nCIC), the number
of rotatable bonds (RBN), and the number of benzene-like rings (nBnz) are among the properties listed in Lipinski’s
Rule of Five (Ro5). This was followed by an assessment of the mean,
median, minimum, and maximum values. The p values
derived from the Mann–Whitney U test (at the threshold of p < 0.001) were used to establish statistical significance.
2.5
Molecular Fingerprint
The machine
learning model was built using descriptors, such as molecular fingerprints,
and the settings for the fingerprints were based on the literature. The SMILES notation was used as a binary form
of molecular structure representation to obtain the structural features
of the compounds. RDKit (https://www.rdkit.org/) was used to implement the SMILES notation and calculate ten molecular
fingerprint descriptors (AP2D, CDK, CDKExtended, CDKGraph, KR, MACCS,
Circle, Estate, Hybrid, and PubChem). Features for virtual screening
were encoded using molecular fingerprints. For example, atom pairs
at various topological distances can be identified using the 2D Atom
Pair (AP2D) fingerprint, generating 780
distinct features. The CDK fingerprint,
on the other hand, creates a 2,048-bit representation based on a search
depth of eight. Building on this, the CDK Extended (CDKExt) version incorporates additional bits to capture
ring-specific characteristics, while CDK Graph Only (CKDGraph) simplifies the representation by focusing fully
on molecular connectivity while disregarding bond order. The Circle
fingerprint is another known technique
that encodes circular molecular features using a 2,048-bit structure.
Additionally, the EState fingerprint has
a distinct focus on atomic properties, capturing electrotopological
states with only 79 features. Hybrid fingerprints generate a 2,048-bit encoding by combining both hybridization
and data from CDK. With 2,048 chemical substructures encoded, the
Klekota–Roth (KR) fingerprint also
provides a detailed representation. While PubChem fingerprints incorporate 881 features generated from substructure
definitions in the PubChem database, MACCS fingerprints use 166 binary features specified by MACCS keys,
which are smaller but no less relevant. Each sort of fingerprint assists
the predictive models in virtual screening and is an essential tool
for accurately representing molecular structures.
2.6
Construction of Training and Independent Test
Data Sets
The final data set consisted of 6325 active and
2179 inactive compounds. Twenty percent of the compounds were randomly
sampled and set aside to create the independent test data set consisting
of 1265 active and 436 inactive compounds (named HER2-IND), and the
remaining 80%, consisting of 5060 active and 1743 inactive compounds,
was used for the construction of the training data set (named HER2-TRN).
In order to address class imbalance, sampling strategies (Random Undersampling,
SMOTE, and combined SMOTE + Undersampling) were integrated into the
model training pipeline. These were applied during cross-validation
and consistently evaluated across the different classifiers.
2.7
Machine Learning Model Development and Evaluation
The prepared database was utilized to build ML models after fingerprint
descriptors were generated with RDKit AllChem software. The baseline
development and stacking ensemble phases were the two primary stages
of this study. In the baseline model development stage, 11 ML algorithms
and 10 molecular fingerprint descriptors were employed to develop
these baseline models. The following popular ML algorithms, including
Random Forest (RF), AdaBoost (ADA), Light Gradient Boosting Machine (LGBM), Multilayer Perceptron (MLP), Decision tree (DT), Extremely
Randomized Trees (ET), Extreme Gradient
Boosting (XGB),
k-Nearest
Neighbor (KNN), Logistic Regression (LR),
and Support Vector Machine (SVM) combined
with linear (SVMLN) and radial basis function (SVMRBF) kernels, were
considered for the construction of baseline models. Among all the
models evaluated, those with the best cross-validation Matthew’s
Correlation Coefficient (MCC) performance were selected for the construction
of the baseline models. Specifically, four (RF, MLP, ET, LR) out of
the 11 ML algorithms were chosen based on their performance. Each
of these four algorithms was trained on each fingerprint type, resulting
in a total of 40 baseline models.
2.8
Stacking Ensemble Method
A stacking
learning technique was employed to create a stacking ensemble in this
phase. The method for the selection of the ML models required to construct
the 40 baselines was based on the top-performing single feature-based
models from the independent test data (HER2-IND) and HER2-TRN. After
obtaining 40 baselines, the 40 baseline models were then used to generate
a feature vector for each compound, which served as input to construct
the meta-model in the stacking ensemble. For any given compound Q,
each of the 40 baseline models outputs a predicted probability (PF),
which ranges from 0 to 1. These individual probabilities are combined
to form a 40-dimensional feature vector, denoted as FV(Q).
This feature vector can be represented as FV(Q)= {PFBM1,1, PFBM1,1, ··· ,
PFBMi, j, ··· , PFBM4,10}.
where PFBM
i, j
is the predicted
probability (PF) from the baseline model (BM) that was trained using
the ith machine learning algorithm and the jth fingerprint descriptor, i∈ {1,
2, 3, 4} represents the four selected ML models (RF, MLP, ET, LR),
and j∈ {1, 2, ··· ,10} corresponds
to the ten molecular fingerprinting methods. To construct the stacking
method, the output from our baseline model development was utilized
as the new training data for the ensemble. The stacking ensemble was
created by combining the predictions of these models using the Logistic
Regression model as a final estimator. The stacking process used 5-fold
cross-validation to prevent overfitting and offer dependable training.
The model was trained on both HER2-TRN and HER2-IND data sets using
parallel processing to speed up. The model generated predictions for
both the training and test sets, including their class labels and
probabilities. Its performance was then evaluated using key metrics
such as accuracy (ACC), F1 score, sensitivity (Sn),
specificity (Sp), Matthew’s Correlation Coefficient (MCC),
and Area Under the Curve (AUC). The stacking ensemble was constructed
using predictions from single feature-based models trained on the
balanced subsets obtained from random undersampling, SMOTE, and combined
sampling, enabling improved detection of active compounds without
introducing excessive variance.
2.9
Dual-Target Data Set Integration and Model
Development
The two different data sets of bioactive compounds,
one targeting EGFR and the other targeting HER2, were collected and
curated separately. After standardizing and preprocessing each data
set independently, they were merged to create a combined data set
for model training. This unified data set was used to train the stacking
ensemble model that learned patterns associated with both EGFR and
HER2 inhibition. By integrating both targets into a single predictive
stacking framework, we aimed to enhance the model’s ability
to identify compounds with potential dual-inhibitory activity. The
stacking ensemble model was validated on the HER2-IND data set. In
order to prevent overfitting and data leakage, the test set (HER2-IND)
was strictly isolated during the model training and feature selection.
All preprocessing steps, including fingerprint generation and scaling,
were applied only to the training data (HER2-TRN), and the same transformations
were then used on the test data. This demonstrated superior predictive
performance across all evaluation metrics on the test data set.
2.10
Model Application
The model was
applied in two main instances. First, it was used to screen a data
set obtained from a natural product database (LOTUS). Second, it was
applied to phytochemical compounds identified from the pod extract
of Ceratonia siliqua
L. Liquid chromatography-tandem mass spectrometry (LC-MS/MS) profiling
was employed to identify the phytochemical compounds present in the
carob pod extract. The identified compounds were then subjected to
virtual screening using the developed stacking ensemble machine learning
model.
2.11
Anticancer Activity of Ceratonia
siliqua L. Extract
2.11.1
Plant Material Collection and Identification
Fresh samples of Ceratonia siliqua
L. pods were collected in November 2024 from the
Kazimingi Nursery Farm in Benoni (S26° 04′ 30.36″
| E28° 19′ 32.99″), located in the Gauteng province
of South Africa. The specimen was identified by Dr. R. Munyai at the
College of Agriculture and Environmental Sciences (CAES) Horticulture
Research Center, University of South Africa. A voucher specimen with
the number UNISA/CAES/2024/SB0054 has been deposited at the UNISA
herbarium for reference purposes. The plant specimens were air-dried
indoors at around 20 °C for 14 days with occasional turning and
then ground into a fine powder.
2.11.2
Extraction
Extraction from the
carob pod sample was conducted using a cold maceration method, where
250 g of powdered pods were macerated in a 1:1 ethanol–water
mixture for 72 h. The mixture was filtered, and the filtrate was concentrated
under vacuum using a rotary evaporator at 60–70 °C. The
resulting extract was air dried to a constant weight and stored in
preweighed glass vials at 4 °C in the dark until use.
2.11.3
MTT Cytotoxicity Assay
The MTT
(3-[4,5-dimethylthiazol-2-yl]-2,5-diphenyl tetrazolium bromide) cell
proliferation assay involves the conversion of MTT into formazan crystals
by living cells. The reduction of MTT to formazan by mitochondrial
succinate dehydrogenase enzyme activity is the hallmark of measuring
cell viability or death. The assay was conducted following a slightly
modified version of the method described by van Meerloo et al. (2011). Human colorectal cells (HCT116) were seeded
into a 96-well plate at 1 × 104 cells/well and incubated
for 24 hours at 37 °C in 5.0% CO2 saturation
to allow attachment prior to treatment. The cells were treated with
various gradients of extract concentrations ranging from 100 to 3.125 μg/mL,
while doxorubicin was used as a positive control with a concentration
range of 100–3.125 μg/mL. The 96-well plates were
incubated for 24 hours, after which MTT (5 mg/mL) was
added, and the plates were incubated for a further hour. Dimethyl
sulfoxide (DMSO) was used as a solubilizing agent to dissolve the
formazan, and the plates were spectrophotometrically measured at a
wavelength of 570 nm using a microplate reader (Varioskan Flash,
Thermo Fisher Scientific). The percentage of cell viability was calculated
using the formula below:where AT and AC represent the absorbance of
the extract-treated cells and untreated control cells, respectively.
2.12
LC-MS/MS Analysis
2.12.1
Instrumentation
The compounds
were extracted using a Thermo Scientific Q Exactive Plus Hybrid Quadrupole-Orbitrap
Mass Spectrometer coupled to a Thermo Scientific Dionex UltiMate 3000
UHPLC system (Thermo Fisher Scientific, Waltham, MA, USA). Detection
was carried out using the Exactive Plus LC-MS/MS equipped with a heated
electrospray ionization (HESI) source in multiple reaction monitoring
(MRM) mode. Source parameters were optimized to suit the HPLC flow
rate, including a capillary temperature of 290 °C, a spray voltage
of 3.5 kV, a sheath gas flow set to 50 arbitrary units, and
an auxiliary gas temperature of 400 °C. Data acquisition was
performed in full MS, single ion monitoring (SIM), and all-ion fragmentation
(AIF) modes, using positive ionization within an m/z range of 80 to 750, maintaining a mass accuracy
within 5 ppm. The instrument operated at a resolution of 70,000 full
width at half-maximum (fwhm), with an automatic gain control (AGC)
target of 1.0 × 106 and a maximum injection time of
100 ms. Mass calibration for accuracy was carried out using Thermo
Scientific Pierce calibration standards for both positive and negative
ion modes. Instrument control, data collection, and processing were
achieved using the Thermo Scientific TraceFinder (version 5.1) software.
2.12.2
Chromatographic Procedure for Identification
of the Compounds
The concentrated extract of Ceratonia siliqua L (1 g) was first dissolved in
100 mL of water and acetonitrile (1:1 v/v) and then filtered using
a 0.45 μm PVDF membrane filter prior to analysis.
The
components of the extracts were separated on an XTerra MS C18 (100
× 4.6 mm, 3.5 μm) analytical column (Waters Corporation,
Milford, MA, USA). The mobile phase consisted of 0.1% (v/v) formic
acid in water as mobile phase A and 0.1% (v/v) formic acid in acetonitrile
as mobile phase B. The optimum mobile phase composition was as follows:
0–3 min, 2%–5% B; 3–10 min, 5%–15% B;
10–16 min, 15%–30% B; 16–22 min, 30%–60%
B; 22–26 min, 60%–90% B; 26–28 min, 90%–5%
B; and 28–32 min, 5%–5% B. The flow rate was set at
0.9 mL/min, the column temperature was set at a constant 25 °C,
and the injection volume was 5 μL.
The acquired LC-MS/MS
data were processed and analyzed using the
Global Natural Products Social Molecular Networking (GNPS 2.0) platform. The network topology and library search parameters
were adjusted to align with the default settings of the natural product
workflow. The workflow included background subtraction using blank
data, feature detection, elemental composition determination, library
matching, and an MS/MS fragmentation search. The identification of
compounds was primarily based on MS/MS spectral matching against the
following libraries: MoNA, MSNLIB-positive, MSNLIB-negative, NEO-MSMS,
Berkeley-Lab, Birmingham-UHPLC-MS-POS, MASSBANK, GNPS-NIST14-MATCHES,
and GNPS-NIH.
2.13
Virtual Screening of LC-MS/MS-Identified
Compounds from Ceratonia siliqua
L.
The data set for virtual screening was a combination
of PubChem IDs and SMILES representations of the identified compounds
obtained from LC-MS/MS analysis. The built stacking ensemble was used
to virtually screen and compute the probability of the activity of
the compounds against both HER2 and EGFR receptors in the treatment
of colorectal cancer. The model was then applied to four standard
FDA-approved drugs (Doxorubicin, Abemaciclib, Gemcitabine, and Tucatinib).
2.14
Virtual Screening of the LOTUS Database
The data set for virtual screening was a combination of ID and
SMILES data, available in the Lotus Natural Products (LOTUS) database.
Out of the 271,000 natural compounds in the LOTUS database, 1400 were
randomly chosen. The built stacking ensemble was used to virtually
screen and compute the probability of the activity of 1400 compounds
against both HER2 and EGFR.
2.15
Molecular Docking
The identification
of potent ligands against both HER2 and EGFR is critical for advancing
targeted cancer therapies. This study employed a developed stacking
ensemble algorithm to select four promising ligands (NCGC00385704–01,
LTS0018034, LTS0018035, and LTS0131924). These ligands and four FDA-approved
drugs (Doxorubicin, Abemaciclib, Gemcitabine, and Tucatinib) obtained
from the PubChem database, underwent energy minimization using the
MM2 force field in Chem3D (PerkinElmer Informatics, 2020) and were
converted to PDB format for molecular docking simulations. The molecular
docking simulations assessed binding affinity and molecular interactions
with HER2 (7MN5, resolution 2.93 Å) and EGFR (7ZYM, resolution 2.50 Å), with their crystal structures retrieved from the RCSB Protein Data
Bank. The downloaded protein crystal
structures were prepared using Discovery Studio Visualizer (v19.1.0.18287)
by removing heteroatoms and water molecules, followed by energy minimization
before docking. AutoDock Vina, incorporated within PyRx, was utilized
for the docking simulation. Grid parameters were centered within the
reported receptors’ binding pockets as described in the literature.
Meanwhile, the docking outputs were examined through RMSD docking
score calculations and were visually analyzed to obtain atomic interactions
using Discovery Studio for 2D interaction maps, while ChimeraX 1.9
was used to obtain 3D representations.
2.16
In Silico ADMET Analysis
The pharmacokinetic properties (physicochemical analysis, absorption,
distribution, metabolism, and excretion) of the top-hit compounds
from virtual screening and selected FDA-approved reference drugs were
evaluated using the Swiss ADME (http://www.swissadme.ch) and ADMET Lab 2.0 (https://admetmesh.scbdd.com) web platforms. The physicochemical parameters depend on factors,
including molecular weight, number of hydrogen bond acceptors, hydrogen
bond donors, number of rotatable bonds, lipophilicity, solubility,
and topological surface area.
Lipinski’s Rule of Five
was used to evaluate the drug-likeness of the selected compounds.
Key absorption parameters include Caco-2 cell permeability and human
intestinal absorption (HIA). Excretion properties were predicted based
on clearance rate and biological half-life. The SMILES representations
of the reference compounds and active compounds from the LC-MS/MS
analysis were retrieved from PubChem (https://pubchem.ncbi.nlm.nih.gov), and those of the top hit compounds were obtained from the LOTUS
database. The pharmacokinetic properties were determined by submitting
the SMILES strings to the Swiss ADME and ADMET Lab 2.0 web servers.
Results and Discussion
3
Results and Discussion
3.1
Data Distribution Analysis
Following
the application of the IC50 cutoff and the exclusion of
duplicates based on the Tanimoto coefficient (where the coefficient
equals 1), a total of 17,287 unique compounds were identified. Thereafter,
the average IC50 for all remaining duplicates was calculated,
and the data set was classified accordingly. This process resulted
in 6,325 active compounds and 2,179 inactive compounds, totaling 8,504
inhibitors targeting both the EGFR and HER2 receptors. Although compounds
with IC50 values between 1–10 μM (intermediate
class) were excluded from model training to establish a clear binary
classification boundary, we evaluated the trained stacking model on
these compounds to assess its generalizability. The model was applied
to a set of 1,626 intermediate compounds, and their predicted probabilities
were calculated. Of these, 42.8% were predicted as active (probability
≥ 0.5), and 57.2% were predicted as inactive. The predicted
activity probabilities were broadly distributed, with a mean of 0.462
and a standard deviation of 0.123, indicating that the model differentiates
levels of predicted activity even among compounds not seen during
training. A negative Pearson correlation coefficient (r = −0.165) was observed between IC50 values and
predicted probabilities, suggesting a mild inverse relationship: compounds
with lower IC50 within the intermediate range tend to receive
higher predicted activity scores. This demonstrates that the model
is capable of generalizing beyond the binary classification boundaries,
providing biologically meaningful predictions for compounds with ambiguous
potency. These insights may be useful for prioritizing borderline
candidates in virtual screening workflows. The full distribution and
prediction results are provided in Figure S1A,B.
The final data set was then split into a training set
and a test set, with 6,803 compounds designated for training (HER2-TRN)
and 1,701 compounds allocated for testing (HER2-IND).
3.2
Exploratory Data Analysis
A chemical
space analysis was conducted to identify trends and patterns between
the active and inactive compounds. To gain a comprehensive understanding
of the chemical space, the distribution of these compounds was visualized
based on their molecular weight (MW) and the logarithm of the octanol–water
partition coefficient (ALog P).
The Rule-of-Five (Ro5) descriptors were employed to compare the physicochemical
properties of active and inactive compounds. The Ro5 criteria are
based on molecular properties, which include MW (<500 Da), ALog P (<5), the number of hydrogen
bond acceptors (nHAcc < 10), and the number of
hydrogen bond donors (nHDon < 5). These criteria
provide a widely accepted framework for assessing the drug-likeness
of small molecules.
The visualization of the chemical space based on
MW and ALog P is illustrated in Figure
below. The analysis
revealed
that 48.42% of compounds clustered within the MW range of 200–500
Da, with an ALog P between 1 and
5. This suggests that these compounds share favorable structural features
for biological activity. However, 77.8% of the compounds do not meet
either the MW (<500 Da) or ALog P (<5) criteria, and 22.20% fail to meet both, indicating opportunities
for optimization to improve their physicochemical properties and potential
as drug candidates. This highlights the complexities of chemical space
and underscores the importance of taking a nuanced approach to identifying
viable drug candidates.
Additionally, the distribution of active and inactive
compounds
concerning the Ro5 descriptors was examined. The analysis revealed
that the compounds generally conform to the Ro5 guidelines, characterized
by a molecular weight below 500 Da, lipophilicity (ALog P) below 5, and fewer than 10 hydrogen bond
donors and acceptors.
Active and inactive compounds show significant
differences (p < 0.001) in molecular weight (MW),
as demonstrated
by statistical analysis using the Mann–Whitney U test. Inactive
compounds generally have a lower MW (394.27 ± 274.10) compared
to active compounds MW (490.51 ± 267.12), which is evident from
the mean values presented in the boxplots in Figure
. When the threshold of the Rule of Five
(Ro5) is assessed for the number of hydrogen bond acceptors (nHBAcc), the quantities of active and inactive compounds
are similar, with the significance arising primarily from their mean
values shown in Figure
A.
Additionally, the A LogP values
indicate a slight yet notable difference between active (4.71 ±
3.41) and inactive (3.92 ± 3.10) compounds. Both groups exhibited
similar mean values for the number of hydrogen bond donors (nHBDon),
which were not statistically significant. A compound’s molecular
complexity is a crucial factor influencing its clinical efficacy.
Factors such as aromaticity, ring count, chiral centers, fused rings,
functional groups, and the count of rotatable bonds contribute to
this complexity. These characteristics
greatly influence essential biological processes, including toxicity,
oral bioavailability, and solubility, ultimately impacting the compound’s
therapeutic potential.
In addition, four descriptors, including
aromatic ratio (ARR),
number of rings (nCIC), number of rotatable bonds
(RBN), and number of benzene-like rings (nBnz), were
analyzed to assess the molecular complexity of the compounds and compare
the active and inactive groups. The box plot of these descriptors
is illustrated in Figure
(B) below. Active compounds demonstrate a lower ARR ratio,
a greater number of rotatable bonds, and more benzene-like rings compared
to inactive compounds; these differences are statistically significant
(p < 0.001).
3.3
Prediction Outcomes across Various Machine
Learning Algorithms and Molecular Descriptors
A total of
110 machine learning (ML) classifiers were developed using a combination
of 10 molecular descriptors. In this section, we compare 11 different
ML algorithms. The performance of these classifiers was evaluated
using both independent test sets and 5-fold cross-validation training.
The heatmaps in Figure
illustrate the results of the training and independent test evaluations.
They show that certain ML models, such as Decision Tree (DT), Extra
Trees (ET), Random Forest (RF), and Multilayer Perceptron (MLP), consistently
achieved high Matthews Correlation Coefficient (MCC) values exceeding
0.985 (Figure
A).
This indicates strong performance across various molecular descriptors.
Among the molecular descriptors evaluated, CDK,
Extended_FP, Circle_FP,
and Hybrid_FP stood out for their exceptional predictive ability across
most of the models assessed (Figure
B). The ML classifier with the highest cross-validation
MCC was identified as the best-performing model. The top five ML classifiers
identified, along with their respective MCC values, were: ET-CDK Graph
(0.819), ET-Circle (0.814), MLP-Circle (0.814), ET-Hybrid (0.813),
and KNN-CDK Extended (0.810).
Based on the cross-validation
results, the ET-CDK Graph was identified
as the best-performing model; this model had an MCC of 0.819, an ACC
of 0.930, and an AUC of 0.70. This model was also identified as one
of the best-performing models for the independent test, achieving
a Matthews Correlation Coefficient (MCC) of 0.819, an accuracy (ACC)
of 0.930, and an area under the curve (AUC) of 0.70, as shown in Table
. In contrast, the
DT-Hybrid, ET-Hybrid, and ET-CDK Extended models each provided the
highest MCC of 0.987, with an ACC of 0.993 and an AUC of 0.999 in
the independent test. While the Decision Tree (DT) model excelled
on the testing set, it significantly underperformed on the training
set. As a result, it was replaced with the Logistic Regression (LR)
model in the construction of the stacking ensemble, despite not being
among the top five classifiers in the training data (Table
). The LR model demonstrated
more consistent performance across both the training and testing sets,
as illustrated in the heatmap (Figure
). The detailed results of this methodologyis found
in Tables
and .
The ET-CDK Graph model emerged as the top performer
during cross-validation
and also ranked among the best on the independent test set. This indicates
that the ET-CDK Graph, which is based on a single feature, demonstrates
stable performance on both the HER2-TRN and HER2-IND data sets. However,
in this study, we concentrated on combining the best-performing models
(ET, LP, MLP, and RF) across all ten descriptors to build a robust
stacking ensemble that includes the best models. Baseline models were
developed by using combined descriptors for each algorithm (ET, LP,
MLP, and RF). The stacking ensemble was then constructed by taking
into account the outputs from these baseline models. Compared with
single-feature-based models, this approach yielded better performance,
resulting in a more reliable and precise prediction framework.
3.4
Performance Evaluation of the Stacking Ensemble
Compared to the Single-Feature-Based Models
To enhance the
stable performance of our predictions, we integrated multiple machine
learning classifiers to create a meta-model using a stacking strategy.
When compared to single feature-based models, this method has been
shown to produce exceptional results. In an attempt to improve performance, much ML research aggregates
the output of an ensemble of models rather than relying just on one
kind of classifier. The ensemble comprised various base models, with
their predictions aggregated by a Logistic Regression model, which
served as the final estimator.
As presented in Table
, the stacking classifier achieved
a Matthews Correlation Coefficient (MCC) of 0.988 for the HER2-TRN
data set and 1.0 for the HER2-IND data set. Notably, the stacking
classifier delivered the highest metrics across all categories on
the training data, including an accuracy of 0.995, an F1 score of
0.994, and an area under the curve (AUC) of 1.0. It also demonstrated
perfect sensitivity (1.0), indicating its capability to accurately
identify all positive instances while maintaining strong specificity
(98.2F%), which measures its effectiveness in avoiding
false positives. On the test set, the stacking classifier achieved
perfect performance, with 100% across all metrics, including accuracy, F1 score, sensitivity, specificity, MCC, and AUC, as seen
in Table
. It was
ensured that overfitting and data leakage were prevented by strictly
isolating the test set (HER2-IND) during model training and feature
selection. Therefore, this exceptional performance suggests that it
performs well with unseen data. Consequently, the results indicate
that the stacking classifier is the most reliable model for the given
data set, achieving perfect performance metrics on the test set (Table
).
Among the individual single-feature-based models,
Random Forest
(RF) and Extra Trees (ET) are the top performers. However, the stacking
approach’s ability to leverage diverse strengths makes it the
optimal choice for predictions.
3.5
Cytotoxic Activity of Ceratonia
siliqua L. Extract on Colorectal Cancer and Normal
Cell Line
The pods (seed and pulp) of Ceratonia
siliqua L. contain a wide range of biologically active
constituents with abundant polyphenols and have been widely recognized
for their chemopreventive and therapeutic properties against cancer, suggesting the potential presence of anticancer
compounds in this plant. The therapeutic capability of Ceratonia siliqua in the treatment of colorectal
cancer is revealed by the binding between polyphenols and dietary
fiber, which not only preserves polyphenols during digestion but also
increases their bioactivity in the gut and colon.
,
In this study, Ceratonia siliqua L.
(C. siliqua) pod extract exhibited
cytotoxic effects against human colorectal cancer (HCT116) cells and
was less toxic to normal monkey kidney epithelial (Vero) cell lines,
as evaluated using the MTT cytotoxicity method. According to the criteria
set by the American National Cancer Institute (NCI), a crude extract
is considered to have significant cytotoxic activity if it demonstrates
an IC50 value below 20 μg/mL. As shown in Table
, the C. siliqua extract exhibited
an IC50 value of 13.32 ± 1.09 μg/mL against
HCT116 cells, placing it within the NCI threshold and indicating significant
anticancer potential. On the other hand, the C. siliqua extracts was found to be less toxic to Vero cells, with an IC50 of 21.39 ± 1.30 μg/mL. The observed results were
compared to the FDA-approved anticancer drug doxorubicin, which served
as a positive control and demonstrated IC50 values of 1.85
± 0.21 μg/mL for HCT116 and 1.61 ± 0.26 μg/mL
for Vero cells. These findings indicate that the C.
siliqua extract selectively exerts a greater inhibitory
effect on cancerous cells than on normal cells.
The dose–response curves (Figure
) provide a comparative analysis
between C. siliqua extract and doxorubicin,
a standard chemotherapeutic
agent. As shown in Figure
, cell viability decreased with increasing concentrations
of both C. siliqua extract and doxorubicin,
demonstrating a clear dose-dependent cytotoxic effect. Notably, doxorubicin
exhibited strong cytotoxicity at lower concentrations compared to
the C. siliqua extract, which exhibited
moderate cytotoxicity against HCT116, indicating significant anticancer
potential.
Therefore, MTT results suggest that C. siliqua extract exerts anticancer activity, likely
involving bioactive phytochemicals
capable of disrupting cancer cell metabolism.
,
These findings are consistent with previous studies that reported
similar dose-dependent antiproliferative effects of the CS extract
on various cancer cell lines. For example, Hsouna et al. (2011) observed
comparable cytotoxic activities of Ceratonia siliqua
L. and other Mediterranean plant extracts against
cancer cells. Likewise, Amessis-Ouchemoukh
et al. (2017) determined the bioactive metabolites involved in the
antioxidant, anticancer, and anticalpain activities of Ceratonia siliqua
L. extracts.
3.6
Identification of Bioactive Compounds in LC-MS/MS
and Virtual Screening
Table
presents the compounds identified through LC-MS/MS
analysis of Ceratonia siliqua
L. pod extract. Compounds present in C. siliqua were identified by comparing the obtained molecular precursor ions
and MS/MS fragmentation patterns from our LC-MS/MS data with those
from publicly available mass spectral libraries and relevant literature.
For each identified compound, the retention time (RT), mass-to-charge
ratio (m/z), molecular formula,
and mass errors between their calculated and measured molecular masses
are summarized in Table
.
The identified compounds were subjected to virtual
screening against
both HER2 and EGFR targets using our custom-built stacking ensemble
model to predict their activity status (active or inactive), along
with their respective probabilities of prediction. The same stacking
ensemble model was also used to evaluate the compounds for predicted
activity against either HER2 or EGFR individually, while also providing
the corresponding probability scores. This screening aimed to assess
their therapeutic potential for the treatment of colorectal cancer.
A total of 21 compounds were identified (11 in positive mode and
10 in negative mode) in the aqueous extract of Ceratonia
siliqua
L using the MS/MS spectra
(Table
). A few important
classes of bioactive compounds, such as flavonoids, phenolics, terpenoids,
alkaloids, and cinnamic acids, were detected in the extract of C. siliqua. Previous studies have demonstrated that,
out of the 21 compounds, 2 flavonoids, 1 phenol, and 5 terpenoids
were detected as major contributors to the antioxidant activity and
pharmacological functions of Ceratonia siliqua
L. The remaining classes of compounds, such as fatty
acids, alkaloids, amino acids, and peptides, are also believed to
partly contribute to the medicinal value of Ceratonia
siliqua based on previous reports.
,
The dual-inhibition screening of standard FDA-approved drugs
(Doxorubicin,
Abemaciclib, Gemcitabine, and Tucatinib) using our stacked ensemble
model provided a comparative benchmark for interpreting the virtual
screening predictions of compounds identified from Ceratonia siliqua L. (Table
). All the reference compounds, except Gemcitabine,
demonstrated strong predicted activity against both EGFR and HER2,
confirming the model’s ability to recognize well-characterized
dual inhibitors. This aligns with their known clinical applications
in various cancers.
When these benchmarks are compared to the predictions
for the LC-MS/MS-derived
phytochemicals, only NCGC00385704–01 demonstrated moderate
dual-target activity (probability = 0.513), suggesting that it could
potentially act as a lead dual EGFR/HER2 inhibitor, albeit with lower
potency compared to FDA-approved drugs. Given that HER2 and EGFR are
overexpressed in up to 85% of colorectal cancer cases, this compound
holds the potential to suppress a substantial proportion of colorectal
cancer tumors.
Moreover, several identified natural compounds
from Ceratonia siliqua
L. displayed strong
predicted inhibition of either EGFR or HER2, but not both. For instance,
chrysin and 1-caffeoyl-β-d-glucose showed HER2 inhibition
with probabilities of 73.5% and 65.0%, respectively. Gossypetin and
phenazine-1-carboxylic acid were active against EGFR with predicted
probabilities of 68.1% and 63.0%, respectively. In addition, these
compounds have previously been reported to inhibit either EGFR or
HER2; however, there are no records of multiple activities against
both targets.
,
These findings are relevant
given that many phytochemicals exhibit
multitargeted anticancer effects that extend beyond EGFR/HER2 inhibition.
For example, chrysin and gossypetin are known to modulate oxidative
stress and inflammatory pathways, such as NF-κB and COX-2, which
are involved in cancer progression. Although
predicted as inactive in the dual inhibition model, their selective
activity toward a single receptor suggests complementary mechanisms
of action that could be synergistic in combination therapies or multitarget
drug design. Therefore, the significant in vitro cytotoxicity
observed in HCT116 cells (IC50 = 13.32 ± 1.09 μg/mL)
may arise from these multifactorial interactions rather than direct
HER2/EGFR blockade.
In addition to polyphenolic constituents
and their subclasses,
several carbohydrate-based compounds (melezitose and sucrose) were
detected, suggesting the presence of dietary fibers and indigestible
oligosaccharides in the extract. Previous reports have shown that
carob fiber present in the Ceratonia siliqua
L. pods is composed
of insoluble polysaccharides that can interact with phenolic compounds,
influencing their bioavailability and the functional activity of polyphenols
in the colon, thereby contributing to gut health and potential chemopreventive
effects against colorectal cancer.
,
3.7
Application of the Stacking Method for the
Virtual Screening of Natural Compounds
A built stacking ensemble
model was utilized to virtually screen and calculate the probabilities
of 1,400 compounds from the LOTUS database, aiming to identify the
most promising candidates with activity against both HER2 and EGFR.
Of these compounds, 58 were found to exhibit activity against both
targets. The top ten compounds most likely to inhibit both HER2 and
EGFR, along with their top three chemical structures, are identified
(Table
and Figure
). Additionally,
molecular docking studies assessed the binding modes and affinities
of the top three compounds in comparison to those of an FDA-approved
cancer drug.
3.8
Molecular Docking
Molecular docking
is a pivotal computational approach for elucidating ligand–receptor
interactions at the atomic level, offering insights into biochemical
mechanisms.
Table
presents the molecular docking results assessing
the activity of compounds in Figure
against both Human Epidermal Growth Factor Receptor
2 (HER2), a member of the ErbB receptor tyrosine kinase family, and
Epidermal Growth Factor Receptor (EGFR). HER2 drives oncogenesis through
hyperactivation of the PI3K/AKT and MAPK/ERK1/2 pathways, often caused
by gene amplification or activating mutations. This aberrant activity,
observed in about 20% of breast cancers, is linked to poor outcomes
in esophagogastric, ovarian, bladder, and colorectal cancers. The binding affinities, expressed as Gibbs free
energy (ΔG°, kcal/mol), confirm the stability
and strength of the interactions between the ligands and target proteins,
with lower values indicating higher affinity. The inhibition constant
(Ki
) was derived from the docking binding
energy (ΔG°) with a correction of +0.01
kcal/mol using Eq.
,
which reflects a drug’s potential to inhibit enzymatic activity.
Lower K
i
values correspond
to higher binding affinity and greater therapeutic potential.
Here, R is the gas
constant (0.001987 kcal K–1 mol–1), and T is
the temperature (298.15 K). The Gibbs free energy (ΔG°, kcal/mol) highlights the equilibrium state and
stability of the protein–ligand complexes. Molecular docking
analysis, a vital tool for predicting atomic-level interactions between
ligands and proteins, provides insights into biochemical processes.
Table
shows the
binding energies and inhibition constants derived from molecular docking
studies of the top four-performing compounds predicted by our machine
learning model and four standard drugs when complexed with the HER2
and EGFR receptors. The docking analysis identified LTS0018035 as
the most potent compound against HER2 (PDB ID: 7MN5), with a binding
energy of −11.2 kcal/mol and a Ki
of 0.0063 μM, outperforming all standard drugs. LTS0131924
and LTS0018034 followed with binding energies of −10.0 kcal/mol
(Ki
= 0.0475 μM) and −8.7
kcal/mol (Ki
= 0.4261 μM), respectively,
while NCGC00385704–01 recorded −7.2 kcal/mol (Ki
= 5.3598 μM). Notably, LTS0018035 surpassed
Tucatinib, the best-performing standard drug, while LTS0131924, LTS0018034,
and NCGC00385704–01 showed marginal variations of (0.5 &
0.1 kcal/mol), (1.8 and 1.2 kcal/mol), and (3.3 and 2.7 kcal/mol),
respectively, from the top-performing standard drugs (Tucatinib and
Abemaciclib) (Table
).
Figure
illustrates
the molecular interactions of the top three compounds with 7MN5, while Figure
shows the molecular
interactions of NCGC00385704–01 and the standard drugs. Among
the studied compounds, LTS0018035, which exhibited the lowest inhibitory
concentration, has the highest number of hydrogen bonds (10) within
the protein’s binding pocket, indicating its stability. LTS0018034
and LTS0131924 followed with 9 and 7 hydrogen bonds, respectively.
Hydrogen bonding is a key determinant of strong protein–ligand
interactions, and it enhances binding affinity and inhibitory potency.
The studied compounds demonstrated more hydrogen bond donors than
the standards, apart from NCGC00385704–01, facilitating their
extensive hydrogen bond interactions. Notably, no interactions other
than hydrogen bonds were observed for the selected compounds, except
for NCGC00385704–01, which has a Pi-alkyl interaction with
PHE 258. Key residues involved in binding within the receptor’s
active site included HIS 248, PRO 257, and ARG 258. Although these
ligands did not directly interact with the receptor’s active
site residues, their binding patterns suggest potential as partial
agonists or allosteric modulators, warranting further pharmacological
evaluation. LTS0018035, with the lowest binding energy, was identified
as the most stable ligand–receptor complex.
For the mutated EGFR (PDB ID: 7ZYM), Tucatinib displayed
the highest binding
affinity (−10.1 kcal/mol, Ki
=
0.0401 μM), followed by Doxorubicin (−9.5 kcal/mol, Ki
= 0.1104 μM) and Abemaciclib (−9.2
kcal/mol, Ki
= 0.1832 μM). Among
the novel ligands, LTS0131924 showed the highest affinity (−8.5
kcal/mol, Ki
= 0.5972 μM), followed
by LTS0018034 and NCGC00385704–01 (both −8.0 kcal/mol, Ki
= 1.3890 μM), and LTS0018035 (−7.8
kcal/mol, Ki
= 1.9467 μM), all outperforming
Gemcitabine (−6.1 kcal/mol, Ki
=
34.3194 μM). Even though the reference standards perform better
than the studied compounds in binding to the EGFR target, 7ZYM does
not (Table
). However,
binding mode analyses (Figures
and ) reveal insights into the selectivity
of the molecules targeting the EGFR-T790M/C797S binding site. The
primary role of this binding site is to facilitate ATP binding and
phosphorylation, driving PI3K/AKT and MAPK/ERK signaling to promote
cell proliferation and survival. Meanwhile,
the T790M mutation increases ATP affinity, reducing the efficacy of
first-generation TKIs, while the C797S mutation prevents covalent
binding of third-generation inhibitors like Osimertinib, necessitating
novel inhibitors to block aberrant signaling. For this selected protein, molecular interactions with catalytic
residues like LYS745, MET790, MET793, SER797, and ASP800 appear critical
for the EGFR mutant selectivity, inducing
conformational changes in the binding pocket that inhibit tumor growth.
Remarkably, Figure
shows that LTS0018035 and LTS0018034 formed hydrogen bonds with
the acidic side chain of ASP800 (2.63 and 3.34 Å, respectively),
and LTS0018034 also interacted with the DFG motif residue ASP855 with
one of its carbonyl attachments.
Furthermore, NCGC00385704–01 engaged the
catalytic LYS745
through its carbonyl oxygen, similarly to known EGFR inhibitors (consistent
with the results from this study, e.g., Doxorubicin and Gemcitabine)
(Figure
). Abemaciclib formed hydrogen bonds with ASP 800
and ASP 855, alongside a π-sigma interaction with LEU718, while
Gemcitabine exhibited an unfavorable donor–donor interaction
with LYS 745, potentially explaining its lower affinity. However,
Tucatinib’s high affinity may be attributed to a hydrogen bond
from its amine bridge with the DFG motif ASP855 side chain. Hydrogen
bonding significantly influenced ligand-binding efficacy, with LTS0018035
forming the highest number of hydrogen bonds (6), followed by LTS0018034
and LTS0131924 (5 each), and NCGC00385704–01 forming 3, comparable
to some Abemaciclib, Doxorubicin, and Tucatinib (Figure
). The sandwich motif between
MET790 and LYS745 suggests strategies for improving inhibitory activity
against T790M-mutated EGFR, particularly in addressing C797S resistance
mutations prevalent in nonsmall cell lung cancer (NSCLC). We are confident that these findings will provide
a foundation for developing reversible inhibitors targeting EGFR resistance
mutations, contributing to the advancement of precision oncology.
3.9
Analysis of ADMET Properties
The
ADMET properties of the three top-hit compounds from virtual screening,
alongside four FDA-approved reference drugs, were examined using the
Swiss ADMET and ADMET Lab 2.0 web servers. The results were interpreted
based on the criteria provided by the ADMET Lab server. The oral bioavailability
of the molecules was evaluated based on Lipinski’s Rule of
Five (molecular weight (MW) ≤ 500 Da, lipophilicity index (log P) ≤ 5, hydrogen bond donors (HBDs) ≤ 5, and
hydrogen bond acceptors (HBAs) ≤ 10). Violation of more than
one of these rules typically indicates poor absorption potential.
The absorption parameters included human intestinal absorption (HIA),
solubility class, and Caco-2 permeability (C2P). Excretion parameters,
such as clearance rate and biological half-life, were also evaluated.
A summary of the pharmacokinetic properties of the compounds is shown
in Table
.
All four virtually screened active compounds (NCGC00385704–01,
LTS0018034, LTS0131923, and LTS0258251) exhibited no signs of toxicity.
They all had low log p values (<3) and high topological
polar surface area (TPSA > 75), meeting the criteria of Pfizer’s
rule, which suggests that compounds with log P >
3 and TPSA <75 are more likely to be toxic. Among the reference
drugs, Abemaciclib and Tucatinib were observed to obey Lipinski’s
rules, exhibited moderate water solubility, and showed high human
intestinal absorption, suggesting good oral bioavailability. In contrast,
Doxorubicin and Gemcitabine violated more than one of Lipinski’s
rules, with high solubility but low predicted intestinal absorption.
All three virtually screened hit compounds (LTS0018034, LTS0131923,
and LTS0258251) also violated multiple Lipinski parameters and were
predicted to have poor solubility and low intestinal absorption. Abemaciclib
and Tucatinib produced acceptable Caco-2 permeability values, which,
alongside their high HIA, indicate improved absorption properties.
Regarding excretion, clearance levels were observed to be higher for
Doxorubicin and moderate for Gemcitabine, Abemaciclib, and Tucatinib.
The three virtual hit compounds exhibited low clearance levels, with
LTS0258251 showing a longer half-life probability, which may be advantageous
for sustained therapeutic effects.
Results and Discussion
3.1
Data Distribution Analysis
Following
the application of the IC50 cutoff and the exclusion of
duplicates based on the Tanimoto coefficient (where the coefficient
equals 1), a total of 17,287 unique compounds were identified. Thereafter,
the average IC50 for all remaining duplicates was calculated,
and the data set was classified accordingly. This process resulted
in 6,325 active compounds and 2,179 inactive compounds, totaling 8,504
inhibitors targeting both the EGFR and HER2 receptors. Although compounds
with IC50 values between 1–10 μM (intermediate
class) were excluded from model training to establish a clear binary
classification boundary, we evaluated the trained stacking model on
these compounds to assess its generalizability. The model was applied
to a set of 1,626 intermediate compounds, and their predicted probabilities
were calculated. Of these, 42.8% were predicted as active (probability
≥ 0.5), and 57.2% were predicted as inactive. The predicted
activity probabilities were broadly distributed, with a mean of 0.462
and a standard deviation of 0.123, indicating that the model differentiates
levels of predicted activity even among compounds not seen during
training. A negative Pearson correlation coefficient (r = −0.165) was observed between IC50 values and
predicted probabilities, suggesting a mild inverse relationship: compounds
with lower IC50 within the intermediate range tend to receive
higher predicted activity scores. This demonstrates that the model
is capable of generalizing beyond the binary classification boundaries,
providing biologically meaningful predictions for compounds with ambiguous
potency. These insights may be useful for prioritizing borderline
candidates in virtual screening workflows. The full distribution and
prediction results are provided in Figure S1A,B.
The final data set was then split into a training set
and a test set, with 6,803 compounds designated for training (HER2-TRN)
and 1,701 compounds allocated for testing (HER2-IND).
3.2
Exploratory Data Analysis
A chemical
space analysis was conducted to identify trends and patterns between
the active and inactive compounds. To gain a comprehensive understanding
of the chemical space, the distribution of these compounds was visualized
based on their molecular weight (MW) and the logarithm of the octanol–water
partition coefficient (ALog P).
The Rule-of-Five (Ro5) descriptors were employed to compare the physicochemical
properties of active and inactive compounds. The Ro5 criteria are
based on molecular properties, which include MW (<500 Da), ALog P (<5), the number of hydrogen
bond acceptors (nHAcc < 10), and the number of
hydrogen bond donors (nHDon < 5). These criteria
provide a widely accepted framework for assessing the drug-likeness
of small molecules.
The visualization of the chemical space based on
MW and ALog P is illustrated in Figure
below. The analysis
revealed
that 48.42% of compounds clustered within the MW range of 200–500
Da, with an ALog P between 1 and
5. This suggests that these compounds share favorable structural features
for biological activity. However, 77.8% of the compounds do not meet
either the MW (<500 Da) or ALog P (<5) criteria, and 22.20% fail to meet both, indicating opportunities
for optimization to improve their physicochemical properties and potential
as drug candidates. This highlights the complexities of chemical space
and underscores the importance of taking a nuanced approach to identifying
viable drug candidates.
Additionally, the distribution of active and inactive
compounds
concerning the Ro5 descriptors was examined. The analysis revealed
that the compounds generally conform to the Ro5 guidelines, characterized
by a molecular weight below 500 Da, lipophilicity (ALog P) below 5, and fewer than 10 hydrogen bond
donors and acceptors.
Active and inactive compounds show significant
differences (p < 0.001) in molecular weight (MW),
as demonstrated
by statistical analysis using the Mann–Whitney U test. Inactive
compounds generally have a lower MW (394.27 ± 274.10) compared
to active compounds MW (490.51 ± 267.12), which is evident from
the mean values presented in the boxplots in Figure
. When the threshold of the Rule of Five
(Ro5) is assessed for the number of hydrogen bond acceptors (nHBAcc), the quantities of active and inactive compounds
are similar, with the significance arising primarily from their mean
values shown in Figure
A.
Additionally, the A LogP values
indicate a slight yet notable difference between active (4.71 ±
3.41) and inactive (3.92 ± 3.10) compounds. Both groups exhibited
similar mean values for the number of hydrogen bond donors (nHBDon),
which were not statistically significant. A compound’s molecular
complexity is a crucial factor influencing its clinical efficacy.
Factors such as aromaticity, ring count, chiral centers, fused rings,
functional groups, and the count of rotatable bonds contribute to
this complexity. These characteristics
greatly influence essential biological processes, including toxicity,
oral bioavailability, and solubility, ultimately impacting the compound’s
therapeutic potential.
In addition, four descriptors, including
aromatic ratio (ARR),
number of rings (nCIC), number of rotatable bonds
(RBN), and number of benzene-like rings (nBnz), were
analyzed to assess the molecular complexity of the compounds and compare
the active and inactive groups. The box plot of these descriptors
is illustrated in Figure
(B) below. Active compounds demonstrate a lower ARR ratio,
a greater number of rotatable bonds, and more benzene-like rings compared
to inactive compounds; these differences are statistically significant
(p < 0.001).
3.3
Prediction Outcomes across Various Machine
Learning Algorithms and Molecular Descriptors
A total of
110 machine learning (ML) classifiers were developed using a combination
of 10 molecular descriptors. In this section, we compare 11 different
ML algorithms. The performance of these classifiers was evaluated
using both independent test sets and 5-fold cross-validation training.
The heatmaps in Figure
illustrate the results of the training and independent test evaluations.
They show that certain ML models, such as Decision Tree (DT), Extra
Trees (ET), Random Forest (RF), and Multilayer Perceptron (MLP), consistently
achieved high Matthews Correlation Coefficient (MCC) values exceeding
0.985 (Figure
A).
This indicates strong performance across various molecular descriptors.
Among the molecular descriptors evaluated, CDK,
Extended_FP, Circle_FP,
and Hybrid_FP stood out for their exceptional predictive ability across
most of the models assessed (Figure
B). The ML classifier with the highest cross-validation
MCC was identified as the best-performing model. The top five ML classifiers
identified, along with their respective MCC values, were: ET-CDK Graph
(0.819), ET-Circle (0.814), MLP-Circle (0.814), ET-Hybrid (0.813),
and KNN-CDK Extended (0.810).
Based on the cross-validation
results, the ET-CDK Graph was identified
as the best-performing model; this model had an MCC of 0.819, an ACC
of 0.930, and an AUC of 0.70. This model was also identified as one
of the best-performing models for the independent test, achieving
a Matthews Correlation Coefficient (MCC) of 0.819, an accuracy (ACC)
of 0.930, and an area under the curve (AUC) of 0.70, as shown in Table
. In contrast, the
DT-Hybrid, ET-Hybrid, and ET-CDK Extended models each provided the
highest MCC of 0.987, with an ACC of 0.993 and an AUC of 0.999 in
the independent test. While the Decision Tree (DT) model excelled
on the testing set, it significantly underperformed on the training
set. As a result, it was replaced with the Logistic Regression (LR)
model in the construction of the stacking ensemble, despite not being
among the top five classifiers in the training data (Table
). The LR model demonstrated
more consistent performance across both the training and testing sets,
as illustrated in the heatmap (Figure
). The detailed results of this methodologyis found
in Tables
and .
The ET-CDK Graph model emerged as the top performer
during cross-validation
and also ranked among the best on the independent test set. This indicates
that the ET-CDK Graph, which is based on a single feature, demonstrates
stable performance on both the HER2-TRN and HER2-IND data sets. However,
in this study, we concentrated on combining the best-performing models
(ET, LP, MLP, and RF) across all ten descriptors to build a robust
stacking ensemble that includes the best models. Baseline models were
developed by using combined descriptors for each algorithm (ET, LP,
MLP, and RF). The stacking ensemble was then constructed by taking
into account the outputs from these baseline models. Compared with
single-feature-based models, this approach yielded better performance,
resulting in a more reliable and precise prediction framework.
3.4
Performance Evaluation of the Stacking Ensemble
Compared to the Single-Feature-Based Models
To enhance the
stable performance of our predictions, we integrated multiple machine
learning classifiers to create a meta-model using a stacking strategy.
When compared to single feature-based models, this method has been
shown to produce exceptional results. In an attempt to improve performance, much ML research aggregates
the output of an ensemble of models rather than relying just on one
kind of classifier. The ensemble comprised various base models, with
their predictions aggregated by a Logistic Regression model, which
served as the final estimator.
As presented in Table
, the stacking classifier achieved
a Matthews Correlation Coefficient (MCC) of 0.988 for the HER2-TRN
data set and 1.0 for the HER2-IND data set. Notably, the stacking
classifier delivered the highest metrics across all categories on
the training data, including an accuracy of 0.995, an F1 score of
0.994, and an area under the curve (AUC) of 1.0. It also demonstrated
perfect sensitivity (1.0), indicating its capability to accurately
identify all positive instances while maintaining strong specificity
(98.2F%), which measures its effectiveness in avoiding
false positives. On the test set, the stacking classifier achieved
perfect performance, with 100% across all metrics, including accuracy, F1 score, sensitivity, specificity, MCC, and AUC, as seen
in Table
. It was
ensured that overfitting and data leakage were prevented by strictly
isolating the test set (HER2-IND) during model training and feature
selection. Therefore, this exceptional performance suggests that it
performs well with unseen data. Consequently, the results indicate
that the stacking classifier is the most reliable model for the given
data set, achieving perfect performance metrics on the test set (Table
).
Among the individual single-feature-based models,
Random Forest
(RF) and Extra Trees (ET) are the top performers. However, the stacking
approach’s ability to leverage diverse strengths makes it the
optimal choice for predictions.
3.5
Cytotoxic Activity of Ceratonia
siliqua L. Extract on Colorectal Cancer and Normal
Cell Line
The pods (seed and pulp) of Ceratonia
siliqua L. contain a wide range of biologically active
constituents with abundant polyphenols and have been widely recognized
for their chemopreventive and therapeutic properties against cancer, suggesting the potential presence of anticancer
compounds in this plant. The therapeutic capability of Ceratonia siliqua in the treatment of colorectal
cancer is revealed by the binding between polyphenols and dietary
fiber, which not only preserves polyphenols during digestion but also
increases their bioactivity in the gut and colon.
,
In this study, Ceratonia siliqua L.
(C. siliqua) pod extract exhibited
cytotoxic effects against human colorectal cancer (HCT116) cells and
was less toxic to normal monkey kidney epithelial (Vero) cell lines,
as evaluated using the MTT cytotoxicity method. According to the criteria
set by the American National Cancer Institute (NCI), a crude extract
is considered to have significant cytotoxic activity if it demonstrates
an IC50 value below 20 μg/mL. As shown in Table
, the C. siliqua extract exhibited
an IC50 value of 13.32 ± 1.09 μg/mL against
HCT116 cells, placing it within the NCI threshold and indicating significant
anticancer potential. On the other hand, the C. siliqua extracts was found to be less toxic to Vero cells, with an IC50 of 21.39 ± 1.30 μg/mL. The observed results were
compared to the FDA-approved anticancer drug doxorubicin, which served
as a positive control and demonstrated IC50 values of 1.85
± 0.21 μg/mL for HCT116 and 1.61 ± 0.26 μg/mL
for Vero cells. These findings indicate that the C.
siliqua extract selectively exerts a greater inhibitory
effect on cancerous cells than on normal cells.
The dose–response curves (Figure
) provide a comparative analysis
between C. siliqua extract and doxorubicin,
a standard chemotherapeutic
agent. As shown in Figure
, cell viability decreased with increasing concentrations
of both C. siliqua extract and doxorubicin,
demonstrating a clear dose-dependent cytotoxic effect. Notably, doxorubicin
exhibited strong cytotoxicity at lower concentrations compared to
the C. siliqua extract, which exhibited
moderate cytotoxicity against HCT116, indicating significant anticancer
potential.
Therefore, MTT results suggest that C. siliqua extract exerts anticancer activity, likely
involving bioactive phytochemicals
capable of disrupting cancer cell metabolism.
,
These findings are consistent with previous studies that reported
similar dose-dependent antiproliferative effects of the CS extract
on various cancer cell lines. For example, Hsouna et al. (2011) observed
comparable cytotoxic activities of Ceratonia siliqua
L. and other Mediterranean plant extracts against
cancer cells. Likewise, Amessis-Ouchemoukh
et al. (2017) determined the bioactive metabolites involved in the
antioxidant, anticancer, and anticalpain activities of Ceratonia siliqua
L. extracts.
3.6
Identification of Bioactive Compounds in LC-MS/MS
and Virtual Screening
Table
presents the compounds identified through LC-MS/MS
analysis of Ceratonia siliqua
L. pod extract. Compounds present in C. siliqua were identified by comparing the obtained molecular precursor ions
and MS/MS fragmentation patterns from our LC-MS/MS data with those
from publicly available mass spectral libraries and relevant literature.
For each identified compound, the retention time (RT), mass-to-charge
ratio (m/z), molecular formula,
and mass errors between their calculated and measured molecular masses
are summarized in Table
.
The identified compounds were subjected to virtual
screening against
both HER2 and EGFR targets using our custom-built stacking ensemble
model to predict their activity status (active or inactive), along
with their respective probabilities of prediction. The same stacking
ensemble model was also used to evaluate the compounds for predicted
activity against either HER2 or EGFR individually, while also providing
the corresponding probability scores. This screening aimed to assess
their therapeutic potential for the treatment of colorectal cancer.
A total of 21 compounds were identified (11 in positive mode and
10 in negative mode) in the aqueous extract of Ceratonia
siliqua
L using the MS/MS spectra
(Table
). A few important
classes of bioactive compounds, such as flavonoids, phenolics, terpenoids,
alkaloids, and cinnamic acids, were detected in the extract of C. siliqua. Previous studies have demonstrated that,
out of the 21 compounds, 2 flavonoids, 1 phenol, and 5 terpenoids
were detected as major contributors to the antioxidant activity and
pharmacological functions of Ceratonia siliqua
L. The remaining classes of compounds, such as fatty
acids, alkaloids, amino acids, and peptides, are also believed to
partly contribute to the medicinal value of Ceratonia
siliqua based on previous reports.
,
The dual-inhibition screening of standard FDA-approved drugs
(Doxorubicin,
Abemaciclib, Gemcitabine, and Tucatinib) using our stacked ensemble
model provided a comparative benchmark for interpreting the virtual
screening predictions of compounds identified from Ceratonia siliqua L. (Table
). All the reference compounds, except Gemcitabine,
demonstrated strong predicted activity against both EGFR and HER2,
confirming the model’s ability to recognize well-characterized
dual inhibitors. This aligns with their known clinical applications
in various cancers.
When these benchmarks are compared to the predictions
for the LC-MS/MS-derived
phytochemicals, only NCGC00385704–01 demonstrated moderate
dual-target activity (probability = 0.513), suggesting that it could
potentially act as a lead dual EGFR/HER2 inhibitor, albeit with lower
potency compared to FDA-approved drugs. Given that HER2 and EGFR are
overexpressed in up to 85% of colorectal cancer cases, this compound
holds the potential to suppress a substantial proportion of colorectal
cancer tumors.
Moreover, several identified natural compounds
from Ceratonia siliqua
L. displayed strong
predicted inhibition of either EGFR or HER2, but not both. For instance,
chrysin and 1-caffeoyl-β-d-glucose showed HER2 inhibition
with probabilities of 73.5% and 65.0%, respectively. Gossypetin and
phenazine-1-carboxylic acid were active against EGFR with predicted
probabilities of 68.1% and 63.0%, respectively. In addition, these
compounds have previously been reported to inhibit either EGFR or
HER2; however, there are no records of multiple activities against
both targets.
,
These findings are relevant
given that many phytochemicals exhibit
multitargeted anticancer effects that extend beyond EGFR/HER2 inhibition.
For example, chrysin and gossypetin are known to modulate oxidative
stress and inflammatory pathways, such as NF-κB and COX-2, which
are involved in cancer progression. Although
predicted as inactive in the dual inhibition model, their selective
activity toward a single receptor suggests complementary mechanisms
of action that could be synergistic in combination therapies or multitarget
drug design. Therefore, the significant in vitro cytotoxicity
observed in HCT116 cells (IC50 = 13.32 ± 1.09 μg/mL)
may arise from these multifactorial interactions rather than direct
HER2/EGFR blockade.
In addition to polyphenolic constituents
and their subclasses,
several carbohydrate-based compounds (melezitose and sucrose) were
detected, suggesting the presence of dietary fibers and indigestible
oligosaccharides in the extract. Previous reports have shown that
carob fiber present in the Ceratonia siliqua
L. pods is composed
of insoluble polysaccharides that can interact with phenolic compounds,
influencing their bioavailability and the functional activity of polyphenols
in the colon, thereby contributing to gut health and potential chemopreventive
effects against colorectal cancer.
,
3.7
Application of the Stacking Method for the
Virtual Screening of Natural Compounds
A built stacking ensemble
model was utilized to virtually screen and calculate the probabilities
of 1,400 compounds from the LOTUS database, aiming to identify the
most promising candidates with activity against both HER2 and EGFR.
Of these compounds, 58 were found to exhibit activity against both
targets. The top ten compounds most likely to inhibit both HER2 and
EGFR, along with their top three chemical structures, are identified
(Table
and Figure
). Additionally,
molecular docking studies assessed the binding modes and affinities
of the top three compounds in comparison to those of an FDA-approved
cancer drug.
3.8
Molecular Docking
Molecular docking
is a pivotal computational approach for elucidating ligand–receptor
interactions at the atomic level, offering insights into biochemical
mechanisms.
Table
presents the molecular docking results assessing
the activity of compounds in Figure
against both Human Epidermal Growth Factor Receptor
2 (HER2), a member of the ErbB receptor tyrosine kinase family, and
Epidermal Growth Factor Receptor (EGFR). HER2 drives oncogenesis through
hyperactivation of the PI3K/AKT and MAPK/ERK1/2 pathways, often caused
by gene amplification or activating mutations. This aberrant activity,
observed in about 20% of breast cancers, is linked to poor outcomes
in esophagogastric, ovarian, bladder, and colorectal cancers. The binding affinities, expressed as Gibbs free
energy (ΔG°, kcal/mol), confirm the stability
and strength of the interactions between the ligands and target proteins,
with lower values indicating higher affinity. The inhibition constant
(Ki
) was derived from the docking binding
energy (ΔG°) with a correction of +0.01
kcal/mol using Eq.
,
which reflects a drug’s potential to inhibit enzymatic activity.
Lower K
i
values correspond
to higher binding affinity and greater therapeutic potential.
Here, R is the gas
constant (0.001987 kcal K–1 mol–1), and T is
the temperature (298.15 K). The Gibbs free energy (ΔG°, kcal/mol) highlights the equilibrium state and
stability of the protein–ligand complexes. Molecular docking
analysis, a vital tool for predicting atomic-level interactions between
ligands and proteins, provides insights into biochemical processes.
Table
shows the
binding energies and inhibition constants derived from molecular docking
studies of the top four-performing compounds predicted by our machine
learning model and four standard drugs when complexed with the HER2
and EGFR receptors. The docking analysis identified LTS0018035 as
the most potent compound against HER2 (PDB ID: 7MN5), with a binding
energy of −11.2 kcal/mol and a Ki
of 0.0063 μM, outperforming all standard drugs. LTS0131924
and LTS0018034 followed with binding energies of −10.0 kcal/mol
(Ki
= 0.0475 μM) and −8.7
kcal/mol (Ki
= 0.4261 μM), respectively,
while NCGC00385704–01 recorded −7.2 kcal/mol (Ki
= 5.3598 μM). Notably, LTS0018035 surpassed
Tucatinib, the best-performing standard drug, while LTS0131924, LTS0018034,
and NCGC00385704–01 showed marginal variations of (0.5 &
0.1 kcal/mol), (1.8 and 1.2 kcal/mol), and (3.3 and 2.7 kcal/mol),
respectively, from the top-performing standard drugs (Tucatinib and
Abemaciclib) (Table
).
Figure
illustrates
the molecular interactions of the top three compounds with 7MN5, while Figure
shows the molecular
interactions of NCGC00385704–01 and the standard drugs. Among
the studied compounds, LTS0018035, which exhibited the lowest inhibitory
concentration, has the highest number of hydrogen bonds (10) within
the protein’s binding pocket, indicating its stability. LTS0018034
and LTS0131924 followed with 9 and 7 hydrogen bonds, respectively.
Hydrogen bonding is a key determinant of strong protein–ligand
interactions, and it enhances binding affinity and inhibitory potency.
The studied compounds demonstrated more hydrogen bond donors than
the standards, apart from NCGC00385704–01, facilitating their
extensive hydrogen bond interactions. Notably, no interactions other
than hydrogen bonds were observed for the selected compounds, except
for NCGC00385704–01, which has a Pi-alkyl interaction with
PHE 258. Key residues involved in binding within the receptor’s
active site included HIS 248, PRO 257, and ARG 258. Although these
ligands did not directly interact with the receptor’s active
site residues, their binding patterns suggest potential as partial
agonists or allosteric modulators, warranting further pharmacological
evaluation. LTS0018035, with the lowest binding energy, was identified
as the most stable ligand–receptor complex.
For the mutated EGFR (PDB ID: 7ZYM), Tucatinib displayed
the highest binding
affinity (−10.1 kcal/mol, Ki
=
0.0401 μM), followed by Doxorubicin (−9.5 kcal/mol, Ki
= 0.1104 μM) and Abemaciclib (−9.2
kcal/mol, Ki
= 0.1832 μM). Among
the novel ligands, LTS0131924 showed the highest affinity (−8.5
kcal/mol, Ki
= 0.5972 μM), followed
by LTS0018034 and NCGC00385704–01 (both −8.0 kcal/mol, Ki
= 1.3890 μM), and LTS0018035 (−7.8
kcal/mol, Ki
= 1.9467 μM), all outperforming
Gemcitabine (−6.1 kcal/mol, Ki
=
34.3194 μM). Even though the reference standards perform better
than the studied compounds in binding to the EGFR target, 7ZYM does
not (Table
). However,
binding mode analyses (Figures
and ) reveal insights into the selectivity
of the molecules targeting the EGFR-T790M/C797S binding site. The
primary role of this binding site is to facilitate ATP binding and
phosphorylation, driving PI3K/AKT and MAPK/ERK signaling to promote
cell proliferation and survival. Meanwhile,
the T790M mutation increases ATP affinity, reducing the efficacy of
first-generation TKIs, while the C797S mutation prevents covalent
binding of third-generation inhibitors like Osimertinib, necessitating
novel inhibitors to block aberrant signaling. For this selected protein, molecular interactions with catalytic
residues like LYS745, MET790, MET793, SER797, and ASP800 appear critical
for the EGFR mutant selectivity, inducing
conformational changes in the binding pocket that inhibit tumor growth.
Remarkably, Figure
shows that LTS0018035 and LTS0018034 formed hydrogen bonds with
the acidic side chain of ASP800 (2.63 and 3.34 Å, respectively),
and LTS0018034 also interacted with the DFG motif residue ASP855 with
one of its carbonyl attachments.
Furthermore, NCGC00385704–01 engaged the
catalytic LYS745
through its carbonyl oxygen, similarly to known EGFR inhibitors (consistent
with the results from this study, e.g., Doxorubicin and Gemcitabine)
(Figure
). Abemaciclib formed hydrogen bonds with ASP 800
and ASP 855, alongside a π-sigma interaction with LEU718, while
Gemcitabine exhibited an unfavorable donor–donor interaction
with LYS 745, potentially explaining its lower affinity. However,
Tucatinib’s high affinity may be attributed to a hydrogen bond
from its amine bridge with the DFG motif ASP855 side chain. Hydrogen
bonding significantly influenced ligand-binding efficacy, with LTS0018035
forming the highest number of hydrogen bonds (6), followed by LTS0018034
and LTS0131924 (5 each), and NCGC00385704–01 forming 3, comparable
to some Abemaciclib, Doxorubicin, and Tucatinib (Figure
). The sandwich motif between
MET790 and LYS745 suggests strategies for improving inhibitory activity
against T790M-mutated EGFR, particularly in addressing C797S resistance
mutations prevalent in nonsmall cell lung cancer (NSCLC). We are confident that these findings will provide
a foundation for developing reversible inhibitors targeting EGFR resistance
mutations, contributing to the advancement of precision oncology.
3.9
Analysis of ADMET Properties
The
ADMET properties of the three top-hit compounds from virtual screening,
alongside four FDA-approved reference drugs, were examined using the
Swiss ADMET and ADMET Lab 2.0 web servers. The results were interpreted
based on the criteria provided by the ADMET Lab server. The oral bioavailability
of the molecules was evaluated based on Lipinski’s Rule of
Five (molecular weight (MW) ≤ 500 Da, lipophilicity index (log P) ≤ 5, hydrogen bond donors (HBDs) ≤ 5, and
hydrogen bond acceptors (HBAs) ≤ 10). Violation of more than
one of these rules typically indicates poor absorption potential.
The absorption parameters included human intestinal absorption (HIA),
solubility class, and Caco-2 permeability (C2P). Excretion parameters,
such as clearance rate and biological half-life, were also evaluated.
A summary of the pharmacokinetic properties of the compounds is shown
in Table
.
All four virtually screened active compounds (NCGC00385704–01,
LTS0018034, LTS0131923, and LTS0258251) exhibited no signs of toxicity.
They all had low log p values (<3) and high topological
polar surface area (TPSA > 75), meeting the criteria of Pfizer’s
rule, which suggests that compounds with log P >
3 and TPSA <75 are more likely to be toxic. Among the reference
drugs, Abemaciclib and Tucatinib were observed to obey Lipinski’s
rules, exhibited moderate water solubility, and showed high human
intestinal absorption, suggesting good oral bioavailability. In contrast,
Doxorubicin and Gemcitabine violated more than one of Lipinski’s
rules, with high solubility but low predicted intestinal absorption.
All three virtually screened hit compounds (LTS0018034, LTS0131923,
and LTS0258251) also violated multiple Lipinski parameters and were
predicted to have poor solubility and low intestinal absorption. Abemaciclib
and Tucatinib produced acceptable Caco-2 permeability values, which,
alongside their high HIA, indicate improved absorption properties.
Regarding excretion, clearance levels were observed to be higher for
Doxorubicin and moderate for Gemcitabine, Abemaciclib, and Tucatinib.
The three virtual hit compounds exhibited low clearance levels, with
LTS0258251 showing a longer half-life probability, which may be advantageous
for sustained therapeutic effects.
Conclusion
4
Conclusion
This study developed a novel
stacking ensemble model to differentiate
between active and inactive natural compounds targeting EGFR and HER2,
addressing the limitations of single-target therapies for colorectal
cancer. Using comprehensive data sets from the ChEMBL database and
the Lotus Natural Product database, machine learning models were optimized
to construct a robust classifier for dual receptor inhibition. This
approach bridges the gap in current therapeutic strategies, offering
a potent solution to overcome resistance mechanisms in CRC treatment.
The stacking ensemble model demonstrated outstanding predictive
performance, achieving 99.5% accuracy on the training data and 100%
accuracy, F1 score, sensitivity, specificity, Matthew’s
correlation coefficient (MCC), and area under the curve (AUC) on the
test set. By integrating outputs from 40 baseline models, the ensemble
method significantly outperformed individual models like the Multilayer
Perceptron (MLP) and Extra Trees (ET), which achieved 92.9% and 92.6%
accuracy, respectively. This superior performance highlights the effectiveness
of ensemble methods in predicting dual inhibitors for critical biological
targets.
Molecular docking analyses further validated the model’s
predictions. Among the compounds evaluated, LTS0018035 emerged as
the most potent inhibitor of HER2, with a binding energy of –
11.2 kcal/mol and an inhibition constant of 0.00626 μM. Its
interaction profile, characterized by the formation of 10 hydrogen
bonds and stabilization of key residues in the HER2 binding site,
underscores its therapeutic potential. While standard drugs such as
Tucatinib and Abemaciclib performed well against secondary HER2 targets,
the natural compounds identified in this study exhibited competitive
binding profiles, particularly for primary HER2 targets.
Importantly,
the in vitro MTT assay results validated
the model’s predictions, with the Ceratonia
siliqua extract showing selective cytotoxicity against
HCT116 cells and aligning with the virtual screening identification
of active constituents, including NCGC00385704–01.
The
dual inhibition strategy discussed in this research significantly
improves therapeutic effectiveness while laying the groundwork for
personalized cancer treatment approaches. By leveraging advanced computational
techniques alongside the chemical diversity inherent in natural products,
this study paves the way for the development of more efficient and
sustainable therapies for colorectal cancer. The receptor-focused
mechanism identified in the findings offers a credible explanation
for why the plant extract demonstrates reduced toxicity toward normal
cells in comparison to doxorubicin, which is known for its widespread
DNA-damaging effects. However, due to the chemical complexity of plant
extracts, it will be important to consider additional mechanisms,
such as apoptosis induction, oxidative stress, or cell cycle arrest,
that may also contribute to their efficacy.
Conclusion
This study developed a novel
stacking ensemble model to differentiate
between active and inactive natural compounds targeting EGFR and HER2,
addressing the limitations of single-target therapies for colorectal
cancer. Using comprehensive data sets from the ChEMBL database and
the Lotus Natural Product database, machine learning models were optimized
to construct a robust classifier for dual receptor inhibition. This
approach bridges the gap in current therapeutic strategies, offering
a potent solution to overcome resistance mechanisms in CRC treatment.
The stacking ensemble model demonstrated outstanding predictive
performance, achieving 99.5% accuracy on the training data and 100%
accuracy, F1 score, sensitivity, specificity, Matthew’s
correlation coefficient (MCC), and area under the curve (AUC) on the
test set. By integrating outputs from 40 baseline models, the ensemble
method significantly outperformed individual models like the Multilayer
Perceptron (MLP) and Extra Trees (ET), which achieved 92.9% and 92.6%
accuracy, respectively. This superior performance highlights the effectiveness
of ensemble methods in predicting dual inhibitors for critical biological
targets.
Molecular docking analyses further validated the model’s
predictions. Among the compounds evaluated, LTS0018035 emerged as
the most potent inhibitor of HER2, with a binding energy of –
11.2 kcal/mol and an inhibition constant of 0.00626 μM. Its
interaction profile, characterized by the formation of 10 hydrogen
bonds and stabilization of key residues in the HER2 binding site,
underscores its therapeutic potential. While standard drugs such as
Tucatinib and Abemaciclib performed well against secondary HER2 targets,
the natural compounds identified in this study exhibited competitive
binding profiles, particularly for primary HER2 targets.
Importantly,
the in vitro MTT assay results validated
the model’s predictions, with the Ceratonia
siliqua extract showing selective cytotoxicity against
HCT116 cells and aligning with the virtual screening identification
of active constituents, including NCGC00385704–01.
The
dual inhibition strategy discussed in this research significantly
improves therapeutic effectiveness while laying the groundwork for
personalized cancer treatment approaches. By leveraging advanced computational
techniques alongside the chemical diversity inherent in natural products,
this study paves the way for the development of more efficient and
sustainable therapies for colorectal cancer. The receptor-focused
mechanism identified in the findings offers a credible explanation
for why the plant extract demonstrates reduced toxicity toward normal
cells in comparison to doxorubicin, which is known for its widespread
DNA-damaging effects. However, due to the chemical complexity of plant
extracts, it will be important to consider additional mechanisms,
such as apoptosis induction, oxidative stress, or cell cycle arrest,
that may also contribute to their efficacy.
Supplementary Material
Supplementary Material
출처: PubMed Central (JATS). 라이선스는 원 publisher 정책을 따릅니다 — 인용 시 원문을 표기해 주세요.