ClusterNet: Classifying Single-Molecule Localization Microscopy Datasets with Graph-Based Deep Learning of Supracluster Structure.
APA
Umney, O., Slaney, H., et al. (2025). ClusterNet: Classifying Single-Molecule Localization Microscopy Datasets with Graph-Based Deep Learning of Supracluster Structure. Small Science, 5(12), e202500255. https://doi.org/10.1002/smsc.202500255
PMID
41395524
Abstract
Single-molecule localization microscopy (SMLM) data can reveal differences in protein organization between different disease types or samples. Classification of samples is an important task that allows for automated recognition and grouping of data by sample type for downstream analysis. However, methods for classifying structures larger than single clusters of localizations in SMLM point-cloud datasets are not well developed. A graph-based deep learning pipeline is presented for classification of SMLM point-cloud data over a field of view of any size. The pipeline combines features of individual clusters (calculated from their constituent localizations) with the structure formed by the positions of multiple clusters (supracluster structure). This method outperforms previous classification results on a model open-source DNA-PAINT dataset, with 99% accuracy. It is also applied to a challenging new SMLM dataset from colorectal cancer tissue. The explainability tools Uniform Manifold Approximation and Projection (UMAP) and SubgraphX allow exploration of the influence of spatial features and structures on classification results, and demonstrate the importance of supracluster structure in classification.
1
Introduction
Single‐molecule localization microscopy (SMLM) is widely used for determining protein organization at the nanoscale and has opened up new possibilities for understanding disease and other biological conditions. Unlike conventional imaging techniques, it generates a list of high‐precision coordinates (point cloud) of proteins or other labeled targets in a sample. As SMLM has become more accessible and used for a wider variety of problems, many tools have been developed to analyze these point clouds.[1, 2]
Sample classification is an important step in analysis, allowing for automated recognition of sample type and downstream aggregation and analysis of data from many samples of the same type. Using deep learning (DL) algorithms for this task may also facilitate biological discovery, despite only having sample‐level labels (weakly supervised).[3] While DL algorithms have classified SMLM data of complex structures such as whole cells, they have first rendered the data as a pixelated image, sacrificing the full potential of the precision gain of SMLM over conventional imaging, and hence the information available.[4, 5]
Existing pipelines for classifying SMLM point‐cloud datasets do not extend to classification of structures larger than single clusters of localizations. Current methods focus on classifying individual localizations and clusters, using either predetermined (handcrafted) features of clusters, such as cluster area and length,[6, 7, 8, 9, 10, 11, 12, 13] or more abstract cluster features learnt automatically by a point‐ or graph‐based DL network.[14, 15, 16] However, the arrangement of multiple clusters in a sample (the supracluster structure), as well as the combination of the features of the different clusters, is likely to hold important biological information.
Extending algorithms that use handcrafted cluster features to complex localization patterns such as multiple clusters is not straightforward, requiring new calculations for features that would discriminate between them. Point‐ and graph‐based DL pipelines have also been limited to classifying localizations and clusters, and ignore the sample‐level supracluster structure. Particle averaging has been used to classify more complex patterns than handcrafted features have typically described, but is still restricted to classification of single particles with highly consistent structure.[17, 18] New methods are therefore required for the classification of variable and complex structures in SMLM data, including whole fields of view (FOVs).
Here, we present a DL classification pipeline for SMLM point‐cloud datasets that incorporates supracluster structure. We have developed ClusterNet, a graph‐based DL network, to classify graphs constructed from clusters in the localization data. The clusters in each graph are each represented by discriminative features extracted from their constituent localizations, alongside the cluster coordinates, thereby retaining the original precision of the localizations.
We demonstrate this new pipeline on a model, open‐source SMLM dataset of DNA‐PAINT localizations from DNA origami nanostructures designed to resemble a selection of Digits and Letters.[19] We present implementations of ClusterNet using both handcrafted features of individual clusters and features learnt with a neural network, achieving balanced classification accuracy of 99% across the dataset, improving on previous workflows.[4, 18, 19] Further, we include and demonstrate the use of DL explainability algorithms to interpret the output of ClusterNet, for a data‐driven exploration of the results.[5] Finally, we apply ClusterNet to a challenging new SMLM dataset from colorectal cancer tissue and compare performance using different data preprocessing algorithms and parameters. Localizations in SMLM FOVs of any spatial extent and complexity may be clustered and used as input to ClusterNet, allowing sample‐level classification from the entire point cloud.
2
Results
2.1
Classification Pipeline and Performance
To classify 2D SMLM datasets using only the xy localization coordinates, we developed a pipeline that includes the supracluster structure using graph‐based DL models, and explores the important features and substructure used in classification via Uniform Manifold Approximation and Projection (UMAP) and SubgraphX.[20, 21] We tested two novel classification models, ClusterNet‐handcrafted cluster features (HCF) and ClusterNet‐learned cluster features (LCF), which combine per‐cluster features of clustered localizations with the spatial arrangement of those clusters in a graph neural network (Figure 1). ClusterNet‐HCF passes handcrafted per‐cluster features, for example, area and perimeter, through a graph neural network, ClusterNet. ClusterNet‐LCF uses an additional point‐based DL module, LocNet, to embed per‐cluster features for subsequent classification with ClusterNet, in a single trainable network.
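The core idea, representing each ROI as a graph of clusters and letting information flow between neighboring clusters before pooling to a whole-graph feature, can be sketched minimally in numpy. This is an illustrative stand-in, not the published architecture: the k-nearest-neighbor edge rule and the single mean-aggregation step are assumptions (ClusterNet itself uses trained graph layers), and `knn_edges`/`message_pass` are hypothetical helpers.

```python
import numpy as np

def knn_edges(centroids, k=2):
    """Connect each cluster to its k nearest neighbours (assumed edge rule)."""
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]
    return [(i, j) for i in range(len(centroids)) for j in nbrs[i]]

def message_pass(features, edges):
    """One mean-aggregation step: each node blends in its neighbours' features."""
    out = features.copy()
    for i in range(len(features)):
        nb = [features[j] for (a, j) in edges if a == i]
        if nb:
            out[i] = (features[i] + np.mean(nb, axis=0)) / 2
    return out

# Toy ROI: 4 clusters, each with [area, perimeter]-style handcrafted features.
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
feats = np.array([[1.0, 4.0], [2.0, 6.0], [1.5, 5.0], [3.0, 8.0]])

edges = knn_edges(centroids, k=2)
embedded = message_pass(feats, edges)
graph_feature = embedded.mean(axis=0)  # global pooling -> whole-graph feature
```

Global pooling of the node embeddings yields a single whole-graph feature vector for sample-level classification, mirroring the per-cluster to whole-graph progression analyzed with UMAP in Section 2.2.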
We tested our pipeline on regions of interest (ROIs) from an existing DNA‐PAINT dataset acquired from DNA origami Digits and Letters (Figure 1a).[19] We chose this dataset for the following reasons. It is large and contains point‐cloud data with the spatial coordinates and an estimate of their precision for each single‐molecule localization. It also has readily available ground truth (GT) labels, making it well‐suited to training and validating a DL model. Finally, our pipeline could be compared against previous methods used to classify this dataset. The dataset additionally has features characteristic of most SMLM datasets, such as clustered localizations arranged in a nonrandom pattern and a wide range in the number of localizations and the localization precision between ROIs and classes (Table S1, Figure S1, Supporting Information). As found for more complex biological datasets, not all the localizations in each ROI contribute to the structure being analyzed or are as relevant to the GT label (e.g., background localizations or localizations far from any binding site).
Both ClusterNet‐HCF and ClusterNet‐LCF classified the data from the seven DNA origami structures in the dataset successfully. k‐means clustering was used to preprocess the localization data from the Digits and Letters (Methods), although any clustering algorithm could be applied. Both models achieved the maximum value of 1.00 for the area under the receiver operating characteristic curve (AUROC) for every class in the reserved test set. ClusterNet‐HCF outperformed ClusterNet‐LCF on the reserved test set (Table 1) and on the training, validation, and test folds (Tables S2–S4, Supporting Information).
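As a concrete sketch of this preprocessing step, the following clusters synthetic localizations with scikit-learn's k-means and then computes simple per-cluster features. The synthetic data and the specific features chosen here (localization count, radius of gyration) are illustrative assumptions only; the paper's handcrafted feature set (e.g., area, perimeter) is richer.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic ROI: localizations scattered around 3 binding sites (stand-in data).
sites = np.array([[0, 0], [50, 0], [25, 40]], dtype=float)
locs = np.vstack([s + rng.normal(0, 5, size=(100, 2)) for s in sites])

# Preprocess with k-means; k is a tuning parameter (cf. k = 12 vs k = 96 below).
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(locs)

# Per-cluster features: centroid, localization count, radius of gyration.
features = []
for c in range(k):
    pts = locs[labels == c]
    centroid = pts.mean(axis=0)
    rgyr = np.sqrt(((pts - centroid) ** 2).sum(axis=1).mean())
    features.append([*centroid, len(pts), rgyr])
features = np.array(features)  # one feature row per cluster -> graph nodes
```

Each feature row becomes one node of the cluster graph, so any clustering algorithm that partitions the localizations can feed the same downstream network.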
ClusterNet‐HCF and ClusterNet‐LCF both outperformed previous results on the same dataset (Table 2). In addition, the accuracy for previous methods (nanoTRON and point cloud registration) was calculated for classification within subsets (Digits/Grid or Letters) of the seven classes of DNA origami structure, whereas the accuracy of ClusterNet was from classification over all seven classes, which is a harder task.[4, 18] The previous methods would therefore be expected to give lower accuracies than those reported in Table 2 if tested on the same task as ClusterNet.
Classification performance (recall) was greater for the Digits and Grid than for the Letters (Table 1). This was expected, as there were misfolded DNA origami structures in the Letters dataset that did not resemble the intended Letter.[18] This had less impact on the Digits and Grid, as the authors of the published dataset excluded some of the misfolded Digits and Grids during particle picking.[18] There may be an additional contribution from the imbalance in the validation dataset used during training (fewer Letters than Digits and Grid), despite attempts to mitigate this with weighted random sampling of the training set based on the prevalence of each class.
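Inverse-prevalence sampling of the kind described can be sketched as follows; the exact scheme used for training (e.g., via a framework's weighted random sampler) may differ in detail.

```python
import numpy as np

# Per-sample weights inversely proportional to class prevalence (sketch).
labels = np.array([0, 0, 0, 0, 1, 1, 2])   # imbalanced toy class labels
counts = np.bincount(labels)               # samples per class: [4, 2, 1]
weights = 1.0 / counts[labels]             # rarer class -> larger weight
probs = weights / weights.sum()            # normalise to a distribution

# Drawing training samples with these probabilities rebalances the classes.
rng = np.random.default_rng(0)
resampled = rng.choice(len(labels), size=7000, p=probs)
```

With these weights, every class contributes the same expected share of draws (here one third each), regardless of how many samples it has.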
2.2
Feature Analysis via UMAP
The relative discriminative power of the handcrafted and deep features for the clusters was measured by comparing the separation of the clusters for each class in a 2D representation of the feature space generated using UMAP.[20] Isolated per‐cluster features, with no incorporation of supracluster structure, were not able to separate the classes, except for the Digit 3 (Figure 2a,d). However, the handcrafted features did slightly separate the Letters T, O, and L from Digits 1 and 2 and the Grid (Figure 2a). This reflected the differences in the DNA origami structures, where the Digits have continuous structures with no spaces between their binding sites, while the Letters have discrete structures, with groups of binding sites well‐separated from each other.[18] The deep cluster features generated by LocNet did not separate the Digits and Letters but better separated the Grid, suggesting that they captured different aspects of the input data from those captured by the handcrafted features (Figure 2d). Digit 3 may have been separable at the per‐cluster level because it had a significantly different localization density from the other classes (Figure S1 and Table S1, Supporting Information). However, the remaining classes could not be separated based on differences in localization density alone, as they had a similar average number of localizations per ROI (Figure S1 and Table S1, Supporting Information).
The cluster features generated by ClusterNet significantly improved separation into each class, showing the importance of considering the supracluster structure via information from neighboring clusters (Figure 2b,e). Similar to the handcrafted and LocNet cluster features, the Grid and Digit 3 classes were the most clearly separated for both models. The remaining Digits, 1 and 2, were separated from the Letters, although there was significant overlap within these two groups.
The whole‐graph features further improved the separation of the classes (Figure 2c,f). This showed the importance of moving from per‐cluster to whole‐graph features, highlighted by Digits 1 and 2 which changed from overlapping to well‐separated for ClusterNet‐HCF (Figure 2b,c). ClusterNet‐HCF had the most compact and well‐separated representations, reflecting its classification performance (Figure 2c). The Letters and Digits 1 and 2 were still assigned into two distinct groups.
2.3
Structure Analysis via SubgraphX
Different classes may be distinguished by the arrangement of some or all of their clusters (supracluster structure). We tested a method for identifying these structures by finding the parts of a cluster graph that are important for the classification of an SMLM ROI. SubgraphX searches for the most important subgraph for the graph classification,[21] in this case a subset of clusters and their supracluster structure. We analyzed the classification results from ClusterNet‐HCF (the best‐performing model).
The 2D representation of the whole‐graph feature space (Figure 2c) allowed us to choose ROIs closest to and furthest from the rest of their class members, measured by the distance to the centroid of the class features (e.g., Figure 3a: ROI1, ROI2). After graph representation and classification (Figure 3b,d), SubgraphX returned the most important subgraph that it found for the classification (Figure 3c,e; for parameters, see Experimental Section in Supporting Information). Positive and negative fidelity metrics (Fid+, Fid−) measured the necessity and sufficiency, respectively, of the subgraph for the classification (Methods; best performance: Fid+ = 1, Fid− = 0; worst performance: Fid+ = 0 and Fid− = 1).[22]
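The fidelity metrics can be made concrete with a toy stand-in classifier over node sets; the real quantities are computed from ClusterNet's class probabilities with and without the explanatory subgraph (see ref. [22] for the exact definitions).

```python
def mask_out(nodes, sub):
    """Graph with the candidate subgraph removed."""
    return nodes - sub

def keep_only(nodes, sub):
    """Graph reduced to the candidate subgraph alone."""
    return nodes & sub

def toy_model(nodes, important=frozenset({1, 2})):
    """Stand-in classifier: confidence = fraction of 'important' nodes present."""
    return len(nodes & important) / len(important)

def fidelity(nodes, sub):
    p_full = toy_model(nodes)
    fid_pos = p_full - toy_model(mask_out(nodes, sub))   # necessity
    fid_neg = p_full - toy_model(keep_only(nodes, sub))  # sufficiency
    return fid_pos, fid_neg

graph = {0, 1, 2, 3, 4}                  # toy cluster graph (node ids)
fid_pos, fid_neg = fidelity(graph, {1, 2})  # candidate explanatory subgraph
```

Here removing the subgraph destroys the prediction (Fid+ = 1) while the subgraph alone reproduces it (Fid− = 0), i.e., the best-performance case described above.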
An exemplary graph closest to its class members (Figure 3a: ROI1) was from a clear DNA‐PAINT example of a Digit 3, correctly predicted (Figure 3b). The subgraph identified by SubgraphX was both necessary and sufficient for the classification (Fid+: 1.0, Fid−: 2.6 × 10−5) and reflected the expected Digit 3 shape (Figure 3c). An example furthest from its class members (Figure 3a: ROI2) may have been a misfolded structure, as there appear to be only three well‐separated groups of localizations, whereas the Letter O (the GT label) should have four (Figure 3d and Figure 1a).[18] It was incorrectly predicted as the Letter T, which was reflected in its location near the other Letter Ts in the UMAP representation (Figure 3a,d). The subgraph extracted by SubgraphX was important for its classification (Fid+: 0.96, Fid−: 2.5 × 10−2) and resembled the Letter T structure (Figure 3e and Figure 1a). In general, the important subgraph for incorrectly classified ROIs did not appear to reflect the structure expected of the correct class (Figures S2 and S3, Supporting Information), and a closer resemblance to the incorrectly predicted class could sometimes be discerned (Figure S2h,l,m,n and Figure S3b, Supporting Information). This suggests that this approach could allow us to begin to identify specific patterns in the supracluster structure that lead to classification results in SMLM datasets, including misclassifications (further results in Figure S4, Table S7, and Table S8, Supporting Information).
We also note that some clusters appeared outside of the designed DNA origami patterns, arising from imperfections in the original sample and raw data acquisition, influencing classification with ClusterNet (Figure S2a,e,h and Figure S3c, Supporting Information). SubgraphX showed that these spurious clusters were sometimes considered important in classification (Figure S2a,h, Supporting Information), also explaining some incorrect results (Figure S2h, Supporting Information).
2.4
Classification of Direct Stochastic Optical Reconstruction Microscopy (dSTORM) Biological Dataset
To determine whether this pipeline can be used to classify more complex biological datasets and other SMLM techniques, we tested its ability to classify a dataset of dSTORM localizations for epiregulin (EREG) in tissue sections. The dataset was obtained by imaging cells (n = 163) stained for EREG in tissue microarrays (TMAs) from colorectal cancer patients (n = 23) (Figure 4, Figure S5, and Table S9, Supporting Information).[23, 24] The TMAs were obtained from patients classified by their response (no‐response vs. any‐response) to anti‐epidermal growth factor receptor (EGFR) treatment. This dataset is more challenging to classify because the localizations have a lower signal‐to‐noise ratio than the Digits and Letters dataset, and the classes are not visually identifiable from reconstructed images of the cells.
The ClusterNet‐HCF pipeline successfully classified the cells. After changing the number of clusters identified in preprocessing from k = 12 to k = 96, the AUROC reached 0.70 ± 0.15 and balanced accuracy was 0.64 ± 0.14. This suggests that the spatial organization of the ligand EREG may help predict response to anti‐EGFR treatment.
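Both reported metrics are standard and available in scikit-learn; the labels and scores below are invented purely to show the calls (no-response = 0, any-response = 1), not values from the study.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy binary labels and classifier scores for one hypothetical test fold.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9, 0.5])

# AUROC is threshold-free; balanced accuracy averages per-class recall,
# which matters here because the response classes are imbalanced.
auroc = roc_auc_score(y_true, scores)
bal_acc = balanced_accuracy_score(y_true, (scores > 0.5).astype(int))
```

In the paper these metrics are reported as mean ± variation across test folds, giving the 0.70 ± 0.15 AUROC and 0.64 ± 0.14 balanced accuracy quoted above.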
We also investigated the effect of the clustering algorithm and parameters in preprocessing on classification results for this dataset. k‐means and density‐based spatial clustering of applications with noise (DBSCAN) were compared over a wide range of parameters (Table S10, Supporting Information). For k‐means, both AUROC and balanced accuracy improved when k = 12 was increased to k = 96, showing the importance of fine‐tuning the graph representation for each task. DBSCAN, with ε = 50 nm and minPts = 3, achieved the same AUROC as k‐means with k = 96. However, the variance in AUROC was greater between the test folds (greater uncertainty in performance on unseen test data) and balanced accuracy was decreased. We also found that k‐means was more robust to changes in parameters than DBSCAN, which had large differences in results for small changes to ε and minPts. Of note, removing some of the preprocessing steps such as filtering localizations and cells made little difference to results (see Supporting Information).
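The two clustering options can be contrasted on synthetic localizations. The coordinates and blob parameters below are invented; ε = 50 nm and minPts = 3 are the DBSCAN settings from the text (scikit-learn names minPts `min_samples`).

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(2)
# Two dense blobs of localizations plus sparse uniform background (nm units).
blobs = np.vstack([rng.normal([100, 100], 10, (80, 2)),
                   rng.normal([900, 900], 10, (80, 2))])
noise = rng.uniform(0, 1000, (20, 2))
locs = np.vstack([blobs, noise])

# k-means assigns every localization (including background) to one of k clusters.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(locs)

# DBSCAN instead finds density-connected clusters and can mark sparse
# background localizations as noise (label -1).
db_labels = DBSCAN(eps=50, min_samples=3).fit_predict(locs)
n_db_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
```

This difference in how background localizations are handled is one reason the two algorithms respond differently to parameter changes in the comparison above.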
3
Discussion
We have demonstrated a pipeline for classifying SMLM fields of view, combining per‐cluster features extracted from the localization pattern with supracluster structure extracted via graph‐based DL (ClusterNet, Figure 1). Per‐cluster features may either be handcrafted and fully interpretable (e.g., cluster area, perimeter) or obtained from DL (LocNet). We have also incorporated analysis of the features and structures learnt during classification. UMAP allowed investigation of cluster and supracluster structure feature embeddings.[20] It demonstrated that supracluster structure was required to separate the classes in the Digits and Letters dataset and revealed subpopulations, such as Digit vs. Letter structures, without user specification (Figure 2). SubgraphX identified the subgraphs (subsets of clusters in their supracluster arrangements) important for classification (Figure 3).[21]
Combining handcrafted features (HCF) with ClusterNet (ClusterNet‐HCF) outperformed the model that extracted deep features at both the per‐cluster and whole‐graph scale (ClusterNet‐LCF) (Table 1, Figure 2c, f). This shows that incorporating known handcrafted features can boost classification performance, besides being more interpretable. In other SMLM datasets, deriving discriminative handcrafted features may be more difficult. We anticipate that generating cluster features via DL, as with LocNet, will be more useful in these cases. Further, extending ClusterNet‐LCF to include photophysical parameters as features of the localizations, or weighting localizations by these parameters in the calculation of handcrafted features, could allow the model to account for localization uncertainty.
Our classification pipeline outperformed previous methods applied to the Digits and Letters dataset (Table 2). One was designed for particle fusion and requires no training and minimal supervision,[18] and the other was an image‐based model.[4] It is likely that an alternative image‐based method could match our classification performance, but it would require large images (e.g., 1 pixel per nm) to capture the high precision of the SMLM data, as the localizations are binned into pixels.[14, 25] Such rendering does not scale well for larger ROIs or when moving from 2D to 3D SMLM data, as the size of the image increases exponentially.[14] Instead, point‐based methods such as ours can be extended simply without a large increase in the size of the data representation. For ClusterNet, this would include changing the handcrafted features to characterize 3D rather than 2D shapes (e.g., area to volume), changing the dimension of the position vector for each localization and node from 2D to 3D, and including 3D rotations during training to learn invariance to rotations outside the optical plane.
Analyzing high‐dimensional features of SMLM data can help to identify and investigate the underlying substructures that distinguish different SMLM data structures. This has mainly been restricted to visualization of the features in a lower‐dimensional space via dimension reduction techniques such as UMAP.[9, 10, 14, 26] For the Digits and Letters dataset, this revealed intraclass variability and identified misfolded structures by visualizing graphs closest to and furthest from the center of the class (Figure 3 and Figure S2, Supporting Information), without requiring an additional class to capture the misfolds.[18]
DL explainability algorithms can identify the underlying substructures more directly, by measuring their impact on the model's prediction. So far, this has been used on SMLM data represented as images, but not on graph representations.[5] For the Digits and Letters dataset, SubgraphX was used to identify the discriminative substructure in each FOV represented as a graph. In some cases, it identified substructure that aligned with our knowledge of the GT (Figure 3). However, future work is required to make this approach more reliable before it can be applied to other SMLM data.
In this study, the pipeline was demonstrated on DNA‐PAINT and dSTORM data, but it could be applied to a broad range of SMLM data from other techniques that generate point clouds (e.g., photoactivated localization microscopy, PALM). Our datasets had characteristics common to most SMLM data (e.g., being clustered or containing localizations that are less relevant to the classification task), making them a useful proof of concept. Moving between these different techniques or experimental conditions should not significantly affect handcrafted features that capture the overall shape and size of clusters. Learnt cluster features can adapt to any differences, as seen by the range of data types (e.g., 3D shapes, indoor and outdoor scenes) to which point‐based DL networks have been applied.[27] The model can then be retrained or adjusted to adapt to any remaining differences; for example, increasing the number of clusters improved the accuracy of ClusterNet for the dSTORM data from cells. To further improve the classification performance on the cells, other modifications to the pipeline could be tested, such as including more handcrafted features, increasing the number of layers in the network, changing how the localizations and cells are filtered during preprocessing, or including another level of clusters (localizations, clusters, superclusters) to better capture larger structures in the data.
ClusterNet can be applied to complex biological samples, as we show here: it was able to classify tumor samples into response and no response to treatment. The drop in classification performance compared to the Digits and Letters dataset reflects the challenge of classifying complex biological samples, with a lower signal‐to‐noise ratio and fewer ROIs available for model testing. Results on the tumor samples also highlighted factors that may impact classification performance, such as how the data are preprocessed or the choice of graph representation and clustering technique. Furthermore, the TMAs contain multiple cell types, some of which may not change their expression or organization of EREG in cancer (e.g., lymphocytes, enterocytes), and which could not be identified from the dSTORM data. These cells could therefore not be excluded at the annotation stage, which is likely to have affected the overall classification performance metrics. With further development, this pipeline could be used to differentiate between disease conditions or biological states from SMLM data on cells and tissues, refining current techniques in pathology that use ligand expression.[28, 29, 30] ClusterNet can therefore contribute to realizing the use of SMLM data as a new modality in characterization of phenotype, classification of disease progression, and prediction of response to treatment.[31, 32, 33, 34] We also provide a new route to increased understanding of disease and other biological processes through downstream feature and structure analysis.
4
Experimental Section
Digits and Letters Dataset
To develop and test our pipeline, we used the open‐source Digits and Letters dataset.[19] Further details on the acquisition of these datasets can be found in ref. [4] for the Digits and Grid and in ref. [18] for the Letters. The dataset comprises 22 047 SMLM ROIs from DNA‐PAINT data acquired from DNA origami structures, with one structure per ROI. This included Digits (4155 × Digit 1, 4943 × Digit 2, 2541 × Digit 3), Letters (1161 × Letter L, 991 × Letter T, 560 × Letter O), and a 3 × 4 grid (7696 × Grid), imaged separately per class.[4, 18] The number of localizations per ROI and the localization uncertainty for each class are visualized and characterized in Figure S1 and Table S1, Supporting Information.
Tumor Dataset from the PICCOLO Trial
PICCOLO was a randomized phase III trial of second‐line irinotecan with or without panitumumab (anti‐EGFR therapy) in advanced colorectal cancer.[23, 24] As part of the trial, pretreatment tumor samples were collected from each patient (mainly from the primary tumors, but with ≈5% from metastases). The best response by Response Evaluation Criteria In Solid Tumors (RECIST) within 1‐year follow‐up from randomization was recorded.[24, 35]
In our study, we imaged cells in tumor cores from the preconstructed formalin‐fixed paraffin‐embedded (FFPE) TMA blocks from PICCOLO, preferred over whole slides to minimize the time spent identifying tumor regions and to increase throughput. Only patients who received anti‐EGFR therapy (panitumumab arm) and tested wild‐type (WT) for confounding genetic mutations were included (mutations in KRAS codons 12, 13, 61, and 146; NRAS codons 12, 13, and 61; and BRAF codon 1799 T > A). Patients with BRAF‐mutant tumors were excluded because anti‐EGFR therapy is less effective for this subgroup.[36] The best response variables (from RECIST) were grouped into any‐response (complete response, partial response) and no‐response (radiological progression, clinical progression, death), with patients who only achieved stable disease excluded to simplify the problem.
Sample Preparation and SMLM Imaging for the Tumor Dataset
Sections (3 μm) were cut from each preconstructed FFPE TMA block and placed on APES‐coated 1.5H coverslips. Coverslips were placed on a hotplate at 70 °C for 30 min and then inside a pressure cooker with Reveal Decloaker (pH 6.0) for antigen retrieval. At room temperature, samples were then quenched with ammonium chloride for 15 min, permeabilized with 0.5% Triton‐X in phosphate‐buffered saline (PBS) for 1 h, and blocked with 3% bovine serum albumin (BSA) plus 20% donkey serum for 1 h. Samples were then incubated with Roche RTU anti‐EREG antibody (SP326: rabbit monoclonal, Roche) overnight at 4 °C and then with donkey anti‐rabbit Alexa 647 for 1 h at room temperature.
Each section contained samples from up to ≈30 patients, with three tumor cores (preselected from the region of highest tumor) per patient. For each patient included in this study, the core with the highest tumor content (greatest area of anti‐EREG staining in widefield scans) was selected for imaging. One FOV was imaged at the region of greatest staining within the core.
To image the samples, we performed dSTORM imaging in total internal reflection fluorescence mode using a commercial system (Nanoimager, ONI, UK) and a 100 × 1.4 NA oil‐immersion objective lens with a 50 × 80 μm FOV. Samples were bathed in STORM buffer (B‐cubed buffer, ONI, BCA0017). Using an exposure time of 30 ms, 10 000 frames were acquired using 640 nm laser excitation at 80% of the maximum excitation output of the Nanoimager. 2D localization of fluorescence emission events was performed during imaging using NimOS (ONI).
Drift correction, filtering, and temporal grouping for each FOV were performed using CODI (COllaborative DIscovery platform from ONI; https://oni.bio/nanoimager/software/codi‐software/). Localizations with >30 000 photons, with a standard deviation of the fitted point spread function (PSF) <75 nm or >200 nm, with a p‐value for the fitted PSF above 0.01, or with a localization precision >25 nm were removed. Localizations within 60 nm and no more than two frames apart were grouped, removing those that existed for longer than five frames. The data in its proprietary format was then converted into an Apache Parquet file, a column‐oriented data format that can be more efficient for querying and storage than .csv files (https://parquet.apache.org/). For each localization, the channel, frame number, and xy coordinates were stored.[37]
Preprocessing and Clustering the Digits and Letters Dataset
Each ROI was preprocessed and clustered in preparation for feature extraction and graph representation as follows. The point cloud was initially partitioned into clusters following a similar approach to PointTransformer V2.[38] First, the xy localization coordinates for each ROI were converted from a MATLAB file to an Apache Parquet file and labeled in the metadata according to the character they represent. This gave seven classes (Digit 1, Digit 2, Digit 3, Letter T, Letter O, Letter L, or Grid). Next, k‐means clustering was applied to each ROI with k set to 12 to ensure every well‐separated group of DNA‐PAINT binding sites was recovered (the Grid had 12 well‐separated binding sites; the Digits had more binding sites, but not well separated). Clusters with two or fewer localizations were discarded to allow later calculation of the convex hull and principal components. We chose k‐means clustering over DBSCAN, as DBSCAN requires careful tuning of two hyperparameters rather than one and risks dropping many important localizations considered as noise.[39]
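As a minimal sketch of this preprocessing step (not the authors' code; `kmeans_partition` and all parameter names are hypothetical, and a library implementation such as scikit-learn's `KMeans` would normally be used), the k-means partition with the small-cluster filter might look like:

```python
import numpy as np

def kmeans_partition(xy, k=12, n_iter=50, min_locs=3, seed=0):
    """Partition 2D localizations into k clusters (Lloyd's algorithm),
    then drop clusters with fewer than min_locs localizations so the
    convex hull and principal components can be computed downstream."""
    rng = np.random.default_rng(seed)
    centers = xy[rng.choice(len(xy), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each localization to its nearest center
        d2 = ((xy[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = xy[labels == j].mean(axis=0)
    keep = {j for j in range(k) if (labels == j).sum() >= min_locs}
    return {j: xy[labels == j] for j in keep}

# toy example: three well-separated groups of binding-site localizations
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(c, 5, size=(30, 2)) for c in [(0, 0), (200, 0), (0, 200)]])
clusters = kmeans_partition(pts, k=3)
```

In the paper's pipeline, k is set to 12 for the Digits and Letters ROIs and larger values (up to 96) proved better for the cell data.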
Cell Annotation, Preprocessing, and Clustering
Cells within FOVs were annotated and their localizations extracted using locpix.[40] First, the unfiltered localizations were rendered into a histogram to facilitate annotation. The membranes of cells with low intensity at their center and high intensity at their membrane were annotated, generating a membrane mask. Annotations were closed either with themselves or with the edge of the FOV. Cells were ≈5–15 μm in diameter and could be at the edge of the FOV (only partially visible). The cell‐containing region was generated by flood‐filling the membrane mask at seed locations manually placed at the cell centers. The watershed algorithm was then applied to the membrane mask, over the cell‐containing region, to generate the individual cell masks. Having first checked that these annotations still correctly overlaid the higher quality localizations (with drift correction, filtering, and temporal grouping), the cell masks and membrane annotations were used to extract the higher quality localizations for each cell and label each localization according to its location (membrane or interior). All localizations not part of a cell were removed. Cells with fewer than 500 localizations, or with fewer than five interior or membrane localizations, were removed. The xy localization coordinates for each cell were saved as an Apache Parquet file with a label in the metadata for the treatment response (any‐response or no‐response). This gave 163 cells (no‐response: 117, any‐response: 46) from 23 patients (no‐response: 18, any‐response: 5) in which the numbers of cells per patient were unequal, ranging from 1 to 31 cells per patient (Figure S5, Supporting Information).
Each cell was clustered in preparation for feature extraction and graph representation. We trialed k‐means clustering, with k = 12, 24, 48, 72, 96, 120, or 144, and DBSCAN clustering, with epsilon ε = 50, 75, or 100 nm and minimum samples minPts = 3, 5, or 7. The DBSCAN parameters were set to values similar to those used in previous studies to identify clusters of EGFR in SMLM data.[40, 41] Clusters with two or fewer localizations were discarded to allow calculation of the convex hull and principal components.
Handcrafted Feature Extraction
Eight handcrafted features were calculated for each cluster. First, the number of localizations per cluster (count), the mean squared distance of localizations within the cluster from the cluster centroid (radius of gyration squared), and the perimeter of its convex hull were calculated. Next, principal component analysis was used to calculate the variance of the cluster along the two principal components, λ0 and λ1, where λ0 > λ1. These were used to calculate the linearity ((λ0 − λ1)/λ0), planarity (λ1/λ0), length (2.35 × √λ0), and area (2.35² × √(λ0λ1)) of each cluster (the full width at half maximum of a Gaussian is given by 2.35 times the standard deviation).[42, 43] In 3D, planarity is given by (λ1 − λ2)/λ0, where λ2 is the variance along the third principal component.[42] The data were 2D in this study, so λ2 = 0 and linearity = 1 − planarity, which made one of linearity or planarity redundant. However, nothing is lost by including the redundant feature in 2D, except a small increase in memory usage and training time. Finally, the density of each cluster was calculated by dividing count by area.
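The eight per-cluster features above can be sketched in a few lines of numpy; this is an illustrative reimplementation under the stated definitions (function names are ours, not the paper's code, and degenerate clusters with λ0 = 0 are not guarded against):

```python
import numpy as np

def convex_hull_perimeter(pts):
    """Perimeter of the 2D convex hull (Andrew's monotone chain)."""
    P = sorted(map(tuple, pts))
    if len(P) <= 2:
        return 0.0
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in P:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(P):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = np.array(lower[:-1] + upper[:-1])
    return float(np.linalg.norm(np.roll(hull, -1, axis=0) - hull, axis=1).sum())

def cluster_features(xy):
    """The eight handcrafted per-cluster features described in the text."""
    centroid = xy.mean(axis=0)
    count = len(xy)
    rg2 = float(((xy - centroid) ** 2).sum(axis=1).mean())  # radius of gyration squared
    perimeter = convex_hull_perimeter(xy)
    # variances along the principal components = eigenvalues of the covariance
    lam = np.sort(np.linalg.eigvalsh(np.cov(xy.T)))[::-1]
    lam0, lam1 = lam[0], lam[1]
    linearity = (lam0 - lam1) / lam0
    planarity = lam1 / lam0
    length = 2.35 * np.sqrt(lam0)               # FWHM = 2.35 sigma
    area = 2.35 ** 2 * np.sqrt(lam0 * lam1)
    density = count / area
    return dict(count=count, rg2=rg2, perimeter=perimeter, linearity=linearity,
                planarity=planarity, length=length, area=area, density=density)

# a square cluster of four localizations, perimeter 4, fully planar
feats = cluster_features(np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
```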
Graph Construction
Each SMLM ROI (Digits and Letters) or cell (tumor dataset) was represented as a graph using PyTorch Geometric.[44] Each graph contained localization and cluster nodes (from clustering as described), where each localization node belonged to its cluster node and undirected edges connected each cluster node to itself and its five nearest neighbors. The position of each localization node was given by its coordinates, and the position of each cluster node was given by the center of mass of its constituent localizations. xy node positions were normalized to between −1 and 1, x → 2(x − min(x))/max(x_range, y_range) − 1 and y → 2(y − min(y))/max(x_range, y_range) − 1, where the minimum and range were measured over the parent graph. Cluster nodes initially had either no features or the eight handcrafted features, depending on the downstream model (learnt or handcrafted features). When present, these features, h, were normalized to between zero and one, h → (h − min(h))/(max(h) − min(h)), where the minimum and maximum values were measured over the whole training set for the relevant dataset. The localization nodes had no input features.
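The two normalizations used here can be sketched directly from the formulas (function names are ours; note that dividing both axes by the larger of the two ranges preserves the aspect ratio of the point cloud):

```python
import numpy as np

def normalize_positions(xy):
    """Scale xy coordinates into [-1, 1], dividing both axes by the
    larger of the x- and y-ranges of the parent graph."""
    mins = xy.min(axis=0)
    scale = (xy.max(axis=0) - mins).max()
    return 2 * (xy - mins) / scale - 1

def normalize_features(h, h_min, h_max):
    """Min-max normalize cluster features into [0, 1], using minimum and
    maximum values measured over the training set."""
    return (h - h_min) / (h_max - h_min)

xy = np.array([[0.0, 0.0], [100.0, 50.0], [200.0, 100.0]])
norm = normalize_positions(xy)
```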
Data Partitions
For the Digits and Letters dataset, 20 367 graphs (Digit 1: 3915, Digit 2: 4703, Digit 3: 2301, Letter L: 921, Letter T: 751, Letter O: 320, Grid: 7456) were used for five‐fold cross‐validation. A further 240 graphs from each class formed a reserved test set. The five different splits in cross‐validation each contained a training (64%), validation (16%), and test set (20%, nonoverlapping between splits). For each split, the training, validation, and test set each had the same proportion of classes as the overall cross‐validation dataset.
For the cell dataset, the cells were partitioned for five‐fold cross‐validation of model performance. The cell dataset was first divided into five folds (groups of cells). Each fold was used for testing, with the remaining four folds used for training. During training, 20% of this training set formed a validation set, where the ratio of any‐response to no‐response cells in the validation set was the same as in the training set, or as close as possible. This was used to monitor performance on data not used to train the model. The cells from a single patient could only be in one of the folds, to ensure that none of the models were trained and tested on cells from the same patient. Further, as best as possible, the folds had the same ratio of any‐response to no‐response cells as in the overall dataset.
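The paper does not give the exact fold-assignment algorithm, but the patient-level constraint described above can be sketched with a simple greedy scheme (function name and tie-breaking are our assumptions, and the class-ratio balancing also described is omitted for brevity):

```python
from collections import defaultdict

def patient_folds(cell_patient_ids, n_folds=5):
    """Assign cells to folds so that all cells from one patient share a fold,
    ensuring no model is trained and tested on cells from the same patient.
    Patients are placed, largest first, into the currently smallest fold."""
    cells_by_patient = defaultdict(list)
    for cell_idx, pid in enumerate(cell_patient_ids):
        cells_by_patient[pid].append(cell_idx)
    folds = [[] for _ in range(n_folds)]
    for pid, cells in sorted(cells_by_patient.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(cells)
    return folds

# toy example: 16 cells from 5 patients with unequal cell counts
ids = ["p1"] * 6 + ["p2"] * 4 + ["p3"] * 3 + ["p4"] * 2 + ["p5"]
folds = patient_folds(ids, n_folds=3)
```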
Model Architecture
Two different neural network models were developed to classify each graph, ClusterNet‐HCF and ClusterNet‐LCF. ClusterNet‐HCF passed HCF, as described, and the positions of the cluster nodes, through a novel graph neural network, ClusterNet. ClusterNet‐LCF instead generated cluster features via DL using an additional point‐based module, LocNet, and passed them, with cluster position, through ClusterNet in a single network. Brief descriptions of the models are given below, with further details in Methods and Figure S6 in Supporting Information.
LocNet acted on each cluster independently, taking the constituent localization node positions, p ∈ ℝ², as input and embedding the localizations into a feature vector (length eight) for each cluster using PointTransformer v1.[45, 46] The feature vector was chosen to be length eight to allow a fair comparison between ClusterNet‐HCF, which used an eight‐dimensional input feature, and ClusterNet‐LCF. Increasing the dimension of this feature vector may allow ClusterNet‐LCF to represent and classify more complex structures. Output cluster node features from LocNet were then scaled to between 0 and 1 using the sigmoid function.
ClusterNet acted on the graph of cluster nodes with their feature vectors and their xy position. It was composed of four message passing layers with PointTransformer convolutions, a global maximum pooling layer over the cluster features, resulting in a feature vector for the graph, and a final linear layer, generating a class prediction. When classifying the cells, the number of output channels for the ClusterNet was changed from 7 to 2 (Figure S6, Supporting information).
Training Procedure
For each split in five‐fold cross‐validation, the model was trained on the training dataset for 100 epochs using the ADAM optimizer with a learning rate of 0.001, a weight decay of 0.0001, and a batch size of 128 for the Digits and Letters dataset or eight for the cells.[47] During training, a weighted random sampler that oversampled the minority class and undersampled the majority class was used to encourage equal performance across the classes. Further, random rotations in the xy plane were applied to the graphs for data augmentation. For each graph, the model output the log probability for each class, and the negative log‐likelihood loss was calculated. After each epoch, the model was evaluated on the validation set. The model that gave the lowest loss on the validation set over all epochs was chosen as the best model. Training for 100 epochs on an NVIDIA GeForce RTX 2060 with 6 GB RAM took ≈1.5 h for the Digits and Letters dataset and ≈5 min for the cell dataset. ClusterNet‐LCF was trained end‐to‐end, meaning that LocNet and ClusterNet were trained together as a single network.
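The class-balancing weights behind such a weighted random sampler (e.g., PyTorch's `torch.utils.data.WeightedRandomSampler` takes per-sample weights of this form) can be illustrated in pure Python; `balanced_sample_weights` is our illustrative name, not the paper's code:

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Per-sample weights inversely proportional to class frequency, so a
    weighted random sampler draws each class with equal probability."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

labels = [0, 0, 0, 0, 0, 0, 1, 1]          # 6:2 class imbalance
weights = balanced_sample_weights(labels)
# total weight (hence expected draw probability) is equal per class:
class0 = sum(w for w, y in zip(weights, labels) if y == 0)
class1 = sum(w for w, y in zip(weights, labels) if y == 1)
```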
Evaluation Procedure
The best model for each split was evaluated on the test set for that split. For evaluation, each graph (without random rotation) was classified according to the highest‐probability class from the model. For the Digits and Letters dataset, the predictions for each class were evaluated using recall and combined into a single metric for all the classes using the arithmetic mean (balanced accuracy).[48] For the cell dataset, the predictions for both classes were evaluated using balanced accuracy for binary classification.[48] The probabilities for each class were evaluated using the AUROC.
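Balanced accuracy as defined here (the arithmetic mean of per-class recall, the same definition as scikit-learn's `balanced_accuracy_score`) can be computed in a few lines:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# toy binary example: recall 3/3 on class 0 and 1/2 on class 1
bacc = balanced_accuracy([0, 0, 0, 1, 1], [0, 0, 0, 1, 0])
```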
Evaluation on the Reserved Test Set for the Digits and Letters Dataset
The performance of ClusterNet‐HCF and ClusterNet‐LCF was ultimately compared on the reserved test set. First, the entire dataset (excluding the reserved test set) was split into a training (80%) and validation (20%) set by randomly taking ≈20% of each class into the validation set. Next, each model was trained on the training set and saved when the loss was lowest on the validation set. These models were then evaluated on the reserved test set following the evaluation procedure above.
Feature Analysis via UMAP
We used UMAP to visualize the per‐cluster and whole‐graph features and to explore if they separated the classes. Four sets of features were visualized: HCF, cluster features embedded by LocNet, cluster features after the fourth message passing layer of ClusterNet, and whole‐graph features after the final maximum pooling layer of ClusterNet. The handcrafted and LocNet features only consider the structure at cluster‐level and smaller and are later referred to as isolated per‐cluster features. The cluster features from ClusterNet have a larger receptive field incorporating information from neighboring clusters. Finally, the whole‐graph features from ClusterNet pool information over all constituent clusters. In all cases, features were normalized by subtracting the mean and dividing by the variance of each feature independently, measured over the entire dataset excluding the reserved test set. UMAP generated a lower‐dimensional (2D) representation of the features, with 20 neighbors for each feature vector and 0.5 minimum distance in the lower‐dimensional space. UMAP plots can be visualized interactively, displaying the parent ROI, GT, and prediction.
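The feature normalization described (subtracting the mean and dividing by the variance, with statistics from the dataset excluding the reserved test set) can be sketched as follows; the commented `umap.UMAP` call shows the standard umap-learn API with the stated parameters, and `standardize` is our illustrative name:

```python
import numpy as np

def standardize(features, ref):
    """Per-feature normalization as described in the text: subtract the mean
    and divide by the variance, with statistics measured over ref."""
    return (features - ref.mean(axis=0)) / ref.var(axis=0)

# The normalized features would then be embedded in 2D with, e.g.,
# umap.UMAP(n_neighbors=20, min_dist=0.5).fit_transform(standardize(X, X_train))

ref = np.array([[0.0, 10.0], [2.0, 14.0], [4.0, 18.0]])
z = standardize(ref, ref)
```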
Structure Analysis via SubgraphX
SubgraphX was used to identify structures in the graphs constructed from the cluster nodes (cluster graphs) and their features (handcrafted or embedded by LocNet) that were important for the classification.[
21
] SubgraphX was preferred over similar methods due to its high performance in non‐SMLM graph data benchmarks.[
49
] SubgraphX searches for the most important connected subgraph in the cluster graph by feeding subgraphs induced by different sets of nodes into ClusterNet and measuring their relative contribution to the model's prediction (see methods section in Supporting Information). The optimal subgraph was required to have no more than eight nodes, and the number of rollouts was increased from 20 (default value) to 100, to minimize instability of the prediction (further details and parameter values in methods section in Supporting Information).
Positive and negative fidelity scores measured the necessity and sufficiency, respectively, of the optimal subgraph (subset of cluster nodes) for the prediction.[
22
] They calculated the difference between the probability of the predicted class when the whole graph was fed into the model, and when the graph minus the subgraph (positive fidelity) or only the subgraph (negative fidelity) was fed into the model.[
50
,
51
,
52
,
53
] Best performance was given by a positive fidelity of one and negative fidelity of zero, and worst performance by positive fidelity of zero and negative fidelity of one.
Experimental Section
Digits and Letters Dataset
To develop and test our pipeline we used the open-source Digits and Letters dataset.[19] Further details on how these datasets were acquired can be found in ref. [4] for the Digits and Grid and ref. [18] for the Letters. The dataset comprises 22 047 SMLM ROIs from DNA-PAINT data acquired from DNA origami structures, with one structure per ROI. This included Digits (4155 × Digit 1, 4943 × Digit 2, 2541 × Digit 3), Letters (1161 × Letter L, 991 × Letter T, 560 × Letter O), and a 3 × 4 grid (7696 × Grid), imaged separately per class.[4, 18] The number of localizations per ROI and localization uncertainty for each class are visualized and characterized in Figure S1 and Table S1, Supporting Information.
Tumor Dataset from the PICCOLO Trial
PICCOLO was a randomized phase III trial of second-line irinotecan with or without panitumumab (anti-EGFR therapy) in advanced colorectal cancer.[23, 24] As part of the trial, pretreatment tumor samples were collected from each patient (mainly from the primary tumors but with ≈5% from metastases). Best response by the Response Evaluation Criteria In Solid Tumors (RECIST) within 1-year follow-up from randomization was recorded.[24, 35]
In our study, we imaged cells in tumor cores from the preconstructed formalin-fixed paraffin-embedded (FFPE) TMA blocks from PICCOLO, preferred over whole slides to minimize the time spent identifying tumor regions and to increase throughput. Only patients who received anti-EGFR therapy (panitumumab arm) and tested wild-type (WT) for confounding genetic mutations were included (mutations in KRAS codons 12, 13, 61, and 146; NRAS codons 12, 13, and 61; and BRAF codon 1799 T > A). Patients with BRAF-mutant tumors were excluded because anti-EGFR therapy is less effective for this subgroup.[36] The best response variables (from RECIST) were grouped into any-response (complete response, partial response) and no-response (radiological progression, clinical progression, death), with patients who only achieved stable disease excluded to simplify the problem.
Sample Preparation and SMLM Imaging for the Tumor Dataset
Sections (3 μm thick) were cut from each preconstructed FFPE TMA block and placed on APES-coated 1.5H coverslips. Coverslips were placed on a hotplate at 70 °C for 30 min and then inside a pressure cooker with Reveal Decloaker (pH 6.0) for antigen retrieval. At room temperature, samples were then quenched with ammonium chloride for 15 min, permeabilized with 0.5% Triton-X in phosphate buffered saline (PBS) for 1 h, and blocked with 3% bovine serum albumin (BSA) plus 20% donkey serum for 1 h. Samples were then incubated with Roche RTU anti-EREG antibody (SP326: rabbit monoclonal, Roche) overnight at 4 °C and then with donkey anti-rabbit Alexa 647 for 1 h at room temperature.
Each section contained samples from up to ≈30 patients, with three tumor cores (preselected from the region of highest tumor) per patient. For each patient included in this study, the core with the highest tumor content (greatest area of anti‐EREG staining in widefield scans) was selected for imaging. One FOV was imaged at the region of greatest staining within the core.
To image the samples, we performed total internal reflection fluorescence microscopy dSTORM imaging using a commercial system (Nanoimager, ONI, UK) and a 100×/1.4 NA oil-immersion objective lens with a 50 × 80 μm FOV. Samples were bathed in STORM buffer (B-cubed buffer, ONI, BCA0017). Using an exposure time of 30 ms, 10 000 frames were acquired using 640 nm laser excitation at 80% of the maximum excitation output of the Nanoimager. 2D localization of fluorescence emission events was performed during imaging using NimOS (ONI).
Drift correction, filtering, and temporal grouping for each FOV were performed using CODI (COllaborative DIscovery platform from ONI; https://oni.bio/nanoimager/software/codi-software/). Localizations with >30 000 photons, with a standard deviation of the fitted point spread function (PSF) <75 nm or >200 nm, with a p-value for the fitted PSF above 0.01, or with a localization precision >25 nm were removed. Localizations within 60 nm and no more than two frames apart were grouped, removing those that existed for longer than five frames. The data in its proprietary format was then converted into an Apache Parquet file, a column-oriented data format which can be more efficient for querying and storing than .csv files (https://parquet.apache.org/). For each localization, the channel, frame number, and xy coordinates were stored.[37]
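The filtering criteria above can be sketched as a simple per-localization quality gate. The record layout (a dict of fields) is illustrative only; CODI applies these thresholds internally on its own data format.

```python
# Illustrative quality filter matching the thresholds listed above.
# The field names are assumptions for this sketch.

def keep_localization(photons, psf_sd_nm, psf_p_value, precision_nm):
    """Return True if a localization passes every quality check."""
    if photons > 30_000:                   # overly bright events removed
        return False
    if psf_sd_nm < 75 or psf_sd_nm > 200:  # implausible PSF width
        return False
    if psf_p_value > 0.01:                 # poor PSF fit
        return False
    if precision_nm > 25:                  # imprecise localization
        return False
    return True

locs = [
    {"photons": 5_000, "psf_sd_nm": 120, "psf_p_value": 0.001, "precision_nm": 12},
    {"photons": 40_000, "psf_sd_nm": 120, "psf_p_value": 0.001, "precision_nm": 12},
    {"photons": 5_000, "psf_sd_nm": 60, "psf_p_value": 0.001, "precision_nm": 12},
]
kept = [loc for loc in locs if keep_localization(**loc)]
```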
Preprocessing and Clustering the Digits and Letters Dataset
Each ROI was preprocessed and clustered in preparation for feature extraction and graph representation as follows. The point cloud was initially partitioned into clusters following a similar approach to PointTransformer V2.[38] First, the xy localization coordinates for each ROI were converted from a MATLAB file to an Apache Parquet file and labeled in the metadata according to the character they represent. This gave seven classes (Digit 1, Digit 2, Digit 3, Letter T, Letter O, Letter L, or Grid). Next, k-means clustering was applied to each ROI with k set to 12 to ensure every well-separated group of DNA-PAINT binding sites was recovered (Grid had 12 well-separated binding sites; Digits had more binding sites, but not well-separated). Clusters with two or fewer localizations were discarded to allow future calculation of the convex hull and principal components. We chose k-means clustering over DBSCAN, as DBSCAN requires careful tuning of two hyperparameters rather than one and risks dropping many important localizations considered as noise.[39]
Cell Annotation, Preprocessing, and Clustering
Cells within FOVs were annotated and their localizations extracted using locpix.[40] First, the unfiltered localizations were rendered into a histogram to facilitate annotation. The membranes of cells with low intensity at their center and high intensity at their membrane were annotated, generating a membrane mask. Annotations were either closed with themselves or the edge of the FOV. Cells were ≈5–15 μm in diameter and could be at the edge of the FOV (only partially visible). The cell-containing region was generated by flood-filling the membrane mask at seed locations manually placed at the cell centers. The watershed algorithm was then applied to the membrane mask, over the cell-containing region, to generate the individual cell masks. Having first checked these annotations still correctly overlaid the higher quality localizations (with drift correction, filtering, and temporal grouping), the cell masks and membrane annotations were used to extract the higher quality localizations for each cell and label each localization according to its location (membrane or interior). All localizations not part of a cell were removed. Cells with fewer than 500 localizations, or with fewer than five interior or membrane localizations, were removed. The xy localization coordinates for each cell were saved as an Apache Parquet file with a label in the metadata for the treatment response (any-response or no-response). This gave 163 cells (no-response: 117, any-response: 46) from 23 patients (no-response: 18, any-response: 5) in which the numbers of cells per patient were unequal, ranging from 1 to 31 cells per patient (Figure S5, Supporting Information).
Each cell was clustered in preparation for feature extraction and graph representation. We trialed k-means clustering, with k = 12, 24, 48, 72, 96, 120, or 144, and DBSCAN clustering, with epsilon ε = 50, 75, or 100 nm and minimum samples minPts = 3, 5, or 7. The DBSCAN parameters were set to similar values as used in previous studies to identify clusters of EGFR in SMLM data.[40, 41] Clusters with two or fewer localizations were discarded to allow calculation of the convex hull and principal components.
Handcrafted Feature Extraction
Eight handcrafted features were calculated for each cluster. First, the number of localizations per cluster (count), the mean squared distance of localizations within the cluster from the cluster centroid (radius of gyration squared), and the perimeter of its convex hull were calculated. Next, principal component analysis was used to calculate the variance of the clusters along the two principal components, λ0 and λ1, where λ0 > λ1. These were used to calculate the linearity ((λ0 − λ1)/λ0), planarity (λ1/λ0), length (2.35 × √λ0), and area (2.35² × √(λ0λ1)) of each cluster (the full width at half maximum of a Gaussian is given by 2.35 times the standard deviation).[42, 43] In 3D, planarity is given by (λ1 − λ2)/λ0, where λ2 is the variance along the third principal component.[42] The data were 2D in this study, so λ2 = 0 and linearity = 1 − planarity, which made one of linearity or planarity redundant. However, nothing is lost by including the redundant features in 2D, except a small increase in memory usage and training time. Finally, the density of each cluster was calculated by dividing count by area.
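The per-cluster features above can be computed from the eigenvalues of the 2 × 2 covariance matrix of the localization coordinates. The sketch below is illustrative (it omits the convex-hull perimeter for brevity and is not the authors' implementation):

```python
import math

# Sketch of the handcrafted per-cluster features described above.
# Variable names are illustrative; the convex-hull perimeter is omitted.

def cluster_features(points):
    """points: list of (x, y) localization coordinates of one cluster."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Radius of gyration squared: mean squared distance to the centroid.
    rg2 = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points) / n
    # 2x2 covariance matrix of the centered coordinates.
    sxx = sum((p[0] - cx) ** 2 for p in points) / n
    syy = sum((p[1] - cy) ** 2 for p in points) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points) / n
    # Eigenvalues of [[sxx, sxy], [sxy, syy]]: variances along the two
    # principal components, ordered so that l0 >= l1.
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    disc = math.sqrt(max(tr ** 2 / 4 - det, 0.0))
    l0, l1 = tr / 2 + disc, tr / 2 - disc
    linearity = (l0 - l1) / l0
    planarity = l1 / l0
    length = 2.35 * math.sqrt(l0)            # FWHM along the major axis
    area = 2.35 ** 2 * math.sqrt(l0 * l1)    # product of the two FWHMs
    density = n / area if area > 0 else 0.0
    return {"count": n, "rg2": rg2, "linearity": linearity,
            "planarity": planarity, "length": length,
            "area": area, "density": density}

# A perfectly collinear cluster has linearity 1 and planarity 0.
f = cluster_features([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)])
```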
Graph Construction
Each SMLM ROI (Digits and Letters) or cell (tumor dataset) was represented as a graph using PyTorch Geometric.[44] Each graph contained localization and cluster nodes (from clustering as described), where each localization node belonged to its cluster node and undirected edges connected each cluster node to itself and its nearest five neighbors. The position of each localization node was given by its coordinates and the position of each cluster node was given by the center of mass of its constituent localizations. xy node positions were normalized to between −1 and 1, x → 2(x − min(x))/max(x_range, y_range) − 1 and y → 2(y − min(y))/max(x_range, y_range) − 1, where the minimum and range were measured over the parent graph. Cluster nodes initially had either no features or the eight handcrafted features depending on the downstream model (learnt or handcrafted features). When present, these features, h, were normalized to between zero and one, h → (h − min(h))/(max(h) − min(h)), where the minimum and maximum values were measured over the whole training set for the relevant dataset. The localization nodes had no input features.
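The two normalizations above can be sketched directly. Note that dividing both axes by the larger of the two ranges preserves the aspect ratio of the point cloud (function names here are illustrative):

```python
# Sketch of the coordinate and feature normalizations described above.

def normalize_positions(xs, ys):
    """Map xy positions into [-1, 1], preserving the aspect ratio by
    dividing both axes by the larger of the two ranges."""
    x_min, y_min = min(xs), min(ys)
    scale = max(max(xs) - x_min, max(ys) - y_min)
    return ([2 * (x - x_min) / scale - 1 for x in xs],
            [2 * (y - y_min) / scale - 1 for y in ys])

def normalize_feature(h, h_min, h_max):
    """Min-max scale a feature to [0, 1]; h_min and h_max are measured
    over the whole training set, not the individual graph."""
    return (h - h_min) / (h_max - h_min)

# The wider x axis spans the full [-1, 1]; the narrower y axis does not.
xs, ys = normalize_positions([0.0, 10.0, 20.0], [0.0, 5.0, 10.0])
```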
Data Partitions
For the Digits and Letters dataset, 20 367 graphs (Digit 1: 3915, Digit 2: 4703, Digit 3: 2301, Letter L: 921, Letter T: 751, Letter O: 320, Grid: 7456) were used for five‐fold cross‐validation. A further 240 graphs from each class formed a reserved test set. The five different splits in cross‐validation each contained a training (64%), validation (16%), and test set (20%, nonoverlapping between splits). For each split, the training, validation, and test set each had the same proportion of classes as the overall cross‐validation dataset.
For the cell dataset, the cells were partitioned for five‐fold cross‐validation of model performance. The cell dataset was first divided into five folds (groups of cells). Each fold was used for testing, with the remaining four folds used for training. During training, 20% of this training set formed a validation set, where the ratio of any‐response to no‐response cells in the validation set was the same as in the training set, or as close as possible. This was used to monitor performance on data not used to train the model. The cells from a single patient could only be in one of the folds, to ensure that none of the models were trained and tested on cells from the same patient. Further, as best as possible, the folds had the same ratio of any‐response to no‐response cells as in the overall dataset.
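The patient-level constraint above (all cells of a patient in one fold, with fold sizes kept roughly balanced) can be sketched with a simple greedy assignment. This is an illustration of the constraint, not the authors' exact partitioning procedure:

```python
# Illustrative patient-grouped fold assignment: every cell of a patient
# lands in exactly one fold, and fold sizes are roughly balanced by
# assigning the largest patients first (greedy; an assumption here).

def assign_folds(cells_per_patient, n_folds=5):
    """cells_per_patient: dict patient_id -> number of cells."""
    folds = {i: [] for i in range(n_folds)}
    sizes = [0] * n_folds
    for patient, n_cells in sorted(cells_per_patient.items(),
                                   key=lambda kv: -kv[1]):
        i = sizes.index(min(sizes))  # emptiest fold so far
        folds[i].append(patient)
        sizes[i] += n_cells
    return folds

folds = assign_folds({"p1": 31, "p2": 12, "p3": 9, "p4": 5,
                      "p5": 3, "p6": 2, "p7": 1})
```

A grouped split like this ensures no model is trained and tested on cells from the same patient, which would otherwise inflate performance estimates.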
Model Architecture
Two different neural network models were developed to classify each graph, ClusterNet‐HCF and ClusterNet‐LCF. ClusterNet‐HCF passed HCF, as described, and the positions of the cluster nodes, through a novel graph neural network, ClusterNet. ClusterNet‐LCF instead generated cluster features via DL using an additional point‐based module, LocNet, and passed them, with cluster position, through ClusterNet in a single network. Brief descriptions of the models are given below, with further details in Methods and Figure S6 in Supporting Information.
LocNet acted on each cluster independently, taking the constituent localization node positions, p ∈ ℝ², as input and embedding the localizations into a feature vector (length eight) for each cluster using PointTransformer v1.[45, 46] The feature vector was chosen to be length eight to allow a fair comparison between ClusterNet-HCF, which used an eight-dimensional input feature, and ClusterNet-LCF. Increasing the dimension of this feature vector may allow ClusterNet-LCF to represent and classify more complex structures. Output cluster node features from LocNet were then scaled to between 0 and 1 using the sigmoid function.
ClusterNet acted on the graph of cluster nodes with their feature vectors and their xy position. It was composed of four message passing layers with PointTransformer convolutions, a global maximum pooling layer over the cluster features, resulting in a feature vector for the graph, and a final linear layer, generating a class prediction. When classifying the cells, the number of output channels for the ClusterNet was changed from 7 to 2 (Figure S6, Supporting information).
Training Procedure
For each split in five-fold cross-validation, the model was trained on the training dataset for 100 epochs using the ADAM optimizer with a learning rate of 0.001, a weight decay of 0.0001, and a batch size of 128 for the Digits and Letters or eight for the cells.[47] During training, a weighted random sampler that oversampled from the minority class and undersampled from the majority class was used to encourage equal performance across the classes. Further, random rotations in the xy plane were applied to the graphs for data augmentation. For each graph, the model outputted the log probability for each class and the negative log-likelihood loss was calculated. After each epoch, the model was evaluated on the validation set. The model that gave the lowest loss on the validation set over all the epochs was chosen as the best model. Training for 100 epochs on a NVIDIA GeForce RTX 2060 with 6 GB RAM took ≈1.5 h for the Digits and Letters dataset and ≈5 min for the cell dataset. ClusterNet-LCF was trained in an end-to-end manner, meaning that LocNet and ClusterNet were trained together as a single network.
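The class-balancing sampler can be sketched by weighting each graph inversely to its class frequency, the standard recipe for a weighted random sampler (e.g. torch.utils.data.WeightedRandomSampler); the exact weighting used is an assumption here:

```python
from collections import Counter

# Sketch: per-sample weights inversely proportional to class frequency,
# so a weighted random sampler draws each class equally often in
# expectation. A standard recipe, assumed rather than taken from the paper.

def sampler_weights(labels):
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Six no-response cells vs two any-response cells: total weight per
# class is equal, so each class is sampled equally often in expectation.
weights = sampler_weights(["no_resp"] * 6 + ["any_resp"] * 2)
```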
Evaluation Procedure
The best model for each split was evaluated on the test set for each split. For evaluation, each graph (without random rotation) was classified according to the highest-probability class from the model. For the Digits and Letters, the predictions for each class were evaluated using recall and combined into a single metric for all the classes using the arithmetic mean (balanced accuracy).[48] For the cell dataset, the predictions for both classes were evaluated using balanced accuracy for binary classification.[48] The probabilities for each class were evaluated using the AUROC.
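Balanced accuracy as defined above (the arithmetic mean of per-class recalls, matching sklearn.metrics.balanced_accuracy_score) can be sketched as:

```python
# Sketch of balanced accuracy: per-class recall averaged with equal
# weight per class, so minority classes count as much as majority ones.

def balanced_accuracy(y_true, y_pred):
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# A majority-class predictor reaches 75% raw accuracy here but only
# 0.5 balanced accuracy, since it never recalls the minority class.
score = balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0])
```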
Evaluation on the Reserved Test Set for the Digits and Letters Dataset
The performance of ClusterNet-HCF and ClusterNet-LCF was ultimately compared on the reserved test set. First, the entire dataset (excluding the reserved test set) was split into a training (80%) and validation (20%) set, by randomly taking ≈20% of each class into the validation set. Next, each model was trained on the training set and saved when the loss was lowest on the validation set. These models were then evaluated on the reserved test set following the evaluation procedure above.
Feature Analysis via UMAP
We used UMAP to visualize the per-cluster and whole-graph features and to explore whether they separated the classes. Four sets of features were visualized: HCF, cluster features embedded by LocNet, cluster features after the fourth message passing layer of ClusterNet, and whole-graph features after the final maximum pooling layer of ClusterNet. The handcrafted and LocNet features only consider the structure at cluster level and smaller and are later referred to as isolated per-cluster features. The cluster features from ClusterNet have a larger receptive field incorporating information from neighboring clusters. Finally, the whole-graph features from ClusterNet pool information over all constituent clusters. In all cases, features were normalized by subtracting the mean and dividing by the variance of each feature independently, measured over the entire dataset excluding the reserved test set. UMAP generated a lower-dimensional (2D) representation of the features, with 20 neighbors for each feature vector and a minimum distance of 0.5 in the lower-dimensional space. UMAP plots can be visualized interactively, displaying the parent ROI, ground truth (GT), and prediction.
Structure Analysis via SubgraphX
SubgraphX was used to identify structures in the graphs constructed from the cluster nodes (cluster graphs) and their features (handcrafted or embedded by LocNet) that were important for the classification.[21] SubgraphX was preferred over similar methods due to its high performance in non-SMLM graph data benchmarks.[49] SubgraphX searches for the most important connected subgraph in the cluster graph by feeding subgraphs induced by different sets of nodes into ClusterNet and measuring their relative contribution to the model's prediction (see methods section in Supporting Information). The optimal subgraph was required to have no more than eight nodes, and the number of rollouts was increased from 20 (default value) to 100, to minimize instability of the prediction (further details and parameter values in methods section in Supporting Information).
Positive and negative fidelity scores measured the necessity and sufficiency, respectively, of the optimal subgraph (subset of cluster nodes) for the prediction.[22] They calculated the difference between the probability of the predicted class when the whole graph was fed into the model, and when the graph minus the subgraph (positive fidelity) or only the subgraph (negative fidelity) was fed into the model.[50, 51, 52, 53] Best performance was given by a positive fidelity of one and a negative fidelity of zero, and worst performance by a positive fidelity of zero and a negative fidelity of one.
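The fidelity definitions above reduce to two subtractions. In this sketch the three probabilities are example values standing in for model outputs, not results from the paper:

```python
# Sketch of the positive/negative fidelity scores defined above.
# p_full, p_without_subgraph, p_only_subgraph: the model's probability
# for the predicted class given the whole graph, the graph minus the
# explanatory subgraph, and the subgraph alone.

def fidelity_scores(p_full, p_without_subgraph, p_only_subgraph):
    fid_pos = p_full - p_without_subgraph  # necessity: drop when removed
    fid_neg = p_full - p_only_subgraph     # sufficiency: drop when kept alone
    return fid_pos, fid_neg

# An important, self-sufficient subgraph: removing it destroys the
# prediction (fid+ near 1) while it alone preserves it (fid- near 0).
fid_pos, fid_neg = fidelity_scores(0.95, 0.05, 0.90)
```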
Code Availability
All code used for analysis was written in Python and is available at https://github.com/oubino/locpix_points/tree/v0.0.1, with latest developments available at https://github.com/oubino/locpix_points. We highlight particular dependencies in Methods, Supporting Information.
Ethics Statement
Ethical approval for the work involving human tissue was granted by the North East York Research Ethics Committee (08/H0903/62). The study includes patients who, at the point of consent for the PICCOLO trial, had agreed to the use of their archival tissue in future research and for whom adequate stored formalin-fixed, paraffin-embedded (FFPE) tumor tissue remained. The ethics for PICCOLO trial consent and use in research is COREC 06/Q0906/38.
Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.
Conflict of Interest
The authors declare no conflict of interest.