
scTELL: a single-cell ATAC-seq tool for locus-specific transposable element identification in chromatin accessibility.

Mobile DNA 2026 Vol.17(1)

Jeong K, Ha H, Xing J, Choi J, Kim K


Cite this article

APA: Jeong K, Ha H, et al. (2026). scTELL: a single-cell ATAC-seq tool for locus-specific transposable element identification in chromatin accessibility. Mobile DNA, 17(1). https://doi.org/10.1186/s13100-026-00395-y
MLA: Jeong K, et al. "scTELL: a single-cell ATAC-seq tool for locus-specific transposable element identification in chromatin accessibility." Mobile DNA, vol. 17, no. 1, 2026.
PMID: 41731605

Abstract

[BACKGROUND] Transposable elements (TEs) constitute a substantial fraction of the human genome and contribute to gene regulatory programs. However, systematic analysis of TEs at the individual locus level remains technically challenging, particularly in single-cell contexts. While single-cell technologies have advanced the study of cellular heterogeneity, most analytical frameworks remain gene-centric. Existing TE-focused approaches are largely restricted to transcriptional profiling using scRNA-seq data, while analyses of single-cell chromatin accessibility have focused primarily on aggregate or family-level TE signals rather than individual loci. Consequently, no dedicated computational framework exists for quantifying chromatin accessibility at individual TE loci from scATAC-seq data, limiting investigation of locus-specific TE regulatory activity at single-cell resolution.

[RESULTS] scTELL (single-cell Transposable Element Locus-Level analysis) is a computational framework that quantifies TE accessibility at individual loci from scATAC-seq data using a distance-weighted scoring scheme. We applied scTELL to diverse biological systems, including healthy peripheral blood mononuclear cells (PBMCs), clear cell renal cell carcinoma (ccRCC), and breast cancer (BC). In PBMCs, scTELL identified distinct cell-type-specific TE accessibility patterns with clustering performance comparable to established gene activity scoring approaches, and validated key TE accessibility patterns using bulk ATAC-seq data from sorted immune cell populations. Motif enrichment analyses of TE-associated accessible regions revealed distinct TF motif landscapes, including family-level motif signatures, within-family locus heterogeneity across cell types, and motifs enriched in TE-associated regions relative to gene promoters. In cancer contexts, scTELL identified heterogeneity-associated TE loci and observed clinically associated accessibility patterns, including an L1PA2 locus in ccRCC associated with progression-free interval, and survival-associated TE loci in BC.

[CONCLUSIONS] scTELL provides a much-needed and robust tool to investigate the locus-specific regulatory landscape of TEs at single-cell resolution. Our findings demonstrate that this approach can uncover previously unrecognized cell-type-specific and disease-associated TE accessibility. The scTELL framework offers a new layer of biological insight, complementing existing single-cell analysis protocols and enabling the discovery of candidate biomarkers from a vast, understudied portion of the genome. While these associations are reproducible across datasets, prospective validation and functional studies will be required to establish clinical utility and to determine whether any locus has a causal role or therapeutic relevance.

[SUPPLEMENTARY INFORMATION] The online version contains supplementary material available at 10.1186/s13100-026-00395-y.


Background

Transposable elements (TEs) comprise a substantial proportion of eukaryotic genomes, accounting for approximately 50% of the human genome [1]. Once considered functionless DNA, TEs are now recognized as important contributors to genome evolution and regulation. TEs influence genome structure, gene expression programs, and cell-type specification by modifying chromosomal architecture and introducing regulatory sequences such as enhancers, promoters, and other cis-regulatory elements, thereby shaping gene regulatory networks and developmental processes [2–5]. Accumulating evidence indicates that TE-derived sequences represent a major and diverse source of context-dependent cis-regulatory elements in mammalian genomes, although their regulatory potential varies substantially across genomic loci and cellular states [6–8].
The advent of single-cell sequencing technologies, including single-cell RNA sequencing (scRNA-seq) and single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq), has transformed our understanding of cellular heterogeneity and gene regulation. However, most single-cell analytical frameworks remain gene-centric, and systematic evaluation of TE expression or activity at single-cell resolution has been limited. Conventional transcriptomic pipelines often discard multi-mapping reads, resulting in loss of information from repetitive TE sequences. Recent approaches have emerged to address this limitation. For example, scTE enables quantification of TE expression in scRNA-seq data and has revealed roles of TE expression during development, cell differentiation, and disease progression [9]. While these approaches represent important advances, they primarily focus on transcriptional output and typically quantify TE expression at the family or subfamily level, which can obscure locus-specific regulatory potential. Moreover, transcription-based measurements do not capture the upstream regulatory layer of chromatin accessibility, leaving a critical gap in the analysis of TE regulation in single-cell epigenomic data [7].
Locus-specific analysis of TEs remains technically challenging due to high sequence similarity among retrotransposon copies, particularly for young elements. As a result, most existing studies have relied on bulk assays or aggregate-level analyses, which cannot resolve the accessibility of individual TE loci within heterogeneous cell populations. Although recent scATAC-seq studies have begun to incorporate repetitive elements at the family or subfamily level, true locus-level resolution has remained largely unexplored due to mapping ambiguity and signal attribution challenges, especially for evolutionarily young retrotransposons.
Beyond mapping ambiguity, scATAC-seq poses additional obstacles for insertion-level TE quantification. The inherent sparsity of single-cell accessibility profiles and the positional uncertainty of peak summits relative to functional regulatory sites complicate attribution of accessibility signals to specific TE insertions using simple overlap-based counting. Robust locus-level quantification therefore benefits from scoring strategies that incorporate distance-dependent signal contributions, analogous to approaches used for gene activity estimation. A practical strategy is to focus on TE loci supported by called ATAC-seq peaks and to leverage the accumulated sequence divergence of many TE copies, which renders a substantial fraction of peak-supported insertions uniquely mappable under unique-mapping–based preprocessing. While this approach may underrepresent the youngest subfamilies with near-identical copies (a limitation we address in the Discussion), it enables reliable locus-level interpretation for many accessible TE loci in single-cell data.
Nevertheless, accumulating evidence indicates that TEs are dynamically regulated rather than constitutively silenced. During early embryogenesis, TEs contribute to pre-implantation development, with their activation linked to major developmental transitions. TE expression is associated with early cell fate decisions and the transition from totipotency to pluripotency [10]. Single-cell transcriptomic analyses have revealed that endogenous retroviruses (ERVs) are preferentially expressed during minor embryonic genome activation, whereas other TE classes such as Alu elements increase expression at later developmental stages [11]. These findings underscore the importance of resolving TE accessibility with both temporal and locus-level precision.
Beyond embryogenesis, TEs interact extensively with the immune system and contribute to innate immune signaling. Because TE-derived nucleic acids share structural similarities with viral genomes, their activation can trigger pattern recognition receptors and induce type I interferon (IFN) responses. TE-mediated immune activation has been implicated in antiviral defense, aging, and autoimmune diseases characterized by chronic IFN signaling [12]. In cancer, TEs have emerged as important regulatory components. Cancer cells can exploit TE-derived regulatory sequences to recruit lineage-specific transcription factors and rewire transcriptional programs [13]. In particular, young retrotransposons such as LINE-1 and SVA elements have been implicated in chromatin remodeling, transcriptional dysregulation, and tumor heterogeneity. These activities position TEs as both drivers of tumor evolution and potential therapeutic vulnerabilities [14–16].
Despite these advances, locus-specific TE accessibility has not been systematically explored in single-cell chromatin accessibility data. Existing scATAC-seq frameworks such as Signac [17] and ArchR [18] provide powerful gene-centric analyses but generally lack dedicated functionality for quantitative profiling of TE accessibility at individual insertion sites. While some studies have incorporated repetitive elements at the family or subfamily level, locus-level analysis that enables construction of a cell × TE locus accessibility matrix and supports standard single-cell analytical workflows—including differential accessibility testing, dimensionality reduction, and embedding-based heterogeneity analysis—has not been systematically implemented. In addition, validation of single-cell–derived TE patterns against independent bulk datasets and large clinical cohorts is important to establish reproducibility and to assess potential clinical relevance. To address this methodological gap, we developed scTELL (single-cell Transposable Element Locus Level analysis), a computational framework that enables quantitative assessment of locus-specific TE accessibility from scATAC-seq data. By allowing direct cell-to-cell comparisons of TE accessibility at individual genomic loci, scTELL provides a means to investigate how TEs shape epigenomic landscapes across diverse cellular contexts. We validated scTELL across multiple cancer types and biological systems, demonstrating its ability to identify biologically relevant TE loci and uncover their contributions to cancer heterogeneity. This locus-level framework advances our understanding of TE-mediated regulation in both normal and pathological states and highlights opportunities for dissecting epigenomic diversity in precision medicine.

Methods


Single-cell ATAC-seq data processing
All single-cell ATAC-seq datasets were generated using the 10x Genomics Chromium platform and processed with the CellRanger-ATAC pipeline, which employs BWA-MEM for read alignment with unique mapping only. For the PBMC, ccRCC Discovery, and BC datasets, we utilized pre-processed fragment files from the original data sources. For the ccRCC Validation dataset, raw FASTQ files were processed in-house using CellRanger-ATAC with default parameters. Detailed processing information, including reference genome versions, for each dataset is provided in Supplementary Table S4. Briefly, the PBMC dataset was processed using the hg19 reference genome, while the ccRCC and BC datasets were aligned to GRCh38. Since scTELL operates at the locus level (where each TE insertion is treated as a unique genomic coordinate) rather than aggregating signals across TE family copies as in element-level methods, standard unique mapping is appropriate for our analytical approach. Individual TE loci, particularly older elements, have accumulated sufficient sequence divergence to be uniquely mappable, and our requirement that analyzed loci overlap with called ATAC-seq peaks further ensures robust, unambiguous accessibility signals.
Fragment files were subsequently processed using the ArchR pipeline [18]. High-quality cells were selected through fragment filtering (Transcription Start Site (TSS) enrichment score ≥ 4, unique fragments ≥ 1,000), followed by iterative Latent Semantic Indexing (LSI) and Uniform Manifold Approximation and Projection (UMAP) embedding based on TileMatrix. Peak calling was performed using MACS2 [19].
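As an illustration of the cell-level quality filter described above, the following minimal sketch applies the two stated cutoffs (TSS enrichment ≥ 4, unique fragments ≥ 1,000) to a toy table of per-cell metrics. The `CellQC` record, the barcodes, and the metric values are hypothetical; in the study this filtering was performed within ArchR, not with this code.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell QC metrics (hypothetical container, not an ArchR object)."""
    barcode: str
    tss_enrichment: float
    n_fragments: int


def passes_qc(cell: CellQC, min_tss: float = 4.0, min_fragments: int = 1000) -> bool:
    # Both thresholds are inclusive, matching ">= 4" and ">= 1,000" in the text.
    return cell.tss_enrichment >= min_tss and cell.n_fragments >= min_fragments


cells = [
    CellQC("AAAC-1", 6.2, 5400),  # passes both filters
    CellQC("AAAG-1", 3.1, 8000),  # fails TSS enrichment
    CellQC("AACT-1", 9.0, 640),   # fails fragment count
]
kept = [c.barcode for c in cells if passes_qc(c)]
print(kept)
```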

10x PBMC ATAC v2 dataset
Cell type annotation was performed based on marker gene accessibility patterns in the 10x Genomics PBMC v2 dataset. Key marker genes included CD8A for CD8+ T cells and NK cells, MS4A1 for B cells, NCR1 for NK cells, FLT3 for monocytes, and MAL for CD4+ T cells. Differential accessibility analysis was conducted for each cluster (false discovery rate (FDR) < 0.01 and log2 fold change (Log2FC) ≥ 1), and clusters exhibiting high accessibility of these marker genes were annotated as the corresponding cell type.

ccRCC dataset
The clear cell renal cell carcinoma (ccRCC) analysis incorporated two independent cohorts [20, 21]: a discovery set comprising 54,966 cells from 19 samples (GSE207493; pre-processed fragment files, CellRanger-ATAC v1.2.0, GRCh38) and a validation set of 26,795 cells from 3 samples (PRJNA768891; processed in-house using CellRanger-ATAC with default parameters, GRCh38). Cell type annotation was based on marker gene expression from paired scRNA-seq data, with annotations transferred to scATAC-seq data using a label transfer approach. Cell type nomenclature was standardized between discovery and validation sets for systematic comparison (Supplementary Table S1).

BC dataset
We utilized breast cancer (BC) scATAC-seq data from the Human Tumor Atlas Network (HTAN) comprising 147,183 cells collected from 24 primary tumor samples (Supplementary Table S2: breast cancer sample list). Pre-processed fragment files (Level 3 data, CellRanger-ATAC v2.0.0, GRCh38) and corresponding metadata were aggregated across all samples. For tumor heterogeneity analysis, we focused on the tumor cell compartment, consisting of 90,322 cells identified through cell type annotation.

Bulk ATAC-seq datasets

We utilized two independent bulk ATAC-seq datasets for validation of single-cell findings. First, Fluorescence-Activated Cell Sorting (FACS)-sorted PBMC populations from GSE118189 [22] were used to validate cell type-specific TE patterns. Additionally, we analyzed pan-cancer ATAC-seq data from The Cancer Genome Atlas (TCGA) ATAC-seq working group [23], integrating peak-level normalized signal matrices with clinical information to investigate associations between TE accessibility patterns and patient outcomes.

scTELL framework


Locus-level TE accessibility quantification

Target TE selection and filtering
TE loci were selected based on RepeatMasker annotation [24, 25], with distinct filtering criteria applied according to the biological context of each dataset. For cancer studies (BC and ccRCC), we focused on three major classes of young, structurally intact TEs with potential regulatory activity: L1 family elements (L1HS through L1PA8A) exceeding 5 kb in length, SVA elements (SVA_A through SVA_F) longer than 700 bp, and HERVK elements greater than 5 kb (Supplementary Table S3). These size thresholds were selected to enrich for structurally intact elements capable of regulatory activity. The focus on these young, structurally intact transposable elements in our scATAC-seq analyses was motivated by both methodological considerations and biological relevance in cancer contexts. In scATAC-seq data, accessibility signals are detected as peak regions representing aggregated transposition events rather than precise nucleotide-level boundaries of open chromatin. Consequently, the detected peak summit does not always coincide exactly with the underlying functional regulatory site. To account for this inherent positional uncertainty, we prioritized transposable element insertions exceeding 5 kb in length, which provide a broader and more clearly defined genomic context for locus-level accessibility scoring using a distance-weighted framework. In contrast, very short elements such as solo LTRs often reside in close proximity to conventional regulatory regions, making it more difficult to confidently attribute detected accessibility peaks to a specific TE locus under sparse single-cell conditions. Longer TE insertions therefore enable more robust attribution of accessibility signals to the TE itself. From a biological perspective, young LINE-1 families such as L1HS and L1PA are of particular interest in cancer, where their reactivation has been linked to chromatin remodeling, transcriptional dysregulation, and genomic instability.
Focusing on these elements allowed us to evaluate scTELL in contexts where locus-specific TE accessibility is most likely to reflect functional regulatory activity. This analytical focus does not preclude regulatory roles for shorter elements or solo LTRs but reflects a design choice aligned with both the characteristics of scATAC-seq data and the biological systems examined in this study. In contrast, for the PBMC analysis, we conducted a comprehensive examination of Long Terminal Repeat (LTR) class elements (repClass == 'LTR'), given their established regulatory roles in immune cell function. To account for potential regulatory regions, all selected TE loci were extended to include upstream 1 kb regions. To ensure robust accessibility assessment, we retained only those loci that demonstrated overlap with single-cell ATAC-seq peaks for final analysis. Representative TE families highlighted in each dataset were chosen to align with prior biological knowledge specific to the corresponding system (e.g., immune-associated LTRs in PBMCs and young LINE-1/SVA elements in cancer), rather than reflecting a limitation of the scTELL framework.
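The selection steps for the cancer datasets (class and length filter, 1 kb upstream extension, peak-overlap requirement) can be sketched as follows. This is a minimal illustration, not the scTELL implementation: the `TE` record, the explicit list of L1 subfamily names standing in for "L1HS through L1PA8A", the prefix match for HERVK, and the strand-aware reading of "upstream" are all assumptions made for the example.

```python
from typing import NamedTuple


class TE(NamedTuple):
    chrom: str
    start: int   # 0-based, half-open interval [start, end)
    end: int
    strand: str
    name: str    # RepeatMasker repName (hypothetical field layout)


# Assumed expansion of "L1HS through L1PA8A"; verify against the annotation in use.
YOUNG_L1 = {"L1HS", "L1PA2", "L1PA3", "L1PA4", "L1PA5",
            "L1PA6", "L1PA7", "L1PA8", "L1PA8A"}
SVA = {f"SVA_{c}" for c in "ABCDEF"}


def is_cancer_target(te: TE) -> bool:
    """Length thresholds from the text: L1 and HERVK > 5 kb, SVA > 700 bp."""
    length = te.end - te.start
    if te.name in YOUNG_L1 or te.name.startswith("HERVK"):
        return length > 5000
    if te.name in SVA:
        return length > 700
    return False


def extend_upstream(te: TE, upstream: int = 1000) -> TE:
    """Add 1 kb upstream; treating 'upstream' as strand-aware is an assumption."""
    if te.strand == "-":
        return te._replace(end=te.end + upstream)
    return te._replace(start=max(0, te.start - upstream))


def overlaps_peak(te: TE, peaks) -> bool:
    return any(c == te.chrom and s < te.end and e > te.start for c, s, e in peaks)


def select_loci(tes, peaks):
    extended = [extend_upstream(te) for te in tes if is_cancer_target(te)]
    return [te for te in extended if overlaps_peak(te, peaks)]


tes = [
    TE("chr1", 10_000, 16_100, "+", "L1PA2"),  # long enough; peak lies in upstream 1 kb
    TE("chr1", 50_000, 52_000, "+", "L1PA3"),  # too short for the L1 threshold
    TE("chr2", 5_000, 6_500, "-", "SVA_D"),    # long enough, but no overlapping peak
]
peaks = [("chr1", 9_200, 9_700)]
selected = select_loci(tes, peaks)
print(selected)
```

In this toy input only the first element survives, and it does so because of the upstream extension: the peak does not touch the TE body itself.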

Accessibility score calculation
To quantify locus-level TE accessibility, we implemented a distance-weighted approach based on ArchR's GeneScore methodology [18]. The accessibility score for each TE locus was calculated by integrating ATAC-seq signal intensities within a defined genomic window, weighted by their linear genomic distance from the TE region. Specifically, we employed an exponential decay weighting function:

$$w(x) = e^{-|x|/d}$$

where x represents the distance from the TE boundary, and d denotes the characteristic decay distance. This weighting scheme assigns higher importance to proximal regions while accounting for distance-dependent signal contributions.
For cancer-specific analyses (BC, ccRCC), we utilized a decay distance of 1 kb with 50 bp resolution tiles, extending the analysis window to ± 1 kb from the TE boundaries. For immune cell populations (PBMC), where we analyzed typically shorter LTR elements, we employed shorter-range parameters (500 bp decay distance, 5 bp tiles, 500 bp upstream window) to better capture the local regulatory patterns of these compact elements. These parameters were selected based on the structural characteristics of target TE classes: LTR elements analyzed in PBMCs have a median length of 321 bp (n = 12,414 loci), whereas elements analyzed in cancer datasets (L1, SVA, HERVK) have a median length of approximately 6 kb (BC: 6,024 bp, n = 1,388; ccRCC: 6,023 bp, n = 955) (Supplementary Table S5).
The final accessibility score A for each TE locus was computed as:

$$A = \sum_{i} w(x_i)\, s_i$$

where $w(x_i)$ is the distance-decay weight of tile i, located at distance $x_i$ from the TE boundary, and $s_i$ represents the normalized ATAC-seq signal at that position. This integrative approach enables robust quantification of TE regulatory activity while accounting for local chromatin architecture and technical variation in signal distribution.
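A minimal sketch of the distance-weighted score for one TE locus in one cell is given below, using the cancer-dataset parameters (1 kb decay distance, ± 1 kb window). It assumes the weight takes the plain exponential form exp(−x/d), that tiles inside the TE body have distance 0, and that each tile is summarized by a single position and a normalized signal value; the function names and toy data are illustrative only.

```python
import math


def distance_to_te(pos: int, te_start: int, te_end: int) -> int:
    """Linear distance from a tile position to the nearest TE boundary (0 inside the TE)."""
    if pos < te_start:
        return te_start - pos
    if pos > te_end:
        return pos - te_end
    return 0


def te_score(tiles, te_start: int, te_end: int,
             decay: float = 1000.0, window: int = 1000) -> float:
    """Sum of tile signals weighted by exp(-distance / decay) within the window."""
    score = 0.0
    for pos, signal in tiles:
        x = distance_to_te(pos, te_start, te_end)
        if x <= window:  # tiles beyond the window contribute nothing
            score += math.exp(-x / decay) * signal
    return score


# (tile position, normalized signal) for one cell; toy values.
tiles = [
    (5_000, 2.0),   # inside the TE body: weight 1
    (11_000, 1.0),  # 1 kb past the TE end: weight exp(-1)
    (12_001, 5.0),  # outside the +/- 1 kb window: ignored
]
score = te_score(tiles, te_start=4_000, te_end=10_000)
print(round(score, 4))
```

Repeating this calculation for every cell and every retained locus yields the cell × TE locus matrix used downstream.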

Evaluation of the scTELL scoring approach
To evaluate the performance of scTELL's distance-weighted scoring, we compared it against two alternative approaches using the PBMC dataset.
First, we implemented a "Simple Sum" baseline that calculated the unweighted sum of peak scores overlapping each TE locus, using an identical 500 bp upstream window for each TE locus but without applying distance-based decay weighting. Second, we compared scTELL-derived TE accessibility matrices against ArchR's GeneScoreMatrix computed using default parameters for genic regions, providing a reference for established gene-level accessibility quantification. Cell-type clustering quality was assessed using silhouette scores calculated from the first 30 principal components of each matrix.
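For readers unfamiliar with the metric, the mean silhouette score can be sketched in plain Python as below. The sketch takes any per-cell coordinates (in the study, the first 30 principal components) and cell-type labels; Euclidean distance and the toy two-dimensional points are assumptions of the example.

```python
import math


def silhouette_score(points, labels) -> float:
    """Mean silhouette over all cells, using Euclidean distance."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton clusters contribute a silhouette of 0
        # a: mean distance to cells of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the cells of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items() if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

Scores close to 1 indicate compact, well-separated cell types, while values near or below 0 indicate that the labels do not match the structure of the embedding.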
To assess whether TE-based clustering signals are confounded by genic open chromatin, we performed two complementary filtering analyses. In the Read Filtering approach, all scATAC-seq reads mapping to annotated gene bodies (TxDb.Hsapiens.UCSC.hg19.knownGene) were excluded by incorporating gene regions into the blacklist prior to TE score matrix calculation. In the Locus Filtering approach, TE loci overlapping with annotated gene regions were excluded from analysis. A conservative variant additionally excluded TE loci within 5 kb of any gene boundary. Silhouette scores were calculated for each filtered configuration to quantify the impact of genic signal removal on cell-type separation.
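The Locus Filtering step and its conservative 5 kb variant amount to an interval test against gene annotations. The sketch below is illustrative only: the tuple layout, the half-open coordinates, and the example loci are assumptions, and the study derived gene regions from TxDb.Hsapiens.UCSC.hg19.knownGene.

```python
def touches_gene(te, genes, pad: int = 0) -> bool:
    """True if the TE overlaps any gene interval widened by `pad` bp on both sides."""
    chrom, start, end = te
    return any(
        g_chrom == chrom and g_start - pad < end and start < g_end + pad
        for g_chrom, g_start, g_end in genes
    )


genes = [("chr1", 100_000, 120_000)]
te_loci = {
    "inside": ("chr1", 105_000, 105_300),  # within the gene body
    "near": ("chr1", 122_000, 122_300),    # 2 kb downstream of the gene
    "far": ("chr1", 200_000, 200_300),     # far from any gene
}

locus_filtered = [n for n, te in te_loci.items() if not touches_gene(te, genes)]
conservative = [n for n, te in te_loci.items() if not touches_gene(te, genes, pad=5_000)]
print(locus_filtered, conservative)
```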

Cluster-centric analysis

To identify cluster (cell type)-specific TE accessibility patterns, we performed differential accessibility testing using the ArchR framework [18]. Since each dataset (PBMC, ccRCC, and BC) exhibits distinct biological complexity, noise profiles, and analytical goals, we applied dataset-specific statistical thresholds tailored to the characteristics of each. This approach allowed us to balance sensitivity and specificity and to ensure the robustness and reproducibility of the identified TE loci. For the PBMC dataset analysis, we employed a less stringent statistical threshold (FDR < 0.05, Log2FC > 0.5) as this dataset comprises well-defined immune cell populations with established markers and lower cellular heterogeneity. For downstream applications such as motif analysis and bulk validation, more stringent thresholds were applied to prioritize high-confidence markers (detailed in Sects. 6.2 and 7 below). In contrast, for the ccRCC analysis, we implemented a stringent cross-validation approach with more rigorous thresholds (discovery: FDR < 0.001, Log2FC > 1; validation: FDR < 0.01, Log2FC > 1) to ensure robust tumor-specific findings across independent patient cohorts. The BC dataset, characterized by high inter-patient heterogeneity, required the most stringent criteria (FDR < 0.0001, Log2FC > 1, Area Under the Curve (AUC) > 0.55) to identify TE loci that reliably distinguish patient-specific patterns and correlate with clinical outcomes. These tailored statistical approaches reflect the distinct biological questions being addressed in each dataset.
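The dataset-specific cutoffs above can be collected in one lookup table, which makes the differences between datasets easy to audit. The sketch below encodes the thresholds exactly as stated; the dictionary keys, function names, and example statistics are hypothetical, and the differential testing itself was carried out in ArchR.

```python
# Strict inequalities, as written in the text: FDR < cutoff, Log2FC > cutoff, AUC > cutoff.
THRESHOLDS = {
    "PBMC": {"fdr": 0.05, "log2fc": 0.5, "auc": None},
    "ccRCC_discovery": {"fdr": 0.001, "log2fc": 1.0, "auc": None},
    "ccRCC_validation": {"fdr": 0.01, "log2fc": 1.0, "auc": None},
    "BC": {"fdr": 1e-4, "log2fc": 1.0, "auc": 0.55},
}


def passes(dataset: str, fdr: float, log2fc: float, auc=None) -> bool:
    t = THRESHOLDS[dataset]
    if not (fdr < t["fdr"] and log2fc > t["log2fc"]):
        return False
    if t["auc"] is not None:
        # An AUC cutoff applies (BC only); a missing AUC fails the test.
        return auc is not None and auc > t["auc"]
    return True


def ccrcc_cross_validated(discovery, validation) -> bool:
    """A ccRCC locus must pass in both the discovery and the validation cohort."""
    return passes("ccRCC_discovery", *discovery) and passes("ccRCC_validation", *validation)


print(passes("PBMC", fdr=0.03, log2fc=0.8))
print(passes("BC", fdr=1e-5, log2fc=1.5, auc=0.50))
print(ccrcc_cross_validated((1e-4, 1.2), (0.005, 1.1)))
```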

Cluster-free analysis for tumor heterogeneity

To capture patient-specific TE accessibility patterns within tumor cell populations, we implemented a density-based analytical approach that does not rely on discrete clustering. This method enables identification of TE loci that exhibit shared activity patterns among patient subgroups while showing distinct patterns in others.
We adapted the singleCellHaystack algorithm [26] to detect non-random TE accessibility patterns using Kullback–Leibler divergence. The analytical framework consists of the following steps:

Grid-based density estimation
The two-dimensional embedding space is partitioned into a 50 × 50 grid. At each grid point x, we estimate the background cell density distribution Q(x) using Gaussian kernel density estimation:

$$Q(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)$$

where K is the Gaussian kernel function with bandwidth h, $x_i$ is the embedding coordinate of cell i, and n is the total number of cells.

For each TE locus t, we calculate two conditional density distributions:
- $P_t(x \mid s = 1)$: density of cells where the TE is active
- $P_t(x \mid s = 0)$: density of cells where the TE is inactive

Divergence calculation
The Kullback–Leibler divergence is calculated as:

$$D_{KL}(t) = \sum_{s \in \{0, 1\}} \sum_{x} P_t(x \mid s) \log \frac{P_t(x \mid s)}{Q(x)}$$

where x runs over all grid points and s represents the binary activity state of the TE locus.

Statistical assessment
The significance of observed $D_{KL}$ values is evaluated through a permutation-based approach:
- For each TE locus, activity scores are randomly shuffled across cells
- $D_{KL}$ is recalculated for each permutation
- Empirical P-values are computed by comparing observed $D_{KL}$ to the null distribution

We employ a stringent statistical threshold (P-value < 1e−10) to identify TE loci exhibiting significant non-random accessibility patterns. This approach enables detection of complex heterogeneity patterns that may not be captured by traditional clustering methods, providing insights into patient-specific TE regulatory mechanisms.
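The three steps above (grid density, divergence, permutation test) can be sketched in plain Python as follows. This is a simplified illustration and not the singleCellHaystack code: it uses a 10 × 10 grid and 100 permutations to stay fast, normalizes each density over the grid points, and adds a small constant inside the logarithm for numerical safety, all of which are assumptions of the example. A purely empirical P-value from N permutations cannot fall below 1/(N + 1), so this sketch cannot by itself reach a threshold such as 1e−10.

```python
import math
import random


def make_grid(lo: float, hi: float, n: int):
    step = (hi - lo) / (n - 1)
    axis = [lo + i * step for i in range(n)]
    return [(x, y) for x in axis for y in axis]


def grid_density(cells, grid, h: float):
    """Gaussian kernel density at each grid point, normalized to sum to 1."""
    dens = [
        sum(math.exp(-((gx - cx) ** 2 + (gy - cy) ** 2) / (2 * h * h)) for cx, cy in cells)
        for gx, gy in grid
    ]
    total = sum(dens)
    return [d / total for d in dens]


def kl_score(cells, active, grid, h: float, eps: float = 1e-12) -> float:
    """Sum over both activity states of KL(P(x | state) || Q(x))."""
    q = grid_density(cells, grid, h)
    score = 0.0
    for state in (True, False):
        subset = [c for c, a in zip(cells, active) if a == state]
        if not subset:
            continue
        p = grid_density(subset, grid, h)
        score += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return score


def permutation_test(cells, active, grid, h: float, n_perm: int = 100, seed: int = 0):
    rng = random.Random(seed)
    observed = kl_score(cells, active, grid, h)
    shuffled = list(active)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between activity and position
        if kl_score(cells, shuffled, grid, h) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy embedding: the TE is active in one blob of cells and inactive in the other.
rng = random.Random(1)
cells = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(20)] + \
        [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(20)]
active = [True] * 20 + [False] * 20
observed, pval = permutation_test(cells, active, make_grid(-2, 7, 10), h=1.0)
print(round(observed, 3), round(pval, 4))
```

Because no cluster labels enter the calculation, a locus that is active in a spatially coherent subset of cells scores highly even when that subset does not coincide with any annotated cluster.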

Bulk ATAC-seq validation

To validate the biological relevance of TE accessibility patterns identified at single-cell resolution, we implemented a systematic validation approach using bulk ATAC-seq datasets. Our validation strategy incorporated both cell type-specific patterns in normal tissue and clinical correlations in cancer contexts.

Locus-level quantification of TE accessibility in bulk data
Note that this bulk “extended region” definition differs from the distance-weighted windows used for single-cell locus quantification (Methods 3.1.2). Single-cell scoring is designed to quantify cell-by-locus accessibility using distance-decay weighting around each TE insertion, whereas bulk validation uses an upstream-extended region (TE body + 1 kb upstream) as a pragmatic window to match TE loci to bulk ATAC-seq peaks. These definitions serve different analytical purposes and are not intended to be identical.
For each TE locus, we defined an extended regulatory region encompassing the full TE sequence and its 1 kb upstream region to capture potential cis-regulatory elements. ATAC-seq peaks overlapping with these extended regions were identified and assigned to corresponding TE loci. For accessibility quantification, we utilized the complete signal intensity of each overlapping peak as a measure of locus-specific activity.

Validation in normal cell types
We utilized bulk ATAC-seq data from FACS-sorted PBMC populations [22] to validate cell type-specific TE patterns. For each cell type, we identified the top 20 marker TE loci from single-cell analysis based on differential accessibility (FDR < 0.001, Log2FC ≥ 1, AUC > 0.6). We then extracted counts from bulk ATAC-seq peaks overlapping these TE loci (including 1 kb upstream regions to capture regulatory elements), performed z-score normalization across samples, and visualized the accessibility patterns as a heatmap to assess concordance between single-cell predictions and bulk validation data.
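The per-locus z-score normalization across bulk samples can be sketched as below. Using the sample standard deviation (as R's `scale` does) and returning zeros for constant rows are assumptions of the example; the counts are toy values.

```python
import statistics


def zscore_rows(matrix):
    """Z-score each row (one TE locus) across its columns (bulk samples)."""
    normalized = []
    for row in matrix:
        mu = statistics.mean(row)
        sd = statistics.stdev(row)  # sample standard deviation (n - 1)
        normalized.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return normalized


# Rows: marker TE loci; columns: sorted bulk populations (toy counts).
counts = [
    [1.0, 2.0, 3.0],
    [10.0, 10.0, 40.0],
]
z = zscore_rows(counts)
print(z[0])
```

After this step each row has mean 0, so the heatmap highlights in which sorted population a locus is relatively most accessible rather than its absolute signal.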

Clinical validation in cancer
We leveraged TCGA ATAC-seq data [23] to assess the clinical relevance of heterogeneity-associated TE loci identified from single-cell analysis. For BC, TE loci showing significant heterogeneity patterns in cluster-free analysis (p < 1e-10, n = 996) were ranked by statistical significance, and the top 100 loci were selected for clinical validation. For ccRCC, all 131 significant loci were used for validation.
For each TE locus, we identified TCGA ATAC-seq peaks overlapping with the extended regulatory region (TE sequence + 1 kb upstream). Patients were stratified into high and low accessibility groups using the median peak signal intensity as a cutoff. Survival associations were evaluated using Kaplan–Meier analysis with log-rank tests for overall survival (OS) and progression-free interval (PFI), with a significance threshold of p < 0.05.
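The median stratification and two-group log-rank test can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: patients with a signal exactly equal to the median are placed in the low group, the P-value uses the chi-square approximation with one degree of freedom, and the signals and follow-up times are toy values rather than TCGA data.

```python
import math
import statistics


def median_split(signal):
    """True for the high-accessibility group (signal strictly above the median)."""
    cutoff = statistics.median(signal)
    return [s > cutoff for s in signal]


def logrank_test(times, events, is_high):
    """Two-group log-rank test; returns (chi-square statistic, P-value)."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    observed = expected = variance = 0.0
    for t in event_times:
        at_risk = [g for tt, g in zip(times, is_high) if tt >= t]
        n, n_high = len(at_risk), sum(at_risk)
        d = sum(1 for tt, e in zip(times, events) if e and tt == t)
        d_high = sum(1 for tt, e, g in zip(times, events, is_high) if e and tt == t and g)
        observed += d_high
        expected += d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    stat = (observed - expected) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))


signal = [9.0, 8.0, 7.0, 6.0, 5.5, 1.0, 2.0, 3.0, 4.0, 5.0]  # peak signal per patient
times = [1, 2, 3, 4, 5, 10, 11, 12, 13, 14]                   # follow-up time
events = [1] * 10                                             # 1 = event observed
groups = median_split(signal)
stat, pval = logrank_test(times, events, groups)
print(round(stat, 2), round(pval, 4))
```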
Detailed TE-peak associations, peak-TE distances, nearest gene annotations, median cutoff values, and survival analysis results are provided in Supplementary Tables S6–S9. An overview of the TCGA bulk ATAC-seq matching and survival analysis workflow is provided in Supplementary Figure S7.

Transcription factor motif analysis

To characterize the regulatory landscape surrounding TE loci, we performed motif enrichment analysis using an unbiased framework based on the full JASPAR 2020 CORE vertebrates database of transcription factor binding site (TFBS) motifs [27, 28]. For all analyses, background peaks were selected using MatchRegionStats to match GC content and sequence length distributions of query peaks, minimizing technical bias. Statistical significance was assessed using a hypergeometric test with Benjamini–Hochberg correction.
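The statistical core of this enrichment test, a one-sided hypergeometric test followed by Benjamini–Hochberg correction, can be sketched as below. The framing of the counts (motif-containing peaks among query peaks drawn from the combined query and background set) and the function names are assumptions of the example, and background matching itself is not reproduced here.

```python
from math import comb


def hypergeom_pvalue(k: int, n_query: int, n_motif: int, n_total: int) -> float:
    """P(X >= k): chance of seeing at least k motif-containing peaks among
    n_query peaks drawn from n_total peaks, n_motif of which carry the motif."""
    upper = min(n_motif, n_query)
    favourable = sum(
        comb(n_motif, i) * comb(n_total - n_motif, n_query - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_query)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = hypergeom_pvalue(k=5, n_query=5, n_motif=5, n_total=10)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p, adj)
```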
For cell type-specific analysis in PBMCs, we identified ATAC-seq peaks overlapping marker TE loci for each cell type (B cells, CD4+ T cells, CD8+ T cells, NK cells, and Monocytes), defined by differential accessibility analysis (FDR < 0.01, Log2FC > 1). Motif enrichment was calculated by comparing motif occurrence in these cell type-specific TE-associated peaks against matched background peaks (Supplementary Table S10).
For family-level analysis, all accessible loci within each of 466 TE families were used as query, with matched peaks from the full peak set as background (Supplementary Figure S6a, Supplementary Table S11). To assess within-family heterogeneity, we examined TE families with marker loci in multiple cell types (n = 7 families with ≥ 20 marker loci in ≥ 2 cell types: MLT1A0, MLT1B, MLT1C, MLT1D, MLT1K, MLT1L, and MSTA). Motif enrichment was calculated separately for each cell type-family combination to determine whether loci within the same family recruit distinct regulatory programs across cellular contexts (Supplementary Figure S6b, Supplementary Table S12).
To evaluate TE-specificity of enriched motifs, we compared TE-associated peaks (n = 7,691, excluding those overlapping gene promoters) against gene promoter peaks (TSS ± 2 kb, n = 27,898; defined using EnsDb.Hsapiens.v75). Motifs with fold enrichment > 2 and adjusted P < 0.05 were classified as TE-enriched (Supplementary Figure S6c, Supplementary Table S13).

Data visualization

Data visualization was performed using several R packages: ComplexHeatmap for generating cell-level and sample-level TE locus activity heatmaps, ggpubr for bar plots and statistical visualizations, Rideogram [29] for chromosome ideogram plots, Signac [17] for coverage plots of genomic features, and ArchR [18] for single-cell UMAP visualizations.
For chromosomal ideogram visualization, we employed a dual-color approach to represent cell type-specific TE regulatory patterns and their genomic context. The primary color represents monocyte-specific accessibility significance, calculated as −log10(FDR) multiplied by the sign of Log2FC from differential accessibility analysis between monocytes and other cell types. This metric integrates both statistical significance and effect direction, with positive values (darker colors) indicating regions highly accessible in monocytes. The secondary color displays LTR element density, calculated as the number of LTR elements per 1 Mb genomic window across the human genome. This visualization method enables simultaneous assessment of cell type-specific regulatory elements against the background of genomic TE distribution.
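The two quantities plotted on the ideogram can be sketched as follows. Flooring the FDR to avoid taking the logarithm of zero and assigning each LTR element to a window by its start coordinate are assumptions of the example; the input values are toy data.

```python
import math


def signed_significance(fdr: float, log2fc: float, floor: float = 1e-300) -> float:
    """-log10(FDR) multiplied by the sign of Log2FC (positive: more accessible in monocytes)."""
    sign = (log2fc > 0) - (log2fc < 0)
    return -math.log10(max(fdr, floor)) * sign


def ltr_density(ltr_starts, window: int = 1_000_000):
    """Number of LTR elements per 1 Mb window, keyed by (chromosome, window index)."""
    counts = {}
    for chrom, start in ltr_starts:
        key = (chrom, start // window)
        counts[key] = counts.get(key, 0) + 1
    return counts


print(signed_significance(0.001, 2.0), signed_significance(0.001, -2.0))
density = ltr_density([("chr1", 10), ("chr1", 999_999), ("chr1", 1_000_000), ("chr2", 5)])
print(density)
```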

Results


Overview of the scTELL framework
To overcome the difficulty associated with transcriptional analysis of TEs at single cell resolution, we designed scTELL, which stands for single-cell Transposable Element Locus-Level analysis. The computational model described herein allows estimating chromatin accessibility at TE loci based on single-cell ATAC-seq (scATAC-seq) data.
scTELL rests on two key hypotheses:

1. The chromatin structure of TE loci is connected to TE accessibility in the genome.

2. Variation in TE activity at the single-cell level reflects attributes of cellular heterogeneity, such as cell type and state.

For locus-level quantification of TE accessibility, scTELL calculates a distance-weighted accessibility score for each TE locus in each cell by integrating ATAC-seq signals within a defined window surrounding the locus (see Methods for details). This scoring scheme, adapted from the GeneScore function in ArchR [18], enables robust estimation of TE accessibility at single-cell resolution with locus specificity.
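A distance-weighted score of this kind can be sketched as below. This is not the scTELL implementation; the details are in the Methods. The sketch assumes a weight of the form exp(-d/scale) + exp(-1), which mirrors the shape of ArchR's documented default gene model, a hypothetical `scale_bp` of 5,000, a scoring window made of the TE body plus 500 bp upstream, and distance measured to the nearest edge of the locus (zero inside it). All function and parameter names are illustrative.

```python
import math


def decay_weight(distance_bp, scale_bp=5000.0):
    """Exponential distance decay with a constant floor (assumed form)."""
    return math.exp(-abs(distance_bp) / scale_bp) + math.exp(-1)


def te_locus_score(locus_start, locus_end, strand, insertions,
                   upstream_bp=500, scale_bp=5000.0):
    """Distance-weighted accessibility of one TE locus in one cell.

    insertions: iterable of (position, count) for Tn5 insertions or
    peak signal observed in that cell on the locus chromosome.
    """
    # Upstream is to the left on the plus strand, to the right on minus.
    if strand == "+":
        win_start, win_end = locus_start - upstream_bp, locus_end
    else:
        win_start, win_end = locus_start, locus_end + upstream_bp

    score = 0.0
    for pos, count in insertions:
        if not (win_start <= pos <= win_end):
            continue  # outside the scoring window
        if pos < locus_start:
            distance = locus_start - pos
        elif pos > locus_end:
            distance = pos - locus_end
        else:
            distance = 0  # inside the TE body
        score += count * decay_weight(distance, scale_bp)
    return score


# One cell, one locus: two insertions inside the TE, one 100 bp upstream,
# and three far away that fall outside the window.
cell_insertions = [(1000, 2), (900, 1), (5000, 3)]
print(te_locus_score(1000, 2000, "+", cell_insertions))
```

Repeating this for every cell and every locus yields the cell × TE locus matrix that the downstream analyses operate on.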
The resulting cell-by-locus scores were aggregated into a matrix summarizing TE locus activity across all single-cell data (Fig. 1a). This matrix served as the basis for downstream analyses, including cluster-centric analysis to identify cell type-specific TE loci (Fig. 1b), cluster-free analysis to detect non-random TE accessibility patterns in embedding space (Fig. 1c), and bulk ATAC-seq validation to independently confirm single-cell-derived patterns using population-level data (Fig. 1d).
scTELL quantifies TE accessibility at the individual cell level, generating a cell × TE locus matrix. This data structure mirrors the format used for gene activity scores in established single-cell frameworks such as Signac [17] and ArchR [18], enabling the direct application of standard single-cell genomics analytical strategies—including cell type comparison, clustering, and differential accessibility analysis—to transposable elements. In this study, we leveraged this capability to compare cell type-specific TE patterns in PBMCs, explore intratumoral heterogeneity in ccRCC and BC, and perform clinical validation by linking to TCGA bulk ATAC-seq data. Notably, locus-level resolution enables the discrimination of individual TE insertions that exhibit distinct activity patterns across cell types, even within the same TE family.

Evaluation of the scTELL scoring approach
To evaluate the contribution of our distance-weighted scoring approach, we performed a direct comparison between scTELL and a baseline “Simple Sum” approach using the same 500 bp upstream window surrounding each TE locus. The Simple Sum approach calculated an unweighted sum of overlapping peak scores, whereas scTELL applied distance-based decay weighting. Using the same set of 12,414 TE loci, scTELL outperformed the unweighted approach in both cell-type clustering quality and biological sensitivity. Specifically, scTELL achieved a silhouette score of 0.226 compared to 0.085 for the Simple Sum method, a 2.66-fold improvement (Supplementary Figure S4). This result suggests that distance-weighted scoring prioritizes functionally relevant accessibility signals over background noise. Notably, scTELL’s clustering performance was comparable to that obtained using the gene activity score matrix computed with ArchR’s default weighted model for genic regions (silhouette score = 0.227), indicating that scTELL captures cell type-specific TE accessibility with fidelity similar to established gene-level approaches. Given that many TEs are located within or near gene bodies, we next evaluated whether the observed TE-based clustering could be confounded by genic open chromatin signals.
To rigorously assess the potential confounding effects of genic open chromatin on TE-based clustering, we performed two complementary filtering analyses. First, we applied a read-level filtering approach in which all scATAC-seq reads mapping to annotated gene bodies were excluded by incorporating gene regions into the blacklist prior to computing the TE score matrix. Despite this stringent removal of genic reads, the UMAP projection continued to display clear cell-type-specific clustering patterns (Supplementary Figure S5b), indicating that the observed TE-derived signal is not merely an artifact of genic accessibility. Under this condition, the silhouette score decreased from 0.226 to 0.179, corresponding to an approximately 21% reduction. We interpret this decrease as an expected consequence of strict filtering, as many TEs located within genic regions, such as introns or promoters, are known to function as bona fide cis-regulatory elements. Consequently, the removal of all genic reads inevitably discards genuine biological signals in addition to potential noise.
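For readers less familiar with the metric, the silhouette score used in these comparisons can be sketched as follows. This is a generic textbook implementation on toy 2D points, not the evaluation code used in the study; the embedding, distance metric, and cell labels used by the authors are not restated here, so Euclidean distance is an assumption. The last lines only re-derive the reported fold change and relative reduction from the reported scores.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to any other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(toy_points, ["T", "T", "B", "B"]))  # well separated
print(silhouette_score(toy_points, ["T", "B", "T", "B"]))  # labels mixed

# Arithmetic on the scores reported in the text.
fold_improvement = 0.226 / 0.085           # scTELL vs Simple Sum
relative_drop = (0.226 - 0.179) / 0.226    # after read-level genic filtering
print(round(fold_improvement, 2), round(relative_drop, 2))
```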
To further validate this interpretation, we performed a locus-level filtering analysis in which TE loci overlapping annotated gene regions were excluded. When only intergenic TE loci were analyzed, the resulting silhouette score was 0.219 based on 10,996 loci (Supplementary Figure S5c,d, Supplementary Table S16). Applying a more conservative filtering strategy with a 5 kb buffer around gene regions yielded a silhouette score of 0.210 based on 10,033 loci. Notably, both values were substantially higher than the score obtained from the read-level filtering approach. Together, these results indicate that the reduction in clustering performance observed after read-level filtering reflects the loss of functionally relevant genic-proximal TEs rather than the removal of artifactual signals. Importantly, even after excluding genic contributions, TE-derived accessibility signals alone remained sufficient to drive robust cell-type separation (silhouette score = 0.179), supporting the biological relevance and robustness of scTELL-based TE profiling.
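The locus-level filter, with and without a buffer around genes, amounts to an interval-overlap test. The sketch below is a simplified illustration with invented coordinates; the function names are hypothetical, and the linear scan is only suitable for toy data (a real analysis would more likely use an interval tree or a genomic-ranges library).

```python
def overlaps_gene(chrom, start, end, genes, buffer_bp=0):
    """True if [start, end] overlaps any gene extended by buffer_bp."""
    for g_chrom, g_start, g_end in genes:
        if g_chrom != chrom:
            continue
        if start <= g_end + buffer_bp and end >= g_start - buffer_bp:
            return True
    return False


def intergenic_loci(te_loci, genes, buffer_bp=0):
    """Keep TE loci that do not overlap any (buffered) gene region."""
    return [
        locus for locus in te_loci
        if not overlaps_gene(locus[0], locus[1], locus[2], genes, buffer_bp)
    ]


# Invented example: one gene and four TE loci.
toy_genes = [("chr1", 10_000, 20_000)]
toy_loci = [
    ("chr1", 15_000, 15_500),  # inside the gene
    ("chr1", 22_000, 22_500),  # intergenic, but within 5 kb of the gene
    ("chr1", 40_000, 40_500),  # far from the gene
    ("chr2", 15_000, 15_500),  # different chromosome
]
print(len(intergenic_loci(toy_loci, toy_genes)))                  # strict overlap
print(len(intergenic_loci(toy_loci, toy_genes, buffer_bp=5000)))  # 5 kb buffer
```

Widening the buffer can only remove loci, which matches the drop from 10,996 to 10,033 loci reported above.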

Identification and functional characterization of cell-type-specific LTR elements in healthy PBMCs using scATAC-seq data
The filtering analyses above indicate that scTELL captures biologically meaningful TE accessibility patterns that are not solely driven by underlying gene-associated chromatin structure. We therefore used the 10x Genomics PBMC v2 ATAC-seq data to identify and characterize LTR elements active in different blood cell types. We selected LTRs from the RepeatMasker table that lie in the vicinity of peaks called from the scATAC-seq data, yielding 12,414 LTR loci for downstream analysis. ERVL-MaLR elements were the most abundant LTR class and remained the most frequent among the selected loci (Fig. 2a). To identify cell type-specific TE loci, we conducted a cluster-centric analysis across major blood cell types. By grouping cells according to known markers, this approach allowed us to pinpoint TE loci with distinct accessibility patterns in specific cell types, such as monocytes, CD4+ T cells, CD8+ T cells, B cells, and NK cells, giving a detailed view of how TE accessibility is associated with each cell type’s function and identity (Supplementary Table S17). To annotate cell types, we projected FACS-sorted bulk ATAC-seq data onto the UMAP of the scATAC-seq data (Fig. 2b). This annotation was validated using the activity levels of known cell type marker genes, including MS4A1, MAL, CD8A, FLT3, and NCR1 (Supplementary Figure S1). To summarize the cell-type-specific locus analysis, we generated a heatmap showing how LTR families segregate across blood cell subtypes (Fig. 2c). The largest number of LTR loci were active in monocytes, and ERVL-MaLRs displayed a high cell type preference. A chromosome ideogram of monocyte-specific ERVL-MaLR loci (Fig. 2d) shows that these elements are active across several chromosomes.
To further support our method, we sought to recapitulate the B cell-specific elements reported by scTE, in particular LTR2B. The majority of LTR2B loci were active exclusively in B cells; however, certain LTR2B loci were active in the CD8+ T cell or monocyte lineage (Fig. 2e, Supplementary Figure S2). This observation indicates that regulatory architecture is not a simple linear combination of elements, which motivates locus-level tools such as scTELL.
To investigate the regulatory mechanisms underlying these cell-type-specific accessibility patterns, we performed unbiased transcription factor (TF) motif enrichment analysis on ATAC-seq peaks overlapping the identified TE loci. This analysis revealed distinct TF motifs for each cell type (Fig. 2f). In CD4+ T cells, putative binding sites for KLF15, SP4, and LEF1 were significantly enriched. LEF1 is mostly expressed in lymphocytes and has an immunological function in the development of immunocytes and lymphocytes, which is consistent with the regulatory role of CD4+ T cells in immunity. In CD8+ T cells, RUNX1, EOMES, and TBX2 motifs were significantly enriched. RUNX1 is involved in the generation of both T and B cells [30–32], and EOMES is predominantly expressed in CD8+ T cells, where it contributes to the generation and function of cytotoxic T cells [33–36]. In B cells, ITF1, STAT1:STAT2, and SPIB motifs were enriched; ITF1 plays a role in immunoglobulin gene regulation and SPIB in B cell differentiation [37]. Monocytes showed enrichment for SPIC, CEBPD, and CEBPA motifs; CEBPA plays critical roles in hematopoiesis and monocyte function [38, 39]. Finally, NK cells showed enrichment for EOMES, TBX20, and TBX2 motifs; EOMES is essential for NK cell-mediated cytotoxicity [40].
Beyond cell type-level analysis, we evaluated whether scTELL can resolve motif signatures at additional levels of TE organization. Family-level analysis across 466 TE families revealed distinct TF signatures; for example, LTR12C was enriched for NFYA and NFYC, while LTR13A showed enrichment for USF1/USF2 (Supplementary Figure S6a, Supplementary Table S11). Notably, even within the same TE family, cell type-specific marker loci exhibited distinct motif enrichment patterns (Supplementary Figure S6b, Supplementary Table S12), underscoring the value of locus-level resolution. To assess whether these TE-associated motifs differ from gene promoters, we compared TE-associated peaks against gene promoter peaks (used as background) and identified 192 motifs significantly enriched in TE regions (fold enrichment > 2, adjusted P < 0.05), including OLIG2/3, DUXA, and JUNB (Supplementary Figure S6c, Supplementary Table S13). These results support the idea that TE-associated accessible regions exhibit a distinct regulatory grammar compared to gene promoters.
To ensure that cell type-specific loci detected at the single-cell level are credible, we validated these findings using independent FACS-sorted bulk ATAC-seq data [22]. Figure 2g shows that TE loci identified as cell type markers in scTELL analysis exhibit concordant accessibility patterns in the corresponding sorted cell populations, confirming the robustness of our single-cell predictions.

Tumor-specific TE loci in clear cell renal cell carcinoma (ccRCC)
Having established scTELL's ability to identify cell-type-specific TE accessibility patterns in healthy immune cells, we next evaluated its utility in a disease context. We applied scTELL to two sets of clear cell renal cell carcinoma (ccRCC) specimens, used as Discovery and Validation datasets, to identify cell-type-specific TE loci. The Discovery and Validation datasets contained 54,966 cells from 19 samples and 26,795 cells from 3 samples, respectively (Supplementary Tables S14–S15). First, we clustered the scATAC-seq data of the ccRCC samples and visualized them by UMAP to identify different cell types (Fig. 3a, b). Next, for each cell type, we examined the pattern of cell-type-specific TE loci (Fig. 3c). Differential accessibility statistics for cell type-specific TE loci in ccRCC are provided in Supplementary Tables S18 (Discovery) and S19 (Validation). We applied the TE selection criteria described in the Methods, targeting L1 family elements (L1HS through L1PA8A) longer than 5 kb, SVA elements (SVA_A through SVA_F) longer than 700 bp, and HERVK elements longer than 5 kb. This approach identified 955 TE loci and allowed us to concentrate on structurally intact elements with potential regulatory activity. Of particular interest, we identified an L1PA2 locus, chr20:8,595,101–8,601,127 (−), that showed tumor-specific activity in ccRCC. This locus was active only in the ccRCC cell clusters in both the Discovery and Validation datasets (Fig. 3d, e). Further investigation of the genomic region spanning ±5 kb around this L1PA2 locus revealed ccRCC-specific enrichment of chromatin accessibility, observed consistently in both datasets. Coverage plots highlighted this ccRCC-specific accessibility upstream of the locus (Fig. 3f, g).
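The length-based selection criteria can be written as a small predicate. This is a sketch under stated assumptions: the exact membership of "L1HS through L1PA8A" and the way HERVK annotations are matched (here, any RepeatMasker name beginning with `HERVK`) are guesses for illustration, and length is taken as end minus start. The coordinates in the example are loci named in this article.

```python
# Assumed expansion of "L1HS through L1PA8A"; the authoritative list is in the Methods.
YOUNG_L1 = {"L1HS", "L1PA2", "L1PA3", "L1PA4", "L1PA5",
            "L1PA6", "L1PA7", "L1PA8", "L1PA8A"}
SVA = {"SVA_A", "SVA_B", "SVA_C", "SVA_D", "SVA_E", "SVA_F"}


def passes_selection(subfamily, start, end):
    """Apply the family-specific minimum length thresholds from the text."""
    length = end - start
    if subfamily in YOUNG_L1:
        return length > 5000   # near full-length LINE-1
    if subfamily in SVA:
        return length > 700
    if subfamily.startswith("HERVK"):  # assumed name matching
        return length > 5000
    return False


examples = [
    ("L1PA2", 8_595_101, 8_601_127),   # chr20 locus highlighted in ccRCC
    ("SVA_F", 9_153_426, 9_154_908),   # chr8 locus from the cluster-free analysis
    ("L1PA2", 100, 3_000),             # truncated L1, too short
    ("LTR2B", 0, 10_000),              # family not targeted by these criteria
]
for name, start, end in examples:
    print(name, end - start, passes_selection(name, start, end))
```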
To further validate that scTELL captures genuine chromatin accessibility at young LINE-1 loci rather than artifacts of peak overlap, we examined genome browser views for multiple representative near full-length LINE-1 insertions. Across these loci, distinct ATAC-seq peaks were consistently observed in close proximity to the 5′ ends of the LINE-1 elements, typically within ~200 bp upstream (Supplementary File 1). These peaks fall within the genomic windows used for scTELL scoring and correspond to promoter-proximal regions of the elements, supporting the distance-weighted attribution of accessibility signals to individual TE insertions. All LINE-1 loci shown in the browser views were selected based on a length threshold of >5 kb to enrich for near full-length elements. RepeatMasker annotation together with putative ORF1 and ORF2 coordinate tracks confirmed that these representative loci retain both ORF1 and ORF2 regions, indicating that they correspond to structurally intact young LINE-1 insertions. Interestingly, the L1PA2 locus lies within the PLCB1 gene (phospholipase C, β1), and the ATAC-seq peak occupies an intron. Although a clear relationship between ccRCC and PLCB1 has not been established, PLCB1 is associated with the phosphoinositide 3-kinase/protein kinase B (PI3K/AKT) signaling pathway, which is involved in cellular migration, invasion, and epithelial-mesenchymal transition, all of which are important for cancer metastasis [41, 42]. To determine the clinical implications of these results, we analyzed bulk ATAC-seq data from the TCGA-ATAC (KIRC) cohort for chromatin accessibility at the L1PA2 locus. Patient accessibility was quantified based on the signal intensity of TCGA ATAC-seq peaks overlapping the extended regulatory region (TE sequence + 1 kb upstream).
Patients were stratified into ‘high’ and ‘low’ accessibility groups using the median signal intensity as the cutoff (Supplementary Table S7), and survival was compared by progression-free interval (PFI). Patients in the high-accessibility group had a shorter PFI than those in the low-accessibility group (log-rank P-value = 0.038) (Fig. 3h). This suggests that L1PA2 locus accessibility may serve as a candidate prognostic factor in ccRCC. Next, to examine inter-tumor heterogeneity, we restricted the analysis to tumor cells from the Discovery dataset and performed another UMAP analysis (Fig. 3i). Tumor cells from different patients formed separate groups, indicating that ccRCC is heterogeneous across patients. This observation is consistent with the features described in the original work. To explore the locus-specific activity of TEs and its relationship with tumor heterogeneity, we examined the association between TE locus activity and the UMAP coordinates. Given the large number of TE loci tested and the propensity for false positives in embedding-based spatial enrichment analyses, we used a stringent empirical cutoff (p < 1e-10) to prioritize robust non-random accessibility patterns. Among the loci identified in the ccRCC analysis, an SVA_F family locus (chr8:9,153,426–9,154,908, −) emerged as one of the top-ranked loci showing a strong statistical association with tumor heterogeneity. Using TCGA-ATAC KIRC data, we further performed survival analysis and found that patients with lower chromatin accessibility at this locus had a significantly shorter PFI (log-rank P-value = 0.017) (Supplementary Figure S3).
This observation represents a correlative association, and further experimental validation will be required to determine whether this locus plays any functional role in ccRCC progression.
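The stratification and comparison used for these survival analyses can be sketched as a median split followed by a two-group log-rank test. The code below is a from-scratch illustration on invented data, not the analysis pipeline used in the study; it assumes that ties at the median are assigned to the low group and uses the standard chi-square approximation with one degree of freedom.

```python
import math
from statistics import median


def median_split(values):
    """Label each patient 'high' or 'low' relative to the cohort median."""
    cutoff = median(values)
    return ["high" if v > cutoff else "low" for v in values]


def logrank_test(times, events, groups, group_a="high"):
    """Two-group log-rank test. Returns (chi-square statistic, P value).

    times: follow-up time per patient; events: 1 if progression/death
    was observed, 0 if censored; groups: group label per patient.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)  # at risk overall
        n_a = sum(1 for ti, g in zip(times, groups)
                  if ti >= t and g == group_a)  # at risk in group A
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d_a = sum(1 for ti, e, g in zip(times, events, groups)
                  if ti == t and e and g == group_a)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented cohort: high-accessibility patients progress earlier.
accessibility = [0.1, 0.2, 0.3, 0.4, 0.9, 1.0, 1.1, 1.2]
pfi_months = [10, 12, 14, 16, 1, 2, 3, 4]
progressed = [1, 1, 1, 1, 1, 1, 1, 1]
labels = median_split(accessibility)
print(labels)
print(logrank_test(pfi_months, progressed, labels))
```

As in the text, a small P value here indicates only that the two groups differ in outcome; it says nothing about whether accessibility at the locus drives that difference.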

Inter-patient heterogeneity in TE loci of breast cancer (BC)
To investigate inter-patient heterogeneity in breast cancer (BC), we applied scTELL to single-cell ATAC-seq data from 24 BC patient samples, focusing specifically on tumor cell populations to examine TE locus dynamics at the single-cell level. Given that breast cancer exhibits substantial inter-patient molecular diversity and that tumor cells from different patients often form distinct clusters rather than shared cell-type groups, we employed a cluster-free analytical approach. By avoiding assumptions of homogeneity within predefined cell types, this strategy enabled us to capture TE locus patterns that are associated with individual patients and reflect the molecular diversity of BC (Fig. 4a-b). Applying the same TE selection criteria as in ccRCC, we analyzed 1,388 TE loci across tumor cells. The UMAP projection revealed distinct grouping of cells by patient identity, indicating pronounced inter-patient divergence in TE-associated chromatin accessibility (Fig. 4a). In parallel, the heatmap of normalized locus activity showed that specific TE loci, including HERVK, L1PA, and SVA subfamilies, exhibited strong and variable activation across patients, further supporting patient-specific TE profiles (Fig. 4b). These findings underscore the importance of patient-level resolution in characterizing TE dysregulation in BC.
To further explore loci associated with clinical outcomes, we applied cluster-free analysis to identify TE loci with non-random distribution patterns across patient-specific UMAP coordinates (Supplementary Table S8). This analysis yielded six prognostic TE loci, each significantly associated with overall survival in BC patients (log-rank P-value < 0.05; Supplementary Table S9) (Fig. 4c-d). Kaplan–Meier survival analysis revealed two distinct patterns. For four loci, L1PA6 (chr1:196,986,363–196,992,657; BRCA_15537), L1PA5 (chr4:95,133,878–95,140,064; BRCA_53849), L1PA6 (chr8:73,234,866–73,241,084; BRCA_98982), and SVA_D (chr10:71,868,657–71,870,506; BRCA_120392), higher chromatin accessibility was associated with poorer survival (Fig. 4c). For two loci, L1PA4 (chr9:4,531,071–4,537,200; BRCA_107739) and L1PA7 (chr12:22,147,464–22,152,713; BRCA_138871), higher accessibility correlated with improved outcomes (Fig. 4d). Peak identifiers (e.g., BRCA_15537) correspond to row indices in the TCGA-BRCA peak count matrix [23]; note that “BRCA” here denotes the TCGA breast cancer cohort, not the BRCA1/2 genes. These findings suggest that TE loci may serve as context-dependent prognostic markers in BC.

Outcome-associated TE loci in cancer
To explore the potential biological function of these TE loci, we performed gene expression analysis and transcription factor motif enrichment analysis on the regions. Two loci, particularly enriched in survival-associated motifs, were selected for further investigation. The first locus, located in an intron of the SLC1A1 gene (L1PA4 family, chr9:4,531,071–4,537,200 (+)), exhibited high accessibility in patient HT214B1-S1H2 and was associated with key genes such as SLC2A1, RNU5F-1, UQCRH, and NSUN4. Motif analysis revealed significant enrichment for the transcription factors WT1, ZBTB7A, and the SP family, which are known to play regulatory roles in BC progression. Patients with reduced chromatin accessibility in this region had poorer survival outcomes (log-rank P-value = 0.031). The second locus, within the ST8SIA1 gene (L1PA7 family, chr12:22,147,464–22,152,713 (+)), was notably active in patient HT029B1-S1PC. It showed associations with genes including LINC01761, NGF, PHGDH, and NOTCH2, and was enriched for motifs of the FOXA family, KLF5, and ZFX transcription factors, which are involved in cancer-related pathways. Consistently, lower accessibility at this locus correlated with reduced survival (log-rank P-value = 0.032). These findings suggest that these TE loci may harbor regulatory elements linked to BC progression.
Collectively, these results indicate that TE-locus accessibility varies across BC patients and is associated with clinical outcome. Notably, the direction of association differed by locus: for four loci, reduced chromatin accessibility was associated with better survival, whereas for two loci the opposite trend was observed. These analyses are correlative and do not establish a causal or mechanistic role for individual TE loci; functional follow-up will be required to determine whether any locus directly contributes to tumor progression or suppression.

Discussion

Here we present scTELL, a bioinformatics tool for locus-specific analysis of TEs at single-cell resolution, and apply it to investigate TE accessibility across cell types and states. scTELL adds precise locus-specific TE analysis to the landscape of chromatin accessibility analyses, a capability not provided by existing single-cell ATAC-seq tools such as Signac [17] and ArchR [18]. While these tools excel at gene-centric analyses, they offer no dedicated means of studying the regulatory potential of individual TE loci. In particular, TE accessibility in scATAC-seq studies has often been summarized at the family or subfamily level, leaving a methodological gap for systematic insertion-level analyses of individual TE loci. scTELL addresses this gap by enabling locus-level analysis of TE-associated chromatin accessibility and its associations with phenotypic diversity and disease states.
One of the most significant strengths of scTELL lies in its seamless compatibility with widely used single-cell ATAC-seq pipelines such as Signac [17] and ArchR [18]. This ensures that researchers already working with these tools can easily integrate scTELL into their workflows without needing to start their analyses from raw data formats such as BAM or FASTQ files. scTELL directly utilizes pre-processed datasets, leveraging existing Signac objects or ArchR projects, which reduces computational burden and makes it accessible to a broad range of researchers. Consistent with this practicality, benchmarking in PBMC scATAC-seq showed that scTELL-derived TE locus matrices support cell-type separation with performance comparable to established gene activity scoring approaches.
Our findings provide further evidence of the potential clinical and biological utility of scTELL. Specifically, motif enrichment analyses around TE-associated accessible regions revealed an additional layer of regulatory interpretation, including family-level motif signatures, within-family locus heterogeneity across cell types, and motifs enriched in TE-associated peaks relative to gene promoters. In cancer datasets, we identified TE loci associated with tumor-specific activity and clinical outcomes. In ccRCC, we identified an L1PA2 locus with tumor-specific activity whose higher accessibility was associated with shorter PFI. In BC, we identified survival-associated TE loci with specific chromatin accessibility patterns. These results highlight the association between TE activity and tumor heterogeneity and illustrate scTELL’s ability to detect candidate biomarkers. Importantly, these clinical associations are correlative, and further prospective and functional validation will be required to determine whether individual TE loci play causal regulatory roles in disease progression.
Several limitations should be noted. First, scTELL relies on uniquely mapped reads, which may underrepresent the youngest TE subfamilies with high sequence similarity. Second, our analyses focused on longer, structurally intact elements; shorter regulatory sequences such as solo LTRs warrant future investigation. Finally, integration with matched scRNA-seq data would enable direct assessment of whether TE accessibility correlates with TE transcription and/or nearby gene expression at single-cell resolution.

Conclusions

In conclusion, scTELL provides a robust and accessible framework for investigating locus-specific TE-associated chromatin accessibility from single-cell ATAC-seq data, addressing a key limitation of existing chromatin accessibility analyses that largely rely on family-level summaries. scTELL offers new perspectives on epigenetic heterogeneity across cell types and cancer states by enabling quantitative measurement of TE accessibility at individual insertion sites and the identification of cell type- and state-associated TE loci. In this study, we show that selected TE loci exhibit reproducible patterns across datasets and can be linked to tumor-specific accessibility and patient outcome associations (e.g., an L1PA2 locus in ccRCC and survival-associated loci in BC), suggesting their potential as candidate biomarkers. In addition, our evaluation of the scTELL scoring approach, TF motif enrichment analyses of TE-associated accessible regions, and bulk ATAC-seq validation support the utility of scTELL for interpreting TE-linked regulatory landscapes. Because these clinical associations are correlative, prospective validation and functional studies will be required to establish clinical utility or therapeutic relevance. Overall, scTELL complements existing single-cell ATAC-seq workflows and provides a practical platform to advance studies of TE-mediated regulation in cancer and other biological contexts.

Supplementary Material


Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.
