
Large-Scale T-cell Receptor Repertoire Profiling Unveils Tumor-Specific Signals for Diagnosing Indeterminate Pulmonary Nodules.

Cancer Research, 2025, Vol. 85(24), pp. 5141-5160

Luo H, Guo W, Luan X, Yue T, Yu S, Yin X

Cite this article

APA Luo H, Guo W, et al. (2025). Large-Scale T-cell Receptor Repertoire Profiling Unveils Tumor-Specific Signals for Diagnosing Indeterminate Pulmonary Nodules. Cancer Research, 85(24), 5141-5160. https://doi.org/10.1158/0008-5472.CAN-25-1407
MLA Luo H, et al. "Large-Scale T-cell Receptor Repertoire Profiling Unveils Tumor-Specific Signals for Diagnosing Indeterminate Pulmonary Nodules." Cancer Research, vol. 85, no. 24, 2025, pp. 5141-5160.
PMID 41150899

Abstract

Indeterminate pulmonary nodules (IPN) are increasingly detected due to increasing health awareness and widespread lung cancer screening, yet distinguishing benign from malignant nodules remains a critical challenge. Emerging evidence suggests that recognizing cancer-associated immune signatures represents a powerful approach for early-stage cancer detection. This study explored the clinical utility of T-cell receptor (TCR) repertoire analysis in IPN evaluation. By conducting large-scale TCR sequencing (6,059 blood and 988 tumor samples), we established LungTCR (https://www.lungtcr.com/), a comprehensive TCR repertoire database, and proposed a method for the quantitative assessment of tumor-related immune responses. LungTCR was leveraged to develop TCRnodseek plus, a diagnostic model integrating clinical data, CT imaging, and TCR features. A multicenter prospective study (ChiCTR2200055761) involving 1,107 patients with IPN validated the superior diagnostic performance of TCRnodseek plus over existing approaches. Mechanistic analyses revealed that the identified lung cancer-related TCR clones are enriched in non-small cell lung cancer and are predominantly present in malignant nodules and tumor tissues. This study provides a robust TCR database and an advanced diagnostic model, offering a framework for precise IPN differentiation.

Significance: Construction of the largest TCR database of lung nodules enabled identification of lung cancer-specific TCR sequences and development of an advanced machine learning model to distinguish benign from malignant pulmonary nodules. This article is part of a special series: Driving Cancer Discoveries with Computational Research, Data Science, and Machine Learning/AI.


Introduction
Indeterminate pulmonary nodules (IPN) pose a significant diagnostic challenge for clinicians due to the absence of definitive radiologic features distinguishing benign from malignant lesions (1). Low-dose computed tomography is widely used for pulmonary nodule detection and lung cancer screening; however, its clinical utility is limited by a high false-positive rate. Notably, more than 96% of positive findings in low-dose computed tomography–based lung cancer screening are ultimately benign, leading to unnecessary follow-up in approximately 72% of patients (2). Among these cases, 18.5% of patients undergo surgery or other invasive procedures, resulting in overtreatment (3). Incidental CT scans reveal that 30% of pulmonary nodules are diagnosed as IPNs (4), yet only 3.7% of these cases are ultimately malignant (5). A substantial proportion (12%–50%) of surgically resected IPNs is later confirmed to be benign (6, 7). Collectively, IPNs are becoming more common due to increased health awareness and widespread lung cancer screening. The regular reexamination and unnecessary surgeries caused by IPNs result in the waste of healthcare resources, unnecessary radiation exposure, and increased psychologic stress. Therefore, improving the diagnostic accuracy of IPNs is critical to reducing overtreatment and optimizing clinical decision-making.
Currently, numerous studies have investigated tumor-associated markers from various biological sources, such as airway epithelial cells, exhaled breath, bronchoalveolar lavage fluid, sputum, blood, saliva, and urine to achieve lung nodule diagnosis (8). However, accurately distinguishing IPNs remains a significant challenge. We have also explored multiple approaches to enhance the differential diagnosis of IPNs (9–13), but this critical clinical challenge remains unresolved, highlighting the urgent need for more effective and reliable diagnostic strategies.
Tumor initiation and progression are considered to result from complex interactions between malignant cells and the immune system (14). The degree of immune cell infiltration, functional status, and spatial distribution in the tumor microenvironment are critical factors influencing tumor development and patient prognosis (15). Multiple studies have demonstrated that lung cancer tissue exhibits increased T-cell clonality and distinct transcriptional profiles compared with normal lung tissue (16, 17). During cancer progression, the tumor-infiltrating T cells recognize tumor-specific antigens, leading to the clonal expansion of T cells with unique T-cell receptor (TCR) sequences (18). These tumor-reactive T cells can be detected in peripheral blood, providing a noninvasive means to monitor tumor-specific immune responses (19). The application of TCR repertoire profiling in peripheral blood for lung cancer screening has emerged as a promising and innovative strategy. Compared with ctDNA, which often suffers from low abundance and high background noise in early-stage cancers, TCR-based approaches benefit from the immune system’s ability to amplify tumor-associated signals through clonal expansion. This results in a higher signal-to-noise ratio, making immune repertoire profiling particularly advantageous for detecting early-stage cancers.
Additionally, our previous work demonstrated an association between TCR immune repertoire characteristics and clinical features of lung nodules (9, 20). We developed the TCRnodseek model for IPN discrimination, achieving a positive predictive value (PPV) of 0.95 and an AUC of 0.8 in an independent validation cohort. This approach offers advantages such as minimal sample requirements, robust target stability, high interpretability, and superior diagnostic accuracy compared with existing methods, making TCR-based detection a valuable tool for precise lung nodule management. However, limitations in our prior study, including a single-center design, limited sample size, and lack of a healthy reference cohort, highlight the need for further refinement.
Emerging evidence underscores the potential of TCR repertoire analysis for early-stage lung cancer detection through cancer-associated immune signatures. For example, one study explored blood TCR sequencing (TCR-seq) to develop tumor immune lymphocyte scores, demonstrating predictive value in lung cancer screening cohorts but without integrating clinical or imaging data for comprehensive IPN evaluation (21). Another investigation leveraged TCR repertoire functional units to enhance liquid biopsy sensitivity for early-stage lung cancer, yet it lacked multicenter validation and broader clinical applicability (22). Additionally, research on multiple synchronous lung cancers using single-cell TCR-seq revealed diverse TCR profiles across disease stages but did not focus on scalable diagnostic models for IPN differentiation (23). These studies, while promising, are constrained by factors such as single-center designs, small cohorts, and limited integration with multimodal clinical data. Furthermore, there remains a gap in comprehensive, publicly accessible TCR databases and diagnostic models that combine TCR features with clinical and imaging data for robust IPN differentiation across diverse populations.
Here, we address these gaps by establishing LungTCR, a comprehensive TCR repertoire database derived from 6,059 blood and 988 tumor samples, and introducing TCRnodseek plus, an advanced diagnostic model integrating TCR features, clinical data, and CT imaging. Validated in a multicenter prospective study (ChiCTR2200055761) with 1,107 patients with IPN, our approach demonstrates superior diagnostic performance, enhanced interpretability, and mechanistic insights into lung cancer-related TCR clones, offering a novel framework for precise IPN differentiation and advancing the clinical management of lung nodules.

Materials and Methods

Study design, participants, and sample collection
All experimental plans and study protocols were submitted for ethical review and approval by the respective ethics and licensing committees of the participating hospitals prior to the initiation of the clinical study. Approval was obtained from the corresponding Institutional Review Boards of each hospital. Informed written consent was obtained from all participants prior to enrollment and sample collection in accordance with the Declaration of Helsinki, ensuring they were fully informed about the purpose of sample collection and how their test results would be used. All experiments, methodologies, procedures, and personnel training were carried out in accordance with the relevant guidelines and regulations of participating hospitals and laboratories.
To identify key TCR features associated with lung cancer and develop a lung cancer-associated quantitative assessment approach, a total of 7,047 samples were collected from the Shenzhen Institutes of Advanced Technology. These samples comprised 2,699 blood samples from asymptomatic healthy individuals and 4,348 lung cancer samples, including 3,360 blood samples; 709 formalin-fixed, paraffin-embedded tissue samples; and 279 fresh tissue samples. For the construction of a diagnostic model for indeterminate lung nodules (IPNs), an additional 1,107 blood samples were collected, with 894 samples from Sichuan and 213 samples from Shenzhen. This cohort consisted of 254 benign cases and 853 malignant cases. The study population was further divided into three groups: the training group (754 samples), the testing group (134 samples), and an independent validation group (218 samples). Among the subjects with malignancy, 96% were classified as Tis or I stage lung cancer, indicating that the study primarily focused on early-stage lung cancer diagnosis (Table 1).

TCR library preparation and TCR-seq
TCRβ-chain sequencing was performed on DNA extracted from tumor and peripheral blood mononuclear cell (PBMC) samples, following a multiplex PCR (mPCR) approach with optimized panels of V and J primers. mPCR amplification of the CDR3 regions of the TCRβ-chain was performed using a two-step PCR pipeline (Supplementary Fig. S1A). In brief, approximately 0.5 µg of template DNA was amplified with a designed set of 42 V forward and 14 J reverse primers to generate amplicons. Each primer consists of a specific sequence targeting different V or J alleles and a universal sequence for the second round of PCR. Next, universal primers with indexed adapters were utilized to construct the TCR-seq library. Each sample library was sequenced on the Illumina NovaSeq platform (RRID: SCR_016387), generating an average of 2 Gb of raw data per sample.

TCR repertoire analysis
Raw FASTQ data were trimmed and processed using FASTP 0.19.7 software (RRID: SCR_016962; ref. 24) to remove adapters and low-quality reads, with parameters -t 1 -T 1. Seqkit 2.3.0 (RRID: SCR_018926; ref. 25) was used to extract reads containing primers used in mPCR amplification with the parameters grep -s -i -d. Paired-end reads containing primers were merged into full-length sequences using Flash 1.2.11 (RRID: SCR_005531; ref. 26), with the following parameters: -m 10 -M 150 -p 33 -r 150 -x 0.1. TCR repertoires were profiled from merged sequences using MiXCR software 3.0.4 (RRID: SCR_018725; ref. 27) by aligning against human TCRβ gene segments in the IMGT database (RRID: SCR_012780; https://www.imgt.org), and assembled TCR clonotype data were exported. VDJtools 1.2.1 (28) was used to convert the TCR clonotype files to a VDJtools-compatible format. For quality control, TCR data were retained for further analysis only if they met all of the following criteria: raw data yield ≥2 Gb, percentage of reads containing V/J primers ≥85%, and percentage of CDR3-aligned reads among all sequenced reads ≥70%. A computational framework named TCRFeature was developed for this study.
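The three quality control thresholds above can be expressed as a simple per-sample check. The following is a minimal sketch, not the authors' TCRFeature code; the `SampleQC` record and its field names are hypothetical and assume the three metrics have already been computed upstream.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical per-sample QC record (field names are assumptions)."""
    raw_yield_gb: float           # raw sequencing yield in Gb
    primer_read_fraction: float   # fraction of reads containing V/J primers (0-1)
    cdr3_aligned_fraction: float  # fraction of all reads aligned to a CDR3 (0-1)


def passes_qc(qc: SampleQC) -> bool:
    """Return True only if all three thresholds stated in the text are met."""
    return (
        qc.raw_yield_gb >= 2.0
        and qc.primer_read_fraction >= 0.85
        and qc.cdr3_aligned_fraction >= 0.70
    )
```

A sample failing any single criterion is dropped, which matches the "all criteria" reading of the text.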

Validation of PCR amplification bias in TCR library construction
To evaluate and control for potential PCR amplification bias during TCRβ-chain library construction, we conducted two validation experiments:

PCR cycle consistency test
Genomic DNA was extracted from randomly selected peripheral blood samples and subjected to mPCR amplification using the same T-cell receptor beta (TRB) locus primer panel (42 V-gene forward primers and 14 J-gene reverse primers) used in this study. The first-round PCR was run for 24 cycles, and the resulting product was divided into three equal aliquots. Two aliquots were further amplified for an additional three and six cycles, resulting in three products with 24, 27, and 30 total PCR cycles. All three amplified products were subjected to sequencing using the standard library preparation pipeline. The V–J gene segment usage in each sample was quantified and compared across conditions. The Pearson correlation coefficients (R² > 0.97) between the three groups indicated that different amplification cycles did not introduce significant bias in V/J gene representation (Supplementary Fig. S1B and S1C).

Primer efficiency assessment using synthetic TCR templates
We designed 48 synthetic TCR DNA templates representing 12 distinct V genes and 4 J genes (48 V–J gene combinations). Each template included a unique 16 bp barcode and two universal marker sequences to allow identification after sequencing. These synthetic sequences were mixed at equimolar concentrations to form a pooled template mixture. The synthetic pool was used as the input for library construction and subjected to the same 30-cycle mPCR amplification and sequencing process. The relative abundance of each template in the sequencing output was compared with its expected frequency. Results showed low variability among different V–J combinations, indicating minimal primer-specific amplification bias (Supplementary Fig. S1D).

TCR repertoire diversity measurement
The diversity reflects the frequency and distribution of distinct T-cell clonotypes. Here, we utilized several diversity indices to quantify the TCR repertoire, including the Shannon–Wiener index, Inverse Simpson index, D50 index, Evenness (Pielou) index, Chao1 estimate, and Efron–Thisted estimate. To control for differences in library size across samples, we performed downsampling prior to diversity analysis. Specifically, TCR repertoire diversity metrics were calculated using VDJtools with uniform downsampling to 1,000,000 reads per sample. This was repeated across 10 independent iterations, and averaged values were used to ensure robustness. Unless otherwise stated, the above diversity indices were calculated using the CalcDiversityStats function in VDJtools, with the parameter -x 1000000. The TCR clonality was estimated as 1 − Pielou index, providing a measure of clonal expansion within the TCR repertoire:

$$\text{Clonality} = 1 - \frac{-\sum_{i=1}^{N} p_i \ln p_i}{\ln N}$$

where $p_i$ is the frequency of clone $i$ and $N$ is the total number of clones.
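The clonality definition (1 − Pielou index) can be illustrated directly from clone read counts. This is a minimal sketch assuming the standard Pielou evenness (Shannon entropy divided by the logarithm of the number of clones); it does not reproduce the VDJtools downsampling step, and the function names are illustrative.

```python
import math


def shannon_entropy(freqs):
    """Shannon entropy (natural log) of a list of clone frequencies."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


def clonality(read_counts):
    """Clonality = 1 - Pielou evenness, computed from per-clone read counts."""
    counts = [c for c in read_counts if c > 0]
    n_clones = len(counts)
    if n_clones < 2:
        raise ValueError("at least two clones are needed to define evenness")
    total = sum(counts)
    freqs = [c / total for c in counts]
    pielou = shannon_entropy(freqs) / math.log(n_clones)
    return 1.0 - pielou
```

A perfectly even repertoire gives a clonality near 0, while a repertoire dominated by one expanded clone approaches 1.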

Definition of nonproductive TCR and convergence
CDR3 amino acid (CDR3aa) sequences containing stop codons or frameshifts were defined as nonproductive (nonfunctional), as they cannot form a functional TCR. A convergent TCR was defined as a TCR clonotype with identical CDR3aa sequences but differing CDR3 nucleotide sequences, indicating convergent recombination events. The convergence features were computed from the following quantities (the equations themselves are not preserved in this text): C is the number of convergent clones, N is the total number of observed clones in a TCR repertoire, p_i is the frequency of clone i, and n_i is the read count of clone i.
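The definition of a convergent TCR (one CDR3aa sequence encoded by more than one CDR3 nucleotide sequence) can be sketched as follows. The exact convergence equations are not available in this text, so the ratio C/N returned here is only an illustrative assumption, as are the function names and the `(cdr3nt, cdr3aa, reads)` tuple layout.

```python
from collections import defaultdict


def convergent_cdr3aa(clones):
    """Return CDR3aa sequences encoded by more than one distinct CDR3nt.

    clones: iterable of (cdr3nt, cdr3aa, reads) tuples (assumed layout).
    """
    nt_by_aa = defaultdict(set)
    for cdr3nt, cdr3aa, _reads in clones:
        nt_by_aa[cdr3aa].add(cdr3nt)
    return {aa for aa, nts in nt_by_aa.items() if len(nts) > 1}


def convergence_ratio(clones):
    """Illustrative C/N: share of nucleotide clonotypes that are convergent."""
    clones = list(clones)
    if not clones:
        return 0.0
    convergent = convergent_cdr3aa(clones)
    c = sum(1 for _nt, aa, _reads in clones if aa in convergent)
    return c / len(clones)
```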

TCR V/J gene abundance and frequency group of TCR clonotypes
V gene abundance for a specific V gene x and J gene abundance for a specific J gene y were calculated per sample (the equations are not preserved in this text).
After excluding nonproductive TCR sequences, the frequency of remaining TCR sequences was calculated prior to analysis. To facilitate characterization, TCR sequences were classified into frequency groups based on the following criteria: expanded group: frequency ≥ 1E−2; large group: frequency between 1E−2 and 1E−3; medium group: frequency between 1E−3 and 1E−5; small group: frequency between 1E−5 and 1E−6; and rare group: frequency ≤1E−6. For each TCR frequency group, the total sequence frequency was computed by summing the frequencies of all sequences within that group. This classification scheme provides insights into the distribution characteristics of TCR clones, enabling a deeper understanding of clonal expansion within the immune microenvironment.
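The frequency grouping above can be sketched as a small classifier. The text does not say which group owns each shared boundary except that the expanded group includes 1E−2 and the rare group includes 1E−6, so the remaining boundary handling below is an assumption, as are the function names.

```python
GROUPS = ("expanded", "large", "medium", "small", "rare")


def frequency_group(freq):
    """Assign a clone frequency to one of the five groups in the text.

    Assumption: each lower bound is inclusive, except that a frequency of
    exactly 1e-6 falls in the rare group, as stated in the text.
    """
    if freq >= 1e-2:
        return "expanded"
    if freq >= 1e-3:
        return "large"
    if freq >= 1e-5:
        return "medium"
    if freq > 1e-6:
        return "small"
    return "rare"


def group_frequency_totals(freqs):
    """Sum the frequencies of all productive clones within each group."""
    totals = {group: 0.0 for group in GROUPS}
    for freq in freqs:
        totals[frequency_group(freq)] += freq
    return totals
```

The five totals form one block of repertoire features per sample.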

Length group of TCR clonotypes and amino acid composition
Nonproductive TCR sequences and TCR clones with reads count <10 in each sample were removed prior to analysis. The remaining TCR sequences were then categorized into length-based groups according to their CDR3aa length: group A: ≤10 amino acids, group B: 11 to 13 amino acids, group C: 14 to 16 amino acids, group D: 17 to 19 amino acids, and group E: ≥20 amino acids. For each TCR length group, the total sequence frequency was determined by summing the frequencies of all sequences within that group.
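The length grouping can be sketched in the same way. This sketch assumes nonproductive sequences have already been removed and that frequencies are recomputed over the clones that survive the read-count filter; neither detail is stated in the text, and the function names are illustrative.

```python
def length_group(cdr3aa):
    """Map a CDR3aa sequence to length group A-E as defined in the text."""
    n = len(cdr3aa)
    if n <= 10:
        return "A"
    if n <= 13:
        return "B"
    if n <= 16:
        return "C"
    if n <= 19:
        return "D"
    return "E"


def length_group_totals(clones, min_reads=10):
    """Total frequency per length group.

    clones: iterable of (cdr3aa, reads) pairs; clones with fewer than
    min_reads reads are dropped before frequencies are computed (assumption).
    """
    kept = [(aa, reads) for aa, reads in clones if reads >= min_reads]
    total_reads = sum(reads for _aa, reads in kept)
    totals = {group: 0.0 for group in "ABCDE"}
    for aa, reads in kept:
        totals[length_group(aa)] += reads / total_reads
    return totals
```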
For amino acid composition features, three amino acids from both ends of the CDR3aa sequence were removed, and the adjusted abundance S_a for amino acid a was then calculated (the equation is not preserved in this text) from the following quantities: k_i is the occurrence number of amino acid a in clone i, l_i is the CDR3aa length of clone i, A is the number of clones that contain amino acid a, N is the total number of observed clones, and n_i is the read count of clone i.
Due to the highly skewed distribution of TCR clone frequencies, we applied a natural logarithmic transformation to count-derived features (e.g., gene usage, amino acid composition, enrichment scoring). This not only stabilizes variance but also enhances the detection of low-frequency, antigen-driven TCRs that may be biologically important.

Identify lung cancer–enriched CDR3 sequences
To identify TCR sequences enriched in lung cancer, TCR sequences from 988 tumor tissue samples, 3,360 blood samples from patients with lung cancer, and 2,699 blood samples from healthy individuals were analyzed. TCR sequences from the blood of healthy individuals served as the control group. The following steps were applied to generate enriched CDR3 sequences for the lung cancer tumor (LCT) and lung cancer blood (LCB) groups: (i) Filtering TCR clonotypes with low reads: CDR3 clones with a read count of less than 10 in each sample were removed. (ii) Aggregating CDR3 sequences within groups: for the control group, LCT group, and LCB group, all filtered TCR sequences from the samples in the respective group were combined. The number of occurrences (database_sequence_count, S) and a scoring metric (Database_Support_Score, DSS) were calculated for each unique TCR sequence. For a given CDR3 sequence i in the database, let S represent the number of samples in which CDR3 sequence i occurs and n_ij represent the read count of sequence i in sample j; the DSS of sequence i is calculated from these quantities (the equation is not preserved in this text).
(iii) Generating control group TCR set by selecting CDR3 sequences that are present in more than five samples of the control group (healthy individuals). (iv) Identifying lung cancer–enriched CDR3 sequences: For both the LCT group and LCB group, CDR3 sequences present in the control group TCR set were removed. Additionally, CDR3 sequences with occurrences of less than k samples within the lung cancer group were removed. (v) The threshold k is determined according to the total number of remaining TCR clonotypes. In this study, a cutoff of 4,000 TCR clonotypes was used to define k. (vi) To identify lung cancer–enriched TCR sequences, we varied the occurrence threshold (k) and plotted the number of sequences retained. The value of k was selected at the point at which the number of sequences began to plateau (k = 15 for LCT and k = 17 for LCB), resulting in approximately 3,600 to 3,800 enriched sequences. This choice ensured balanced sequence size while avoiding inflation of false-positive signals in downstream enrichment score calculations. As a result, 3,652 CDR3 sequences were identified as enriched in lung cancer tissue, and 3,840 CDR3 sequences were identified as enriched in LCB. (vii) A similar approach was employed to identify CDR3 sequences enriched in EGFR- and KRAS-mutant LCB groups. (viii) CDR3 sequences detected in ≥150 samples within the control group were classified as healthy public CDR3 sequences (healthy public). Additionally, 3,844 CDR3 sequences were identified as enriched in healthy blood, using a threshold k = 74 when compared with lung cancer samples.
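Steps (i), (iii), and (iv) of the enrichment procedure can be sketched as set operations over per-sample clone tables. This is a simplified illustration rather than the LungTCR pipeline: it omits the DSS score and the plateau-based choice of k, and the data layout (one `dict` of CDR3aa to read count per sample) and function names are assumptions.

```python
from collections import Counter


def sample_occurrence(samples, min_reads=10):
    """Count in how many samples each CDR3 occurs after the read filter.

    samples: list of dicts mapping CDR3aa -> read count (assumed layout).
    """
    occurrence = Counter()
    for clones in samples:
        occurrence.update(aa for aa, reads in clones.items() if reads >= min_reads)
    return occurrence


def enriched_cdr3(case_samples, control_samples, k,
                  control_min_samples=5, min_reads=10):
    """CDR3s seen in at least k case samples and absent from the control set.

    The control set holds CDR3s present in more than control_min_samples
    healthy samples, following step (iii) in the text.
    """
    control_occ = sample_occurrence(control_samples, min_reads)
    control_set = {aa for aa, s in control_occ.items() if s > control_min_samples}
    case_occ = sample_occurrence(case_samples, min_reads)
    return {aa for aa, s in case_occ.items() if s >= k and aa not in control_set}
```

In the study, k was chosen where the number of retained sequences began to plateau (k = 15 for LCT and k = 17 for LCB); here it is simply a parameter.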

Lung cancer–enriched TCR score calculation
To compute the lung cancer enrichment TCR score (LCS), the input TCR sequences were compared with lung cancer–enriched sequences if they had the same length. The lung cancer–enriched score was computed through the following steps: Step 1: For each query sequence, we identified lung cancer–enriched sequences of equal length (n). The edit distance between each query sequence and database sequence was calculated. Only sequence pairs meeting the criterion edit_distance ≤0.3 × n were retained. Step 2: For retained sequence pairs from step 1, an alignment score (AS) was computed using the Needleman–Wunsch algorithm with the following parameters: open penalty: −10, extend penalty: −1, and amino acid substitution scoring matrix: blosum62. Sequence pairs were retained if their AS ≥4.5 × n. The AS was calculated by Parasail (RRID: SCR_021805; ref. 29). Step 3: The remaining query sequences were defined as matched sequences. We sum the number of matched sequences as M in the query sample and calculate the features using the following formulas:
The query_seq_score (QSS) for sequence j is calculated from n_j, the read count of clone j (the equation is not preserved in this text).
The lung cancer–enriched TCR clones fraction is calculated for each sample (the equation is not preserved in this text).
The LCS for each sample is calculated from the following quantities (the equation is not preserved in this text): Q is the number of observed clones in the query sample, M is the number of matched clones in the query sample, and n_j is the read count of clone j in the sample.
We developed several subgroup-specialized scores for different biological contexts:
LCT score (LCTS) refers to the enrichment score by aligning against the lung cancer tissue–enriched TCR dataset.
LCB score (LCBS) refers to the enrichment score by aligning against the LCB-enriched TCR dataset.
EGFR score refers to the enrichment score by aligning against the EGFR-mutant LCB-enriched TCR dataset (specifically detects EGFR mutant–associated TCR signatures).
KRAS score refers to the enrichment score by aligning against the KRAS-mutant LCB-enriched TCR dataset (specifically detects KRAS mutant–associated TCR signatures).
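Step 1 of the score calculation (same-length candidates within an edit distance of 0.3 × n) can be sketched in plain Python. This illustration covers only the edit-distance prefilter; the Needleman–Wunsch alignment score of step 2, which the study computed with Parasail, is not reproduced here, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def candidate_matches(query, enriched, max_fraction=0.3):
    """Enriched sequences of equal length within max_fraction * n edits."""
    n = len(query)
    return [
        ref for ref in enriched
        if len(ref) == n and edit_distance(query, ref) <= max_fraction * n
    ]
```

Restricting the comparison to equal-length sequences keeps the candidate list small before the more expensive alignment step.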

The Uniform Manifold Approximation and Projection using TCR features
For Uniform Manifold Approximation and Projection (UMAP) visualization, each repertoire was represented as a 123-dimensional feature vector combining diversity indices (n = 5), V/J gene usages (n = 75 TRB genes), CDR3aa properties (n = 20 residues), CDR3 length distributions (n = 5), clone frequency distributions (n = 5), TCR convergence (n = 3), and LCT/LCB enrichment scores/fractions (n = 10). All features were z-score normalized prior to dimensionality reduction.

Feature selection and model construction
The discovery dataset was split into training and testing sets using the scikit-learn (RRID: SCR_002577) train_test_split function, with a test size of 20%. Stratification was applied to ensure a balanced distribution of benign and malignant samples. The training dataset was used for feature selection and model training, both of which were facilitated by the caret (RRID: SCR_021138) machine learning framework (version 6.0-92).
We employed two complementary methods for feature selection: ensemble feature selection (30) and Boruta (RRID: SCR_016234) feature selection (31). Ensemble feature selection combined the results from three feature selection algorithms: glmnet (generalized linear model with elastic net regularization, RRID: SCR_015505), random forest (RRID: SCR_015718), and recursive feature elimination. Each algorithm generated feature importance reflecting a feature’s contribution to the model’s predictive power. These importance values were normalized to a maximum value of 100 and then aggregated. The top 30 features with the highest cumulative importance were selected. For Boruta, features classified as “confirmed” or “tentative” by Boruta were selected. The final selected features consisted of the combined features selected from both methods.
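The aggregation step of the ensemble feature selection (scale each method's importances to a maximum of 100, sum across methods, keep the top 30) can be sketched as follows. The study used the caret framework in R; this Python sketch only illustrates the arithmetic, and the function names, input layout, and tie-breaking rule are assumptions.

```python
def ensemble_top_features(importances_by_method, top_n=30):
    """Rank features by summed, max-normalized importance across methods.

    importances_by_method: dict of method name -> {feature: raw importance}.
    """
    combined = {}
    for scores in importances_by_method.values():
        if not scores:
            continue
        peak = max(scores.values())
        for feature, value in scores.items():
            scaled = 100.0 * value / peak if peak > 0 else 0.0
            combined[feature] = combined.get(feature, 0.0) + scaled
    # Ties are broken alphabetically (an arbitrary choice for this sketch).
    ranked = sorted(combined, key=lambda f: (-combined[f], f))
    return ranked[:top_n]


def final_feature_set(ensemble_features, boruta_features):
    """Union of the ensemble selection and Boruta confirmed/tentative features."""
    return sorted(set(ensemble_features) | set(boruta_features))
```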
Prior to training, selected features underwent preprocessing using caret, including centering, scaling, and removal of highly correlated features. A fivefold cross-validation with five repeats was implemented for model training to ensure robustness. We evaluated multiple machine learning algorithms, including random forest (RRID: SCR_015718), gradient boosting machine (RRID: SCR_017301), glmnet (RRID: SCR_015505), linear support vector machine (SVM), and radial SVM. The area under the ROC curve was used as the primary performance metric. Grid search was utilized for hyperparameter optimization. The predicted probability of malignancy was generated as the prediction score. Confidence intervals (CI) for the ROC curve were calculated using the bootstrap method. To address the class imbalance in the original dataset, the synthetic minority oversampling technique (SMOTE; ref. 32) was applied to the training data.

Mutation profile and tumor mutational burden calculation
To characterize the mutation profile and calculate the tumor mutational burden (TMB), whole-exome sequencing (WES) was conducted on DNA extracted from lung cancer tissue specimens. The hybridization-based target enrichment panel, HaploX WESPlus panel, was employed for identifying coding variants. Raw data processing, sequence alignment, and variant calling were conducted following an established pipeline previously described by Zhao and colleagues (33). TMB was calculated as the number of nonsynonymous mutations per megabase of the exonic coding regions.

HLA evolutionary divergence
HLA genotyping was also determined by WES, and xHLA (34) was used for HLA allele typing. HLA evolutionary divergence was calculated by midasHLA (35), with the average HLA evolutionary divergence of HLA-A, HLA-B, and HLA-C loci used as the final HLA evolutionary divergence score.

PD-L1 expression
PD-L1 expression was determined by IHC on formalin-fixed, paraffin-embedded tissue sections, following the protocol outlined by Wu and colleagues (36). PD-L1 expression levels were categorized into three groups based on the tumor proportion score (TPS): negative (TPS ≤ 1%), low expression (1% < TPS ≤ 50%), and high expression (TPS > 50%).
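The three TPS categories can be written as a small helper. This is a minimal sketch using the cutoffs exactly as stated above; the function name is hypothetical and TPS is assumed to be given as a percentage.

```python
def pdl1_category(tps_percent):
    """Classify PD-L1 expression from the tumor proportion score (percent)."""
    if not 0 <= tps_percent <= 100:
        raise ValueError("TPS must be a percentage between 0 and 100")
    if tps_percent <= 1:
        return "negative"
    if tps_percent <= 50:
        return "low"
    return "high"
```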

Independent dataset to validate different scoring methods in LungTCR
This study utilized four distinct datasets, including non–small cell lung cancer (NSCLC), healthy controls, COVID-19, and other cancer types. The NSCLC dataset consisted of 382 samples, including 132 PBMC samples and 250 tumor tissue samples, obtained from NCBI Sequence Read Archive (SRA) raw data and multiple studies in the immuneACCESS repository. The COVID-19 dataset included 1,034 PBMC samples obtained from the immuneACCESS repository (https://doi.org/10.21417/ADPT2020COVID). The healthy control dataset consisted of 195 PBMC samples. The other cancer dataset contained 118 PBMC samples and 15 tumor samples from five different cancer types, including thyroid cancer, melanoma, breast cancer, classical Hodgkin lymphoma, and head and neck cancers. For NCBI SRA data, samples were downloaded using the SRA-Toolkit, and SRA files were converted to FASTQ format using fasterq-dump. The FASTQ files were then processed using Fastp for quality control, removing low-quality reads to minimize sequencing errors in TCR sequences. A quality threshold of Q25 was applied. Finally, TCR sequences were extracted using the MiXCR software. For immuneACCESS data, TCR sequence files were directly downloaded from the repository and processed.
All TCR sequences obtained from NCBI and immuneACCESS underwent further processing to ensure high authenticity and quality: Only sequences containing V and J genes, CDR3 sequence, and nucleotide sequence were retained. Only sequences with a CDR3 length between 7 and 25 were kept (representing 99.9% of the sequences). Only CDR3 sequences that start with a cysteine and end with a phenylalanine or tryptophan and do not contain any stop codons were retained, according to the international ImMunoGeneTics information system (IMGT) standards. Duplicated TCR sequences with identical CDR3 nucleotide sequences, V gene, and J gene were merged, and their frequencies were summed. Different alleles within the same TRBV/J gene were merged, and only the TRBV/J gene was retained. The dataset was reformatted with standardized column names: “count” (TCR clone count), “freq” (TCR clone frequency), “cdr3nt” (CDR3 nucleotide sequence), “cdr3aa” (CDR3aa sequence), “v,” “d,” and “j.” This comprehensive data processing pipeline ensured high-quality, standardized TCR sequence data across the various datasets, allowing for consistent and reliable downstream analyses.
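The filtering and merging rules above can be sketched as a small standardization routine. This is an illustration under stated assumptions, not the authors' pipeline: stop codons are assumed to appear as `*` in the CDR3aa string, alleles are assumed to follow the `GENE*NN` naming pattern, and frequencies are recomputed from merged counts rather than summed from the input. The function names are hypothetical.

```python
from collections import defaultdict


def strip_allele(gene):
    """Drop the allele suffix, e.g. 'TRBV7-9*01' -> 'TRBV7-9'."""
    return gene.split("*")[0]


def is_valid_cdr3aa(cdr3aa):
    """Length 7-25, starts with C, ends with F or W, no stop codon."""
    return (
        7 <= len(cdr3aa) <= 25
        and cdr3aa.startswith("C")
        and cdr3aa.endswith(("F", "W"))
        and "*" not in cdr3aa
    )


def standardize_clones(rows):
    """Filter rows, merge duplicates by (cdr3nt, V, J), recompute freq.

    rows: dicts with keys count, cdr3nt, cdr3aa, v, d, j.
    """
    counts = defaultdict(int)
    info = {}
    for row in rows:
        if not is_valid_cdr3aa(row["cdr3aa"]):
            continue
        key = (row["cdr3nt"], strip_allele(row["v"]), strip_allele(row["j"]))
        counts[key] += row["count"]
        info.setdefault(key, (row["cdr3aa"], row.get("d", "")))
    total = sum(counts.values())
    result = []
    for key, count in counts.items():
        cdr3nt, v, j = key
        cdr3aa, d = info[key]
        result.append({"count": count, "freq": count / total, "cdr3nt": cdr3nt,
                       "cdr3aa": cdr3aa, "v": v, "d": d, "j": j})
    result.sort(key=lambda r: -r["count"])
    return result
```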

Functional analysis of lung cancer–enriched CDR3 sequences
To explore the potential functions of identified lung cancer–enriched CDR3 sequences, we performed TCR annotation based on sequence similarity to experimentally validated TCR–epitope pairs. Two databases were used for annotation: the NeoTCR database (37), containing TCR sequences for tumor-specific antigens (TSA), and the Immune Epitope Database (RRID: SCR_006604; ref. 38), containing sequences associated with pathogenic microorganisms.
The similarity is measured based on sequence alignment using TCRMatch (39), which was employed to identify matches between identified lung cancer–enriched CDR3β sequences and TCRβ-chains in the aforementioned databases with a threshold of 0.95. For sequences annotated using the Immune Epitope Database, we further categorized the epitopes according to the antigen types and source organism.

TCR–epitope binding score prediction
Interaction map recognition (ImRex; ref. 40) was used to predict the binding score between lung cancer–enriched CDR3β sequences and two groups of epitopes: lung cancer TSA epitopes and tumor-associated antigen epitopes. Lung cancer TSA epitopes were acquired from the NeoTCR database. Tumor-associated antigen epitopes were obtained from the Cancer Antigenic Peptide Database (CAPED, https://caped.icp.ucl.ac.be/), restricted to those also documented in VDJdb (41) with a score >0. TCR–epitope pairs with a predicted binding score ≥0.5 by ImRex were selected for further analysis.

Comprehensive features of radiology
Radiologic assessment plays a critical role in the diagnosis of lung nodules. To minimize variability due to individual interpretation, this study invited an intermediate-level radiologist to retrospectively analyze the admission CT scans of patients from the Sichuan Cancer Hospital, following the same standardized evaluation criteria. A total of 23 indicators were assessed, including nodule type, shape, margin, interface characteristics, lobulation, spiculation, pleural indentation, air bronchogram, vacuole sign, and vessel convergence. Based on these indicators and clinical experience, the radiologist made a subjective determination about the benign or malignant nature of each pulmonary nodule (referred to as rDOC_diagnosis).

Results

The study design and participants’ baseline characteristics
This study employed a two-phase translational research design (discovery to validation) to develop a TCR-based diagnostic approach for lung cancer, with a particular focus on IPNs (Fig. 1; Table 1). The prospective multicenter cohort study, initiated in 2021, included participants from Sichuan Cancer Hospital, Shenzhen Institutes of Advanced Technology, and Peking University Shenzhen Hospital, with ethical approval (registration: ChiCTR2200055761). In the discovery phase, we constructed a comprehensive lung cancer TCR database using a large-scale cohort (n = 7,047; Supplementary Table S1A) and systematically identified cancer-enriched TCR signatures through repertoire analysis. In the validation phase, we developed a TCR-based estimate of cancer-related immune signals and built a predictive model for the diagnosis of IPNs, followed by multicenter performance validation.

Methodology optimization for the TCR repertoire profiling
In our previous study, we used one-step mPCR to construct the sequencing library (9). However, this approach had several drawbacks, including a reduced yield of high-quality sequencing data, low amplification efficiency, and suboptimal coverage of the TCR region. These limitations wasted a substantial share of the sequencing data and raised costs. To address them, we developed an optimized PCR amplification approach that ensures all amplified sequences meet the sequencing requirements. In contrast, previous methods may have been affected by ligation efficiency, resulting in the generation of ineffective sequencing data. To evaluate and control for potential PCR amplification bias during TCRβ-chain library construction, we conducted a PCR cycle consistency test and a multiplex primer efficiency assessment using synthetic TCR templates (Supplementary Fig. S1B–S1D). The results showed low variability among different V–J combinations, indicating minimal primer-specific amplification bias. To evaluate the improvements, we applied the optimized method to a validation dataset (34 samples) from prior studies and conducted a paired comparison of quality control metrics and TCR diversity indices (Supplementary Table S1B). The optimized method demonstrated a substantial increase in quality control indicators (P < 0.001, Supplementary Fig. S2A and S2B). The Pielou index, a measure of TCR diversity, also increased significantly (P < 0.001, Supplementary Fig. S2C and S2D).
To maximize the extraction and biological interpretation of TCR repertoire data, we developed a computational framework named TCRFeature, which systematically computes 124 repertoire features across eight major categories (Supplementary Table S1C). These features include classical diversity indices (e.g., Shannon entropy, Simpson index, clonality), clonal frequency distributions, V/J gene usage, CDR3aa composition, CDR3 length distribution, convergence metrics, and cancer-related TCR signatures. These features provide quantitative metrics that comprehensively characterize the TCR repertoire at multiple levels. This multidimensional profiling offers a robust and scalable foundation for downstream immune landscape analysis and predictive modeling.
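As an illustration of the classical diversity indices named above, the following sketch computes Shannon entropy, the Simpson index, Pielou evenness, and clonality from a list of clone counts. It is not the TCRFeature implementation: the function name is hypothetical, and the exact definitions (natural logarithm, Simpson index as the sum of squared frequencies, clonality as 1 minus Pielou evenness) are common conventions assumed here, not taken from the source.

```python
import math

def diversity_indices(counts):
    # Keep clones with positive counts and convert to frequencies.
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    freqs = [c / total for c in counts]
    # Shannon entropy with natural logarithm (assumed convention).
    shannon = -sum(p * math.log(p) for p in freqs)
    # Simpson index as the sum of squared frequencies (assumed convention).
    simpson = sum(p * p for p in freqs)
    # Pielou evenness: entropy normalized by its maximum, ln(number of clones).
    pielou = shannon / math.log(len(freqs)) if len(freqs) > 1 else 0.0
    return {
        "shannon": shannon,
        "simpson": simpson,
        "pielou": pielou,
        "clonality": 1.0 - pielou,  # assumed definition of clonality
    }
```

A perfectly even repertoire gives a Pielou value of 1 and a clonality of 0, whereas a repertoire dominated by a few expanded clones moves both Shannon entropy and Pielou evenness downward.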

Lung cancer exhibits differential TCR characteristics and diversity
To identify lung cancer–associated TCR features, we analyzed a total of 7,047 samples, including 988 tumor specimens, 3,360 peripheral blood samples from patients with diagnosed lung cancer, and 2,699 peripheral blood samples from healthy individuals. TCRβ-chain sequencing was performed using our optimized TCR profiling method. Additionally, WES was conducted on 319 tissue samples to explore the tumor somatic variations in lung cancer (Supplementary Table S1D and S1E). Based on TCR and mutation information, we constructed a comprehensive lung cancer TCR database (https://www.lungtcr.com/), which reveals the distribution and frequency of cancer-associated TCR clones in LCT and blood (Fig. 2A). This database provides functionalities for data browsing, data analysis, and data downloading, allowing users to explore the available data, perform their own analyses, and download datasets for further analysis or offline use.
To investigate the relationship between TCR clone distribution and clinical characteristics, we classified TCR clones into five categories: hyperexpanded, large, medium, small, and rare clone types (Fig. 2B). Comparative analysis revealed that healthy individuals had a significantly higher proportion of small-frequency clones, whereas the malignant group exhibited a greater prevalence of large-frequency clones (P < 0.001, Supplementary Fig. S3A and S3B). We also analyzed the full clone size distributions across all samples and computed the mean clone frequency distribution within groups. Lung cancer samples exhibited a greater density of high-frequency TCR clones, that is, a heavier tail toward high-frequency clones, consistent with clonal expansion patterns (Supplementary Fig. S3C and S3D). Together, these visualizations highlight distinct clone frequency distribution features between malignant and nonmalignant conditions, suggesting possible clonal expansion events associated with tumor-related immune responses. Accordingly, lung cancer samples (both tumor tissue and peripheral blood) exhibited a significantly lower Shannon index, reflecting reduced TCR clonal diversity compared with healthy controls (P < 0.001, Supplementary Fig. S3E).
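The five-way clone classification can be sketched as a simple binning of clone frequencies. The frequency boundaries below are assumptions chosen for illustration (decade-wide bins often used in repertoire analysis); the cutoffs actually used in the study are not stated in this section, and the function names are hypothetical.

```python
# Upper frequency bound for each category; ALL boundaries are assumed values.
BINS = [
    ("rare", 1e-5),
    ("small", 1e-4),
    ("medium", 1e-3),
    ("large", 1e-2),
    ("hyperexpanded", 1.0),
]

def clone_category(freq):
    # Return the first category whose upper bound contains the frequency.
    for name, upper in BINS:
        if freq <= upper:
            return name
    raise ValueError("frequency must not exceed 1")

def occupied_space(freqs):
    # Sum the repertoire fraction occupied by each clone category.
    space = {name: 0.0 for name, _ in BINS}
    for f in freqs:
        space[clone_category(f)] += f
    return space
```

Comparing `occupied_space` between groups gives the kind of summary described above: a larger share in the "large" and "hyperexpanded" bins points to clonal expansion.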
To further assess the TCR repertoire in lung cancer and healthy individuals, we analyzed the overlap of TCR clones across healthy blood, LCB, and LCT samples and visualized the distribution using a Venn diagram (Fig. 2C), highlighting malignancy-associated TCR clones uniquely found in patients with lung cancer. Shared TCR clones between LCT and LCB (470,257) markedly exceeded those with healthy controls (170,616).
To identify lung cancer–associated CDR3β amino acid sequences (CDR3β aaSeq), we performed coexistence analysis to filter commonly shared TCRs across samples. An elbow plot illustrated the relationship between unique CDR3β aaSeqs and the filtering threshold (the number of individuals in which the CDR3β aaSeqs are detected) for group-enriched sequences (Fig. 2D). This analysis yielded 3,653 lung cancer tissue–enriched CDR3β aaSeqs, each observed in ≥15 individuals. Additionally, 3,840 CDR3β aaSeqs were identified as enriched in LCB, with each detected in >17 patients (Supplementary Table S2A–S2C). We evaluated the enrichment level of each CDR3β aaSeq identified in LCT and LCB by analyzing their frequency and occurrence within each group. The enriched CDR3β aaSeqs differed between the LCT and LCB groups (Fig. 2E).
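The sharing step of the coexistence analysis, counting in how many individuals each CDR3β aaSeq is detected and keeping those at or above a threshold, can be sketched as follows. This is a simplified illustration with hypothetical function names; it covers only the sharing count and the elbow-plot data, not the group enrichment statistics described in Materials and Methods.

```python
from collections import Counter

def shared_sequences(repertoires, min_individuals):
    # repertoires: dict mapping individual ID -> iterable of CDR3 aa sequences.
    seen_in = Counter()
    for seqs in repertoires.values():
        # set() ensures each individual contributes at most once per sequence.
        seen_in.update(set(seqs))
    return {s: n for s, n in seen_in.items() if n >= min_individuals}

def elbow_curve(repertoires, thresholds):
    # Number of unique sequences retained at each sharing threshold.
    return [(t, len(shared_sequences(repertoires, t))) for t in thresholds]
```

Plotting the pairs returned by `elbow_curve` reproduces the shape of an elbow plot: the number of retained sequences drops as the required number of individuals rises, and the bend suggests a filtering threshold.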
In parallel, we identified TCR sequences commonly shared across healthy peripheral blood samples, as well as those significantly enriched in healthy individuals compared with patients with lung cancer (Supplementary Fig. S3F; Supplementary Table S2D–S2G).
We performed an in-depth analysis of the CDR3β aaSeq length distribution among the lung cancer–enriched TCRs and compared them with healthy public (healthy-enriched TCRs; details in Materials and Methods) and random TCR sequences. Healthy controls showed enrichment at 14 amino acids, whereas both LCT- and LCB-enriched TCRs exhibited a modest shift toward longer CDR3β aaSeq lengths compared with healthy groups (Supplementary Fig. S3G–S3J). These findings suggest that a minor length-based difference exists in lung cancer–associated TCR sequences.

Evaluation of lung cancer–related immune signals based on cancer-enriched CDR3β
To assess the biological relevance of the identified lung cancer tissue–enriched (LCT-enriched) TCRs, we externally validated 3,653 LCT-enriched TCRβ clones using paired samples from independent datasets (immunoSEQ24; PRJNA422601). These clones showed marginally higher abundance in tumors versus adjacent tissue (P = 0.4) and were largely undetectable in noncancer-draining lymph nodes, supporting their enrichment in the tumor microenvironment and potential relevance to cancer-related immune responses (Supplementary Fig. S4A–S4C). To verify that the identified lung cancer–enriched TCRs represent disease-associated rather than randomly shared public sequences, a null–control comparison was performed by randomly splitting healthy samples into two subgroups (n = 1,000 each) and applying the same enrichment pipeline. Results confirmed disease-specific enrichment, with LCB samples (n = 1,000) retaining significantly more sequences under identical thresholds (Supplementary Fig. S4D). These findings suggest that TCR sharing in lung cancer exceeds background levels expected from healthy individuals. Furthermore, to evaluate whether these sequences could arise from random recombination, we calculated generation probabilities using the optimized likelihood estimate of immunoglobulin amino acid sequences model (42). Compared with healthy public and randomly sampled TCRs, both LCT- and LCB-enriched sequences exhibited significantly lower generation probability values, indicating a lower likelihood of stochastic generation and supporting antigen-driven selection (Supplementary Fig. S4E). Together, these results support the biological relevance and disease specificity of the identified lung cancer–associated TCRs.
Building upon evidence for the biological relevance of LCT-enriched TCRs, we next developed the LCS to quantify tumor-related immune signals within individual TCR repertoires. The LCS integrates three key features: the presence of LCT-enriched CDR3β aaSeqs, clonal similarity metrics, and clinical relevance. This composite score comprises two components: the LCTS and the LCBS (Supplementary Fig. S4F; Materials and Methods). To systematically evaluate the discriminatory power of TCR signatures, we performed UMAP clustering by integrating LCS, repertoire diversity metrics, V/J gene usage, and CDR3 compositional features. This analysis revealed clear clustering between lung cancer and healthy individuals (Fig. 2F). Motif analysis of the top 100 LCT-enriched TCRs revealed a unique CDR3aa pattern, providing a visual summary of sequence-level characteristics enriched in lung cancer and showing differences compared with patterns in LCB-enriched TCRs (Fig. 2G). Both LCTS and LCBS were significantly elevated in patients with lung cancer, particularly in tumor samples, compared with healthy controls (P < 0.001, Fig. 2H; Supplementary Fig. S4G and S4H), consistent with clinical characteristics.
To assess the utility of lung cancer–enriched CDR3β sequences for detecting cancer-associated signals, we analyzed TCR-seq data from independent cohorts using TCRdb and immuneACCESS (Supplementary Table S2H). Using LungTCR’s analytical tools, we computed 124 TCR features, including LCTS and LCBS, from raw sequencing data (Materials and Methods). LCT exhibited significantly higher LCTS than healthy controls (P < 0.001), supporting its potential as a cancer-specific marker. Elevated LCS was also observed in breast cancer and cutaneous melanoma but not in patients with COVID-19, indicating specificity for cancer-related immune signals over general immune activation (Fig. 2I). These results validate the robustness of LCS in distinguishing cancer-specific TCR signatures from noncancer immune responses.

TCR characteristics are associated with tumor mutational profiles
To investigate the relationship among LCS, TCR repertoire characteristics, and tumor mutational profiles, we performed an integrated analysis in 319 LCTs with both TCR-seq and WES data. We observed a clear correlation among the LCTS, LCBS, TCR clonal diversity, Shannon index, and TMB (Fig. 3A). These findings suggested that CDR3β aaSeqs enrichment in lung cancer may be driven by tumor neoantigens and could be indicative of antitumor immune responses. Additionally, T-cell clone Shannon entropy showed a negative correlation with TMB (r = −0.24, P < 0.001) and maximum variant allele frequency (mVAF), which confirmed that cancer-related mutations lead to specific TCR clonal expansion. A slight correlation was observed between LCTS and PD-L1 expression, assessed using TPS. Nonsynonymous mutations in TP53 and EGFR are the most frequent in lung cancer, followed by KRAS, CSMD3, LRP1B, APC, and PIK3CA (Fig. 3B). Consistent with previous studies, higher TMB, which is viewed as being associated with increased T-cell activation, was likely driven by the interactions between TCRs and tumor neoantigens, displaying a positive correlation with LCTS (r = 0.22; P < 0.001). Similarly, higher mVAF, indicative of a larger proportion of tumor cells, was positively correlated with LCTS (r = 0.35; P < 0.001; Fig. 3C and D).
We next explored the relationship between the canonical oncogenic driver mutations and TCR repertoire characteristics. EGFR-mutant tumors demonstrated significantly higher TCR Shannon entropy compared with EGFR wild-type tumors, whereas KRAS-mutant tumors exhibited significantly lower Shannon entropy than their wild-type counterparts (Fig. 3E). These results suggest that EGFR mutations may induce a weaker antitumor immune response, which aligns with the limited efficacy of immunotherapy in patients with EGFR mutations (43). In addition, TP53-mutant tumors exhibited a significant increase in TMB, consistent with the role of TP53 as a tumor suppressor gene. Furthermore, as our analyses showed that TCR diversity correlates with TMB and mVAF, we explored related immune markers and observed that tumors with high PD-L1 expression (TPS > 50%) also exhibited higher TMB. This supports the link between tumor genomic burden and immune response, which may influence TCR repertoire characteristics.
Overall, these findings demonstrate that TCR repertoire characteristics are affected by both tumor mutational load and oncogenic mutation type, highlighting the intrinsic relationship between TCR characteristics and tumor mutational profiles.

TCR profiling for patients with IPNs
To further validate the diagnostic value of TCR profiling for IPNs, we conducted a multicenter prospective clinical study (Fig. 4A; Supplementary Table S3A–S3D). Preoperative blood samples from all enrolled patients were subjected to genomic-level TCR-seq. After filtering raw CDR3β aaSeqs, we identified the most common CDR3β motif pattern, CASSLGGGSNYEQYF, which was found in both benign and malignant groups (Fig. 4B). In our study, conventional prediction methods failed to effectively differentiate benign from malignant nodules, yielding AUC values below 0.6 for all methods (Fig. 4C). Furthermore, we evaluated the diagnostic performance of CT-based imaging indicators associated with lung cancer. Notably, radiologists’ experience achieved an AUC of 0.62 (Fig. 4D), highlighting the limitations of current imaging-based diagnostic approaches.
To enhance diagnostic accuracy, we analyzed various features derived from TCR profiling (Supplementary Table S3E–S3I). To assess their diagnostic value, we performed ROC analysis and Wilcoxon tests on all indices (Supplementary Fig. S5A–S5F). Our analysis revealed that three TCR V genes—TRBV28, TRBV27, and TRBV5-5—were identified as significant markers and achieved AUC values of 0.63, 0.61, and 0.61, respectively (Fig. 4E and F), indicating their importance in distinguishing between benign and malignant nodules. UMAP was applied to reduce the dimensionality of TCR-derived features, resulting in clear differentiation between benign and malignant groups (Fig. 4G). Collectively, TCR profiling has demonstrated its potential in aiding the diagnosis of pulmonary nodules, surpassing the efficacy of existing imaging-based approaches.

The construction and validation of the TCRnodseek plus model to diagnose IPNs
To integrate various features derived from clinical and imaging features and TCR characteristics, we performed systematic feature selection, explored different feature combinations, and compared various machine learning algorithms (Supplementary Table S4A–S4C). Recursive feature elimination analysis revealed that optimal diagnostic performance was achieved when selecting between 20 and 40 features. Next, we evaluated the performance using five different algorithms on the testing dataset while varying the number of selected features. Among these, a combination of 30 features demonstrated the best overall prediction performance (Fig. 5A). By utilizing this feature set, we constructed predictive models with five different algorithms: svmLinear, randomForest, svmRadial, gbm, and glmnet. Performance evaluation across multiple metrics identified glmnet as the most effective algorithm for our analysis (Fig. 5B). This glmnet-based model, named TCRnodseek plus, incorporates 30 key features (Supplementary Table S4D–S4F). Among the top-ranked features, radiologic characteristics, including nodule type [e.g., whether the nodule was a ground-glass nodule (GGN) or a solid nodule], nodule size, and the presence of spiculation, played important roles in classification, demonstrating their crucial relevance to malignancy assessment. Repertoire-based features such as the lung cancer–enriched scores (LCTS and LCBS), Pielou evenness index, and Shannon–Wiener diversity were also informative, suggesting altered clonal architecture in malignant conditions. In addition, the usage of specific TRBV genes (e.g., TRBV5-5 and TRBV27) and amino acid composition patterns (e.g., elevated leucine and tryptophan abundance) contributed to model performance, reflecting potential immune selection and tumor antigen recognition. These findings collectively support the biological interpretability and robustness of the integrated model (Fig. 5C). 
These features played a crucial role in the predictive performance of the model, further reinforcing the potential of TCR profiling in lung nodule diagnosis.
To facilitate user-friendly application, we integrated feature extraction and model application into the LungTCR platform (https://www.lungtcr.com). The TCRnodseek plus model is designed for seamless use, with 25 out of 30 input indicators that can be automatically generated when users upload their TCR-seq data to the platform. The module performs feature calculation to extract the necessary inputs, minimizing manual data processing. Users only need to provide five additional features: GGN, pSN, spiculation, nodule size, and smoking status. By following the guide provided in the model prediction section, users can easily apply the TCRnodseek plus model for lung nodule risk assessment.
To validate the TCRnodseek plus model, we evaluated its performance using an independent validation cohort. The model was evaluated across three groups: the training group, test group, and validation group. Notably, the AUC values for all three groups were above 0.8, demonstrating good predictive performance (training = 0.88, test = 0.83, and validation = 0.84). This indicates that the TCRnodseek plus model is a robust and effective method for predicting the malignancy of pulmonary nodules, suggesting its potential for clinical application (Fig. 5D and E).
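For readers less familiar with the metric, the AUC reported here equals the probability that a randomly chosen malignant case receives a higher prediction score than a randomly chosen benign case. A minimal sketch of that pairwise definition is given below; the function name is hypothetical, labels are assumed to be coded 1 for malignant and 0 for benign, and this is not the software used in the study.

```python
def roc_auc(labels, scores):
    # Split scores by class (1 = malignant, 0 = benign; assumed coding).
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    # Count pairs where the malignant case scores higher; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of this size; rank-based formulas give the same value more efficiently.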
To assess whether TCRnodseek plus outperforms a model based solely on clinical features or TCR features, we constructed both clinical- and TCR-only predictive models using the same glmnet algorithm. We selected 13 key clinical features, including GGN, nodule size, and others, based on their association with benign or malignant lung nodules. The construction process followed the same methodology as TCRnodseek plus. As expected, TCRnodseek plus demonstrated superior performance compared with the clinical model or TCR model in both the testing and validation groups (Fig. 5F and G). Relative to the clinical model, the AUC with a 95% CI increased from 0.79 (0.67–0.89) to 0.83 (0.73–0.92) in the testing group and from 0.75 (0.65–0.84) to 0.84 (0.76–0.90) in the validation group. Although the TCR-only model showed moderate performance (AUC = 0.67 in both test and validation sets), integration with clinical features significantly boosted performance (AUC = 0.84), demonstrating that TCR signals provide complementary and nonredundant diagnostic value. To further evaluate model performance while accounting for complexity, we calculated Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) values across the three models (Supplementary Table S5). The TCRnodseek plus model consistently achieved the lowest AIC scores in both test and validation cohorts, indicating improved model fit. Although the clinical-only model showed slightly lower BIC in the test set, likely due to fewer input variables, the TCRnodseek plus model still maintained competitive BIC values in the validation cohort. Together with the AUC improvements, these results indicate that integrating TCR features improves diagnostic accuracy and discriminative power over a model relying solely on clinical parameters.
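The AIC and BIC comparison can be illustrated with the standard definitions, AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where L is the model likelihood, k the number of parameters, and n the number of samples. The sketch below evaluates both from predicted probabilities under a Bernoulli likelihood. The function names are hypothetical, and how parameters were counted for the penalized glmnet models is not stated in the source, so `n_params` is left to the caller.

```python
import math

def log_likelihood(labels, probs, eps=1e-12):
    # Bernoulli log-likelihood of binary labels given predicted probabilities.
    ll = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        ll += math.log(p) if y == 1 else math.log(1 - p)
    return ll

def aic_bic(labels, probs, n_params):
    ll = log_likelihood(labels, probs)
    n = len(labels)
    aic = 2 * n_params - 2 * ll
    bic = n_params * math.log(n) - 2 * ll
    return aic, bic
```

Because the BIC penalty grows with ln n per parameter, a model with fewer inputs can reach a lower BIC even when its fit is slightly worse, which matches the pattern described for the clinical-only model in the test set.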
Given the methodology optimization for TCR profiling, we evaluated whether TCRnodseek plus outperformed our previous model, TCRnodseek, in the same cohort (9). In this work, we applied the optimized TCR-seq method and the new model to the same validation group used in our previous study (9). This led to an increase in the AUC with a 95% CI from 0.80 (0.65–0.95) to 0.84 (0.69–0.95; Fig. 5H). These results indicate that optimizing the TCR library method and combining clinical and TCR features improved the accuracy and performance of the model, resulting in a more reliable and precise prediction of outcomes.

Evaluation of confounding factors of the TCRnodseek plus model
To assess the potential influence of confounding factors on the performance of the TCRnodseek plus model, we performed correlation analysis, adjustment analysis, and subgroup analysis. The TCRnodseek plus model generated a predictive value for each sample in both the testing and validation groups. Heatmap analysis revealed significant correlations between TCRnodseek plus prediction and several potential confounding factors, including gender, smoking, and GGN (Fig. 6A). In order to account for these confounding factors, we performed logistic regression analysis, which identified TCRnodseek plus prediction and nodule location in the upper lung as the only reliable predictors of malignancy (Fig. 6B). Nodule size showed negligible impact on cancer risk prediction (P = 0.51) when combined with TCR features (Fig. 6B; Supplementary Fig. S6A and S6B), confirming that TCR signatures are independent of nodule size.
Next, to evaluate whether the performance of the model was influenced by GGN presence, the testing/validation group was divided into GGN-positive and non-GGN subsets. In both subsets, the prediction values of TCRnodseek plus were significantly higher in patients with malignancy (P < 0.001 and P = 0.011, respectively; Fig. 6C). Furthermore, we also examined the impact of nodule size on model performance. The results demonstrated that patients with malignancy consistently exhibited higher TCRnodseek plus scores compared with benign cases across all size categories: small-size subsets (nodules smaller than 10 mm; P < 0.001), medium-size subsets (nodules between 10 and 20 mm; P < 0.001), and large-size subsets (nodules larger than 20 mm; P = 0.005; Fig. 6D). Further subgroup analysis of nodule position and gender demonstrated that these factors did not significantly affect the diagnostic value of TCRnodseek plus (Supplementary Fig. S6C and S6D). Through these analyses and the exploration of relationships and potential confounders, we gained valuable insights into the robustness and generalizability of the TCRnodseek plus model, confirming its consistent diagnostic performance across different patient subgroups.

Clinical applicability analysis for the TCRnodseek plus model
To further explore the clinical utility of TCRnodseek plus in different clinical settings, we compared its performance with that of radiologists and conducted a clinical association study to investigate the relationship between TCRnodseek plus predictions and key clinical variables. Improvement in reclassification was quantified by net reclassification improvement and integrated discrimination improvement. Compared with radiologists’ diagnoses (rDOC_diagnosis), TCRnodseek plus demonstrated a 73% net reclassification improvement (95% CI, 52%–94%) and a 35% integrated discrimination improvement (95% CI, 26%–44%) in both the testing and validation groups (Fig. 6E). Moreover, the diagnostic performance of traditional cancer markers (e.g., carcinoembryonic antigen, CYFRA21-1) was evaluated in the validation group and yielded AUC values below 0.6 (Supplementary Fig. S7A and S7B). These findings indicate that the TCRnodseek plus model achieved better discriminatory power and higher classification accuracy for diagnosing pulmonary nodules compared with both radiologists’ diagnoses and traditional cancer markers.
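The two reclassification measures can be sketched as follows. The source does not state which NRI variant was used, so the category-free (continuous) form is assumed here; the function names are hypothetical, and labels are assumed to be coded 1 for malignant and 0 for benign.

```python
def _mean(xs):
    return sum(xs) / len(xs)

def idi(labels, p_old, p_new):
    # Integrated discrimination improvement: gain in the difference between
    # mean predicted risk in malignant and in benign cases.
    def slope(p):
        events = [pi for y, pi in zip(labels, p) if y == 1]
        nonevents = [pi for y, pi in zip(labels, p) if y == 0]
        return _mean(events) - _mean(nonevents)
    return slope(p_new) - slope(p_old)

def continuous_nri(labels, p_old, p_new):
    # Category-free NRI (assumed variant): net share of malignant cases whose
    # risk moved up plus net share of benign cases whose risk moved down.
    up_e = down_e = up_n = down_n = 0
    for y, old, new in zip(labels, p_old, p_new):
        if new > old:
            up_e += y == 1
            up_n += y == 0
        elif new < old:
            down_e += y == 1
            down_n += y == 0
    n_e = sum(1 for y in labels if y == 1)
    n_n = len(labels) - n_e
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n
```

Here `p_old` would hold the reference predictions (for example, a radiologist's call coded as 0 or 1) and `p_new` the model's prediction scores; positive values of either measure indicate that the new predictions separate malignant from benign cases better.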
Beyond diagnostic performance, the clinical interpretability of TCRnodseek plus was enhanced through individualized BreakDownProfile (44) visualizations (Supplementary Fig. S7C and S7D). These plots reveal how clinical features (e.g., nodule type) and immunologic signals (e.g., LCTS, TRBV usage) collectively drive predictions, providing physicians with a biological rationale for model outputs—particularly valuable when justifying predictions in diagnostically ambiguous cases.
To implement the TCRnodseek plus model in a clinical setting, we identified optimal prediction score cutoffs to guide clinical decision-making (45). For each disease prediction, a baseline cutoff was set to maximize overall accuracy, whereas two confident cutoffs were chosen to optimize the PPV (∼95%) and negative predictive value (∼95%) in the discovery group, resulting in three distinct cutoffs for each prediction score (Fig. 6F). To address the clinical need for accurate diagnosis and exclusion, we applied the confident thresholds to categorize the prediction values into three regions, ensuring that more than 60% of patients received a definitive diagnosis. Specifically, patients with a PPV exceeding 0.8 accounted for 56% of all validated cases. For those below the confident PPV threshold of 0.09, the negative predictive value reached 92%, effectively diagnosing benign conditions (Fig. 6G). When using the overall cutoff for binary classification, TCRnodseek plus achieved an overall accuracy of 81%, which was comparable with the discovery cohort. In contrast, the diagnostic accuracy of three independent radiologists ranged from 62% to 76%, showing significant heterogeneity and consistently underperforming compared with TCRnodseek plus (Fig. 6H). Overall, these findings indicate that TCRnodseek plus shows promising and robust performance in distinguishing benign from malignant pulmonary nodules and demonstrates its superiority over conventional diagnostic methods.
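The three-region scheme can be illustrated with a small sketch that searches for a rule-in cutoff meeting a target PPV and a rule-out cutoff meeting a target negative predictive value, then assigns each score to a region. This is a simplified illustration with hypothetical function names and region labels, not the procedure from ref. 45; it assumes labels coded 1 for malignant and 0 for benign, and it picks the first candidate cutoff that meets each target.

```python
def ppv_at(labels, scores, cut):
    # PPV among cases called malignant (score >= cut); None if no such cases.
    called = [y for y, s in zip(labels, scores) if s >= cut]
    return sum(called) / len(called) if called else None

def npv_at(labels, scores, cut):
    # NPV among cases called benign (score < cut); None if no such cases.
    called = [y for y, s in zip(labels, scores) if s < cut]
    return (len(called) - sum(called)) / len(called) if called else None

def confident_cutoffs(labels, scores, target=0.95):
    cands = sorted(set(scores))
    # Lowest cutoff whose PPV reaches the target (rule-in).
    rule_in = next((c for c in cands
                    if (ppv_at(labels, scores, c) or 0) >= target), None)
    # Highest cutoff whose NPV reaches the target (rule-out).
    rule_out = next((c for c in reversed(cands)
                     if (npv_at(labels, scores, c) or 0) >= target), None)
    return rule_out, rule_in

def triage(score, rule_out, rule_in):
    if score >= rule_in:
        return "likely malignant"
    if score < rule_out:
        return "likely benign"
    return "indeterminate"
```

Cutoffs derived this way on a discovery set would then be applied unchanged to a validation set, where the achieved PPV and negative predictive value can differ from the targets, as in the results reported above.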

Functional characterization of lung cancer–enriched CDR3β aaSeqs
Understanding the biological significance of lung cancer–enriched CDR3β sequences is essential for their validation as molecular markers. The next question we addressed was why the cancer-enriched CDR3β aaSeqs can effectively distinguish lung cancer from benign conditions. We applied TCRdb 2.0 (https://guolab.wchscu.cn/TCRdb2/#/; ref. 46), a comprehensive database containing more than 691 million CDR3β sequences from 19,701 samples, to annotate all enriched CDR3β aaSeqs identified in this study (Supplementary Table S6A–S6F). Our analysis revealed that the majority of CDR3β aaSeqs enriched in healthy individuals are not annotated (Fig. 7A). This lack of annotation may be attributed to the high diversity of TCR repertoires in healthy individuals (9). Next, we observed that a significantly higher proportion of lung cancer–enriched CDR3β aaSeqs were annotated to NSCLC compared with those enriched in the healthy group (P < 0.001, Fig. 7B). However, unexpectedly, we found that the CDR3β aaSeqs enriched in LCT and malignant IPN (IPN_mal) showed similarities with the healthy enrichment group, with a P value greater than 0.05 (Fig. 7B).
We further investigated the specific enrichment of CDR3β aaSeqs in different groups by removing any overlaps with healthy individuals from TCRdb 2.0 and eliminating duplicated CDR3β aaSeqs. Intriguingly, this analysis revealed that both LCT and IPN_mal exhibited a substantial number of unique CDR3β aaSeqs compared with the other groups. This suggests that these particular CDR3β aaSeqs are selectively enriched in the LCT and IPN_mal groups, potentially indicating their biological relevance to lung cancer (Fig. 7C).
Considering the role of CDR3β aaSeqs in antigen recognition, we further analyzed whether these lung cancer–enriched CDR3β aaSeqs were associated with neoantigen-driven T-cell clonal expansion, as identified in the NeoTCR database (37). Our analysis revealed that a higher proportion of cancer-enriched CDR3 sequences were annotated as neoantigen-reactive T-cell clones, suggesting their potential involvement in tumor-specific immune responses (Fig. 7D).
To investigate the relationship between TCR profiles in peripheral blood and the tumor microenvironment, we compared the clonotype distribution of lung cancer–enriched CDR3 sequences with randomly selected CDR3 sequences from the same blood samples. From the TCR repertoire of LCB, we randomly selected 3,840 CDR3 sequences and estimated their total frequency in LCT tissues (Supplementary Table S6G). Our findings revealed that LCB-enriched CDR3 sequences were detected at a significantly higher abundance and frequency in tumors compared with random blood-derived CDR3 sequences (Fig. 7E). This suggests that the LCB-enriched CDR3 sequences likely originate from the tumor microenvironment and could serve as immune biomarkers for lung cancer.
Further analysis of paired blood and lung tissue TCR repertoires revealed that approximately 3% of T-cell clonotypes detected in lung tissue were also present in the corresponding blood samples (Fig. 7F). The proportion of overlapping clonotypes between paired tumor tissues was significantly higher than that observed in randomly selected tissue from the same cohort (P < 0.001, Fig. 7F). Similarly, the proportion of shared clonotypes in paired blood samples was significantly higher when compared with unpaired blood samples (randomly selected, P < 0.001, Fig. 7F; Supplementary Table S6H).
To further characterize the functional relevance of lung cancer–specific CDR3β aaSeqs, we integrated the lung cancer–specific CDR3β aaSeqs (Fig. 7C) with neoantigen analysis using the NeoTCR database and CAPED (https://caped.icp.ucl.ac.be/). We predicted the CDR3β aaSeqs–epitope interactions in lung cancer using ImRex (40). This analysis identified CDR3β aaSeqs–epitope pairs originating from CD8+ tumor–infiltrating lymphocytes (Supplementary Table S6I and S6J). The top 20 CDR3β aaSeqs (NeoTCR and CAPED) and corresponding epitope pairs based on binding scores are displayed (Fig. 7G; Supplementary Fig. S8A–S8E). Collectively, these findings demonstrate that lung cancer–enriched CDR3β aaSeqs have significant biological relevance, enabling them to effectively distinguish lung cancer from healthy individuals and benign cases. This reinforces the potential of TCR profiling as a valuable tool for lung cancer diagnosis and immune monitoring.

Discussion

In this study, we demonstrated the utility of TCR profiling in improving the diagnostic accuracy of pulmonary nodules. We performed large-scale TCR-seq on 6,059 blood and 988 tumor samples and established the LungTCR database. Using this dataset, we developed a novel method for the quantitative assessment of tumor-related immune responses and constructed the TCRnodseek plus predictive model, validated in a multicenter cohort of 1,108 blood samples from patients with IPNs. TCRnodseek plus represents an advancement over previous approaches, demonstrating robust performance characteristics. This work highlights how TCR profiling can provide valuable molecular and immune insights to enhance the clinical assessment of IPNs.
Diagnosing pulmonary nodules as benign or malignant remains challenging, particularly when relying solely on clinical information. Current guidelines mention various machine learning models incorporating clinical and imaging data, yet previous studies, including our prior research, indicate these methods have limited effectiveness (9, 47). To address this, researchers have developed machine learning models that integrate blood-based biomarkers, such as an miRNA signature classifier using 24 plasma miRNAs (48), a classifier based on 41 whole-blood gene expression levels (49), the PulmoSeek model using DNA methylation (6), and the CancerEMC model and AI-aided pulmonary nodules diagnostic model based on cell-free DNA mutations (50, 51). Although these blood-based models show improved performance, they still struggle to definitively differentiate IPNs, underscoring the need for more precise and reliable biomarkers.
Here, we improved TCR-seq technology and, leveraging the LungTCR database, developed and validated the TCRnodseek plus model, which demonstrated robust and accurate diagnostic performance across two independent cohort datasets. In the independent validation cohort, this approach outperformed the previous TCRnodseek, traditional cancer biomarkers (e.g., carcinoembryonic antigen), the Mayo Clinic model, other guideline-recommended diagnostic models, and radiologists’ assessments. Furthermore, the data, models, and feature extraction methods utilized in this study are publicly accessible via LungTCR, promoting transparency and facilitating further research (https://www.lungtcr.com).
This study also explored the biological mechanisms underlying the diagnostic value of TCR characteristics in distinguishing lung cancer from benign pulmonary nodules. We first identified lung cancer–enriched and indeterminate lung nodule–enriched CDR3β sequences. A larger proportion of lung cancer–enriched CDR3β sequences were annotated to NSCLC. Further analysis identified unique CDR3β sequences specifically enriched in lung cancer tissue and malignant IPNs, indicating their potential as biomarkers. Interestingly, the CDR3β sequences enriched in LCB were found in higher abundance within the tumor microenvironment, suggesting they could serve as minimally invasive biomarkers. By integrating the lung cancer–enriched CDR3β sequences with neoantigen and tumor antigen databases, we predicted potential CDR3β–epitope pairs that may play a role in lung cancer pathogenesis, warranting further experimental validation. In summary, the lung cancer–enriched CDR3β sequences demonstrate significant biological relevance in distinguishing lung cancer from benign lesions.
A notable limitation is that TCR profiling in a single experiment cannot capture the entire TCR repertoire due to the extremely high diversity and abundance of T cells. Studies have demonstrated that there are at least 10^6 unique TCR clones in human peripheral blood (52). However, various factors, such as PCR amplification efficiency and sequencing depth, limit the actual number of detectable TCR clones. In this study, the average number of detected TCR clones was ∼10^4, which may suggest that only a small portion of high-frequency clones could be reliably detected and measured. Despite these technical and biological limitations, this study demonstrates a proof of concept for using tumor-related TCR signatures as potential biomarkers for cancer detection and progression monitoring. Deeper sequencing and profiling in the future may enable finer-resolution characterization of cancer-associated TCR signatures, further enhancing the diagnosis of lung cancer.
In addition, due to the challenges in obtaining paired healthy lung tissue samples, our primary comparisons relied on peripheral blood from healthy donors and patients with lung cancer, as well as a subset of tumor tissues. Although we validated our findings using external datasets containing paired tumor, adjacent normal tissue, and lymph node samples, the absence of healthy lung tissue controls means we cannot definitively distinguish between lung tissue–resident TCRs and those truly specific to tumors. Although the LCT-enriched TCRs showed higher abundance in tumors compared with adjacent tissues in external validation, their presence in nonmalignant lung tissue suggests they may partially reflect local immune infiltration rather than exclusively tumor-specific responses. Nevertheless, the detection of these TCRs in peripheral blood may still hold clinical relevance, as they could indicate lung-specific immune activity or early tumor–related immune perturbations. Future studies incorporating multiregion sampling of tumor-adjacent normal tissues and longitudinal blood collections will help further refine the specificity of these TCR signatures.
In conclusion, the lung cancer–enriched CDR3β sequences demonstrate significant biological relevance in distinguishing lung cancer from benign pulmonary nodules. These results underscore the potential of TCR-based biomarkers for improving the clinical management of IPNs, offering a minimally invasive and immune-driven approach to lung cancer diagnosis.

Supplementary Material

Supplementary Figure S1. Modified TCR library methodological schematic overview and experimental validation for PCR amplification bias

Supplementary Figure S2. Performance improvement of the optimized TCR library construction

Supplementary Figure S3. TCR profiling of LungTCR database and lung cancer-enriched CDR3aa

Supplementary Figure S4. Evaluation and external validation of lung cancer-enriched TCR sequences

Supplementary Figure S5. TCR profiling of blood samples from patients with indeterminate pulmonary nodules

Supplementary Figure S6. Evaluation of confounding factors and the analysis of clinical applicability for the TCRnodseek plus model

Supplementary Figure S7. Model interpretability and lung cancer-enriched TCRs validation

Supplementary Figure S8. Biology and Disease Annotation of Enriched CDR3-beta aaSeqs

Supplementary Table S1. Clinical and TCR characteristics for LungTCR database

Supplementary Table S2. Identify lung cancer-enriched TCR sequences

Supplementary Table S3. Clinical and TCR characteristics for patients with indeterminate pulmonary nodules

Supplementary Table S4. Feature selection and model construction

Supplementary Table S5. Evaluate multiple model performance

Supplementary Table S6. Annotation of lung cancer enriched CDR3-beta sequences

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.
