
A Multimodal Ultrasound-Driven Approach for Automated Tumor Assessment With B-Mode and Multi-Frequency Harmonic Motion Images.

IEEE Transactions on Biomedical Engineering, 2026, Vol. 73(2), pp. 498–509

Hu S, Liu Y, Wang R, Li X, Konofagou EE


Cite this paper: Hu S, Liu Y, et al. (2026). A Multimodal Ultrasound-Driven Approach for Automated Tumor Assessment With B-Mode and Multi-Frequency Harmonic Motion Images. IEEE Transactions on Biomedical Engineering, 73(2), 498–509. https://doi.org/10.1109/TBME.2025.3586250
PMID: 40614147

Abstract

[OBJECTIVE] Harmonic Motion Imaging (HMI) is an ultrasound elasticity imaging method that measures the mechanical properties of tissue using amplitude-modulated acoustic radiation force (AM-ARF). Multi-frequency HMI (MF-HMI) excites tissue at various AM frequencies simultaneously, allowing for image optimization without prior knowledge of inclusion size and stiffness. However, challenges remain in size estimation as inconsistent boundary effects result in different perceived sizes across AM frequencies. Herein, we developed an automated assessment method for tumor and focused ultrasound surgery (FUS) induced lesions using a transformer-based multi-modality neural network, HMINet, and further automated neoadjuvant chemotherapy (NACT) response prediction. HMINet was trained on 380 pairs of MF-HMI and B-mode images of phantoms and in vivo orthotopic breast cancer mice (4T1). Test datasets included phantoms (n = 32), in vivo 4T1 mice (n = 24), breast cancer patients (n = 20), FUS-induced lesions in ex vivo animal tissue and in vivo clinical settings with real-time inference, with average segmentation accuracy (Dice) of 0.91, 0.83, 0.80, and 0.81, respectively. HMINet outperformed state-of-the-art models; we also demonstrated the enhanced robustness of the multi-modality strategy over B-mode-only, both quantitatively through Dice scores and in terms of interpretation using saliency analysis. The contribution of AM frequency based on the number of salient pixels showed that the most significant AM frequencies are 800 and 200 Hz across clinical cases.

[SIGNIFICANCE] We developed an automated, multimodality ultrasound-based tumor and FUS lesion assessment method, which facilitates the clinical translation of stiffness-based breast cancer treatment response prediction and real-time image-guided FUS therapy.


I. INTRODUCTION
Accurate tumor size assessment is critical for the diagnosis and therapy of breast cancer. Whereas benign lesions are usually associated with distinct, well-circumscribed margins, irregularly shaped masses with spiculated edges are highly suspicious for malignancy. These imaging features not only help physicians identify suspicious masses for biopsy but also inform therapeutic outcomes. In neoadjuvant chemotherapy (NACT), a systemic treatment given before surgery to downstage early-stage or locally advanced breast cancer, the treatment response guides adjuvant therapy and informs prognosis [1]. Image-derived biomarkers can be established to predict patients' treatment outcomes after NACT, which are evaluated by post-surgical pathology. The desired outcome is pathologic complete response (pCR), defined as the absence of cancer cells in the primary tumor and lymph nodes, and associated with long-term clinical benefits such as disease-free survival (DFS) and overall survival (OS) [2]. Early prediction of NACT response is imperative for timely treatment adjustment in clinical practice. In non-invasive thermal ablation therapy using focused ultrasound, heat absorption causes a rapid temperature increase at the focus, leading to protein denaturation and lesion formation. Intraoperative imaging allows surgeons to track procedure progress and determine endpoints [3]. An imaging modality with high spatial and temporal resolution for treatment monitoring guarantees reliable treatment and facilitates the clinical translation of FUS.
Among all imaging techniques, ultrasound is routinely used for breast imaging because it provides quality images of dense breasts while being non-ionizing, non-invasive, and cost-effective [4]. On traditional B-mode images, masses usually present as darker regions, i.e., hypoechoic, compared with healthy tissues. Ultrasound elastography, on the other hand, estimates the underlying mechanical properties of tissue to provide insight into its physiological condition. Tumors are often substantially stiffer than normal tissues due to the desmoplastic reaction, which is characterized by the deposition and alteration of extracellular matrix (ECM) proteins [5]. The desmoplasia-associated increase in stiffness is considered a critical factor in tumorigenesis, proliferation, and drug resistance in breast cancer [6], [7] and can be captured by elasticity imaging. Though B-mode is well-established for tumor size measurement, its performance can be unsatisfactory due to fuzzy tumor edges and mixed echogenicity patterns [8]. The addition of elasticity images can increase segmentation accuracy [9].
Harmonic Motion Imaging (HMI) is an ultrasound elastography technique. It deforms tissue using focused acoustic radiation force (ARF) and estimates the resulting on-axis displacements. "Harmonic motion" refers to the oscillatory tissue motion induced by amplitude-modulated acoustic radiation force (AM-ARF). Because the pre-defined modulation frequency is distinct from physiological motion (e.g., breathing) in the low-frequency range, noise that compromises displacement estimation can be efficiently suppressed using bandpass filtering [10]. Compared with shear wave elastography (SWE), which measures off-axis displacements to derive the shear wave speed and hence the shear modulus quantitatively, and Acoustic Radiation Force Impulse (ARFI) imaging, which measures on-axis displacements after an impulse excitation, HMI uses AM-ARF (AM frequency: 25 to 2000 Hz) and measures on-axis displacements during tissue excitation. These characteristics make HMI less prone to shear wave reflections and able to penetrate deeper.
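The narrowband noise rejection described above can be illustrated with a rough sketch; the FFT-based filter, the sampling rate, and the 10 Hz half-bandwidth below are assumptions for illustration, not the HMI implementation.

```python
import numpy as np

def bandpass_displacement(disp, fs, f_am, half_bw=10.0):
    """Keep only a narrow band around the known AM frequency f_am (Hz),
    suppressing out-of-band motion such as breathing.
    disp: 1D displacement trace sampled at fs (Hz)."""
    spectrum = np.fft.rfft(disp)
    freqs = np.fft.rfftfreq(disp.size, d=1.0 / fs)
    keep = (freqs >= f_am - half_bw) & (freqs <= f_am + half_bw)
    return np.fft.irfft(spectrum * keep, n=disp.size)

# Toy trace: 50 Hz harmonic motion contaminated by a slow 0.3 Hz drift
fs = 10_000
t = np.arange(0, 0.5, 1 / fs)
trace = np.sin(2 * np.pi * 50 * t) + 2.0 * np.sin(2 * np.pi * 0.3 * t)
clean = bandpass_displacement(trace, fs, f_am=50)
```

After filtering, the slow drift is removed and the unit-amplitude 50 Hz oscillation is recovered.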
So far, our group has demonstrated HMI's ability in tissue characterization [11], [12], tumor treatment response prediction [13], and FUS treatment monitoring [10], [14]. For image-guided focused ultrasound surgery (FUS), whereas MR thermometry can accurately map temperature changes during ablation, it suffers from poor temporal resolution and limited accessibility. Alternatively, real-time ultrasound-based monitoring is more compatible with FUS and can readily verify whether the treated site is positioned correctly [15]. However, chaotic cavitation events during FUS complicate the estimation of lesion size on B-mode images. By generating a high-intensity focused AM-ARF, HMI-guided focused ultrasound surgery (HMIgFUS) ablates the targeted tissue while simultaneously assessing lesion formation through stiffness-based measurements. The capability of performing ablation and monitoring simultaneously is a unique advantage of the HMIgFUS system.
Recently, the development of multi-frequency HMI (MF-HMI) has allowed displacement measurements at various AM frequencies simultaneously [16], [17]. It offers flexibility in optimizing image quality regardless of inclusion dimension and stiffness, given that inclusions respond differently to ARF oscillations. Higher frequencies are more sensitive to stiffer and/or smaller inclusions. For a stiffer inclusion, the shear wave speed is higher, which is associated with a larger shearing region and greater boundary estimation error. An increased AM frequency results in a smaller shearing region, thereby reducing the boundary effects. However, image quality cannot be improved monotonically, as the signal-to-noise ratio is negatively impacted by attenuation, which increases with shear wave frequency [17]. This underscores the necessity of exploring MF-HMI images to enhance lesion detectability. Additionally, manual segmentation on MF-HMI images can be time-consuming and is influenced by confounding factors such as the operator's experience and the choice of dynamic range. For downstream tumor assessment based on HMI, predictions of NACT response are highly subject to the selection of regions of interest (ROI). As a result, an automated segmentation method can increase objectivity and accuracy in NACT response prediction. Moreover, an efficient network that supports real-time segmentation is a critical step in translating image-guided FUS therapy to the clinic. Networks built on MF-HMI and B-mode were developed in our previous work, in which we showed moderately accurate results on phantom and in vivo murine tumors and one clinical case [18], [19]. However, both the strategy for learning from multiple AM frequencies and the clinical performance of these approaches can be further improved.
Many efforts have been made in the field of machine learning-aided analysis on combining multi-modality ultrasound images. Xie et al. [20] employed automated breast ultrasound (ABUS) and contrast-enhanced ultrasound (CEUS) to generate tumor size, blood flow, and vessel information to classify pCR/non-pCR after NACT. Misra et al. [21] proposed a multimodal fusion network for segmentation and malignancy classification utilizing B-mode and strain elastography. We believe that deep learning-based analysis on MF-HMI is also worth exploring, as the mechanical information is intrinsically linked with the tumor microenvironment [22], and MF-HMI provides high quality and flexibility among elasticity imaging techniques.
To maximize the utility of the acquired data and enhance segmentation accuracy, we utilized both MF-HMI and B-mode images. This task can thus be formulated as a multimodality segmentation problem. It has been shown that analysis can greatly benefit from more than one imaging source, each providing complementary information, even if some of the inputs are suboptimal, in accordance with weak learnability in ensemble learning [23]. In the task of brain tumor segmentation, Guo et al. [24], using convolutional neural networks (CNNs), empirically demonstrated the higher accuracy of multimodality networks over single-modality ones; interestingly, multimodal networks applied to images degraded with artificial noise still outperformed their single-modality counterparts. They further concluded that networks combining modalities at early and intermediate stages, i.e., at the input level or within the classifier, performed better than those fusing at later stages (the decision-making level). In addition to considering when to integrate modalities, some works adaptively adjust the weights of modalities to generate a more discriminative model [25] or customize the loss function by minimizing the joint loss of the multi-modality prediction outcome together with the losses of the individual modalities [26].
Major contributions of this work are summarized as follows:
Constructed a segmentation neural network, HMINet, which enabled learning from more than one modality and multiple input channels (11) and achieved high clinical generalizability through transfer learning.

Demonstrated that the multimodal strategy enhanced the robustness of clinical data assessment, which was supported by saliency analysis; also identified the most important AM frequencies across clinical cases based on the number of salient pixels.

Automated the workflow to predict breast tumor response to NACT. The automatic segmentation and generation of regions of interest (ROI) ensured objective tumor response evaluation.

Segmented lesions forming during HMIgFUS in ex vivo tissue and a patient, and demonstrated the clinical feasibility of real-time inference.

II. METHODS
A. Dataset
This study was approved by the Institutional Animal Care and Use Committee and the Institutional Review Board of Columbia University. A total of 380 pairs of MF-HMI and B-mode images were used for training, including breast-tissue-mimicking phantoms with elastic inclusions (N = 214, diameter: 1.7, 2.5, 4.1, 6.5, 10.4 mm, Young's modulus: 6, 9, 36, 72 kPa, in an 18-kPa background), inclusions (N = 94, diameter: 3, 5, 8, 10, 13 mm, Young's modulus: 31–160 kPa, background: 5–10 kPa) with/without phase aberrations generated by ex vivo animal tissue (2–3 cm layers of chicken breast or porcine tissue) or metallic wire mesh, as well as in vivo 4T1 murine tumors (N = 72). Both phantom and in vivo mouse data were preprocessed and used during training. The validation dataset (N = 72) contains phantom data (N = 40) from different acquisitions but similar inclusion conditions, and in vivo mouse data (N = 32) from different acquisitions on the same mice. For in vivo animal data acquisition, 4T1 tumor cells (10⁶ per 50 μL), which are used to model human triple-negative breast cancer (TNBC), were subcutaneously implanted into the left lower quadrant of the abdomen of eight-week-old female BALB/c mice (obtained from The Jackson Laboratory). Tumor progression was monitored weekly using MF-HMI for three weeks, starting 6 days after tumor cell injection.
The test dataset included 1) phantom data (n = 32); 2) in vivo mouse data (n = 24), both different from the training and validation datasets; 3) FUS lesions in ex vivo animal tissue; 4) FUS lesions of a patient monitored in real time by the clinical HMIgFUS system; and 5) in vivo human malignant tumors (twenty samples (n = 20) from eleven patients (n = 11) with one or two time points). Patient information is summarized in Table I. Clinical studies were performed after obtaining written consent, in the clinical exam room for NACT patients and in the surgery suite under general anesthesia for the HMIgFUS patient. MF-HMI and HMI were acquired on freshly obtained ex vivo chicken breast tissue and on the patient, respectively, using HMIgFUS.

B. Multi-frequency Harmonic Motion Imaging (MF-HMI)
In a conventional HMI setup, a focused ultrasound transducer generates oscillatory ARF for tissue deformation, and a co-aligned imaging transducer tracks the resulting displacements [11], [12]. Oscillatory ARF is achieved by amplitude modulation (AM) using an AM signal containing one or several oscillation frequencies (ranging from 25 to 2000 Hz). Multi-frequency excitation (MFE) is carried out by either summing up a set of sinusoids (1) or generating a chirp sequence that contains a range of frequencies (2).
Hossain et al. [16] developed the single-transducer HMI (ST-HMI) system, which employs an imaging transducer both to excite tissue motion and to estimate the resulting displacement. The MF-ST-HMI sequence is a sum of ten sinusoids (from 100 Hz to 1000 Hz, in 100-Hz steps), as shown in (1). The AM signal was then used to modulate the amplitude (pulse duration) of the excitation pulses. A squared scaling factor was applied to compensate for the more significant acoustic attenuation at high AM frequencies. Multiple short tracking pulses were sent in between the long ARF excitation pulses for motion tracking. The overall interleaved excitation-tracking sequence had a frame rate of 10 kHz and a duration of 40–60 ms, consisting of 4–6 cycles of the fundamental AM frequency (100 Hz).
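The sum-of-sinusoids excitation of (1) can be sketched as follows. The exact form of the squared scaling factor is not reproduced above, so weighting each component by (f_i / 100 Hz)² is an assumption for illustration; the normalization to [0, 1] stands in for the pulse-duration modulation.

```python
import numpy as np

def mf_am_signal(t, freqs_hz, f_ref=100.0):
    """Sum-of-sinusoids AM signal. Each component is weighted by
    (f_i / f_ref)**2 — an assumed form of the squared scaling factor
    compensating the stronger attenuation at higher AM frequencies —
    then the sum is normalized to [0, 1] for amplitude modulation."""
    s = sum((f / f_ref) ** 2 * np.sin(2 * np.pi * f * t) for f in freqs_hz)
    return (s - s.min()) / (s.max() - s.min())

fs = 10_000                      # 10 kHz interleaved frame rate
t = np.arange(0, 0.04, 1 / fs)   # 40 ms = 4 cycles of the 100 Hz fundamental
am = mf_am_signal(t, freqs_hz=range(100, 1001, 100))
```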
The experimental setup for ST-HMI is illustrated in Fig. 1a. The imaging transducer (L7–4 (Philips Healthcare) for phantom and patient studies, L11–5 (Verasonics) for mouse studies) was driven by a Vantage Research System (Vantage 256, Verasonics Inc., Kirkland, WA, USA). Focused excitation and tracking beams were generated by sub-apertures with center frequencies and F-numbers of 4 MHz, 2.25 and 6 MHz, 1.75, respectively. To acquire 2D MF-HMI images, electronic steering was used to translate the sub-aperture from one side to the other, covering a lateral field of view of 18 mm with 32/38 evenly spaced RF lines (0.6 mm spacing). A B-mode image with 128 RF lines was collected right after the HMI measurements. For clinical data acquisition, a commercially available handheld ultrasound probe (Butterfly) was first used to locate the tumor, after which MF-HMI and research B-mode were acquired using the L7–4 transducer. Fig. 1b shows the experimental setup for the HMIgFUS system. It consisted of a single-element focused ultrasound transducer, driven by a dual-channel arbitrary waveform generator amplified by a 50-dB-gain power amplifier, and an imaging transducer (L22–14vxLF, Vermon, Tours, France) [27]. The L22 was driven by the Vantage system and was confocally aligned with the focused ultrasound transducer to track the tissue response. The focused ultrasound transducer generated prolonged, high-intensity, amplitude-modulated excitation pulses (compared to HMI) for thermal ablation and oscillatory motion excitation at the same time, enabling mechanical property measurement, i.e., assessment of the direct effect of FUS. The multi-frequency AM signal was a chirp signal with linearly varying instantaneous frequency, as shown in (2). The ablation lasted for 120 s with a 50% duty cycle. The derated imaging frame rate was 1 kHz.
FUS ran continuously during the entire 120-s ablation, while the imaging pulses stopped every 0.15 s for data transfer and resumed after the transfer finished, resulting in a total of 89 MF-HMI acquisitions.
The clinical HMIgFUS system was set up in a similar fashion: HMIgFUS generated high-intensity AM-ARF using a focused ultrasound transducer (fc = 3.1 MHz, −6 dB focal region: 0.41 × 2.65 mm, depth = 28 mm) and monitored the mechanical effects of FUS using a phased array (fc = 7.8 MHz). The clinical intervention (derated peak negative pressure Pneg: 6.0 MPa, duration: 120 s, 116 HMI acquisitions) was performed in the surgical suite prior to the patient's scheduled surgery, with the patient under general anesthesia. Displacements around the focus were estimated in real time to inform the treatment progress.
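A chirp-type AM envelope with linearly varying instantaneous frequency, in the spirit of (2), can be sketched as follows; the 100–550 Hz sweep limits, window length, and sampling rate are illustrative assumptions.

```python
import numpy as np

def chirp_am(t, f0, f1, T):
    """AM envelope whose instantaneous frequency rises linearly from f0
    to f1 over duration T: phase = 2*pi*(f0*t + k*t**2/2), k = (f1-f0)/T.
    Shifted/scaled to [0, 1] so it can serve as an amplitude envelope."""
    k = (f1 - f0) / T
    return 0.5 * (1.0 + np.sin(2 * np.pi * (f0 * t + 0.5 * k * t ** 2)))

fs, T = 10_000, 0.1              # sampling rate and sweep duration (illustrative)
t = np.arange(0, T, 1 / fs)
env = chirp_am(t, f0=100.0, f1=550.0, T=T)
```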

C. Data Processing
Channel data was beamformed using a GPU-based delay-and-sum (DAS) algorithm. For B-mode image formation, envelope detection and log compression were applied. Normalized 1D cross-correlation (window size: , 99% overlap) [28] was applied to estimate either the peak-to-peak displacement or peak positive displacement in the axial direction. Ten displacement images, from 100 to 1000 Hz, 1100 to 2000 Hz (100 Hz at a step), or from 100 to 550 Hz (50 Hz at a step), were derived from ST-HMI or HMIgFUS acquisition. Median filtering (window size of 1.5 mm) was applied to filter out outliers from displacement estimation errors and noise. An axial displacement profile (lateral size: 1–1.5 mm) was obtained in the background to correct the acoustic force attenuation across depths for ST-HMI images. More details can be found in [17], [27].
Due to the inherent heterogeneity of breast tumors, the stiffness of the tumor can vary broadly across patients, depending on their pathological conditions. Therefore, we calculated the displacement of the tumor relative to the surrounding tissue to assess the tumor response over the course of NACT treatment. As illustrated in (3), displacement ratios (DR) were computed as the ratio between the median displacement (D) within the tumor region of interest (ROI) and D in the ROI of surrounding tissue.
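A minimal sketch of the displacement-ratio computation in (3), on a toy displacement map with hypothetical ROI masks:

```python
import numpy as np

def displacement_ratio(disp_img, tumor_mask, background_mask):
    """Eq. (3): DR = median displacement within the tumor ROI divided by
    the median displacement in the surrounding-tissue ROI."""
    return np.median(disp_img[tumor_mask]) / np.median(disp_img[background_mask])

# Toy 2D displacement map: a stiff tumor moves less than the background
disp = np.full((64, 64), 10.0)
tumor = np.zeros_like(disp, dtype=bool)
tumor[24:40, 24:40] = True
disp[tumor] = 2.0
dr = displacement_ratio(disp, tumor, ~tumor)   # 2.0 / 10.0 = 0.2
```

A stiffer tumor (smaller displacement than its surroundings) thus yields a DR below 1.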

D. Multi-modality Segmentation Network
In recent years, some pre-trained models that require few or no domain adaptation have shown great promise in medical image segmentation, e.g., segment-anything-model (SAM) [29]. Nonetheless, unlike RGB or greyscale images, MF-HMI is rich in channels (ten channels corresponding to ten AM frequencies), providing detailed spectral information. Moreover, the boundary effects are modality-specific, distinctive from noise patterns commonly seen in other medical images. Lastly, we aim to incorporate complementary information by also utilizing B-mode images to enhance segmentation through a multi-modal approach.
For these reasons, we chose to train our network, HMINet [30], from scratch. Based on previous studies, modality fusion strategies can be categorized into three types: early-stage fusion, which combines all image channels at the input level; intermediate-stage fusion, which integrates modalities within the network; and late-stage fusion, which incorporates the outputs of modality-specific networks to obtain the final segmentation result. We employed the intermediate fusion strategy to better handle the imbalanced dimensions of the different modalities, i.e., B-mode (1 channel) and MF-HMI (10 channels), after feature extraction. The required interaction between the modalities also makes late fusion unsuitable [31].
The structure was adapted from mmformer [26], a multimodality transformer-based network. Fig. 2 illustrates the architecture of HMINet. For a multi-channel input X as shown in (4), the B-mode and MF-HMI parts go through modality-specific convolutional encoders (5-stage modules consisting of cascaded normalization, convolution, and activation layers), modality-specific Transformer encoders (8 attention heads), a modality-correlated encoder (with Bernoulli indicators to randomly drop out a modality during training), and 5-stage convolutional decoders (Conv decoders). The rationale for selecting mmformer was based on several key characteristics: the hybrid, modality-specific encoder facilitated local and long-distance context learning; the subsequent cross-modality encoder ensured feature alignment across B-mode and MF-HMI; and the auxiliary regularizer encouraged independent modality learning, preventing the model from relying too heavily on either B-mode or MF-HMI.

E. Model Training, Fine-tuning and Interpretation
HMINet was trained for 250 epochs on an NVIDIA RTX A5500, using an Adam optimizer (learning rate = 2e-4, betas = (0.9, 0.999), eps = 1e-8, weight decay = 1e-4) with the Dice-Cross Entropy (DiceCE) loss, and the epoch of the best model was selected based on the highest segmentation accuracy on the validation dataset. According to our experimental results, this number of epochs was large enough for the training loss to converge and the validation loss to begin to decrease. Segmentation accuracy was evaluated using the Dice Similarity Score, as shown in (5), measuring the similarity between the segmented (seg) and reference (ref) areas.
Pre-processing steps were applied to the input images: registration between MF-HMI and B-mode images; standardization (rescaling to the range [0, 1]); histogram equalization (equalizeHist) to enhance contrast; and, for phantom inclusions softer than the background and for lesions monitored during FUS, where displacements within the inclusion were higher than in the background, a linear transformation involving inversion and offset addition, applied so that all inclusion regions have lower values than the background. This approach reduced the false-positive rate by preventing the network from misclassifying noisy regions as inclusions. All images were resized to 144 × 144 pixels (0.07–0.13 mm per pixel). Data augmentation was applied during training, including random cropping and the addition of noise patches of random sizes every epoch. To accommodate the different AM frequencies used for HMI and HMIgFUS, the order of the MF-HMI channels was not fixed from low to high frequency but randomized every epoch, which also helped to restrain overfitting. For FUS lesion visualization, displacement images in subsequent frames were subtracted from the first frame to show lesion size based on stiffness changes due to ablation. For the clinical HMIgFUS lesions, HMI monitoring was performed at 100 Hz, and the HMI images were replicated to the other nine channels to keep the input data format consistent.
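The per-channel image pre-processing can be sketched roughly as follows; histogram equalization and proper interpolation are omitted, and the nearest-neighbor resize and offset-free inversion are simplifications of what is described above.

```python
import numpy as np

def preprocess_channel(img, invert=False, out_size=144):
    """Rescale to [0, 1]; optionally invert the contrast so that inclusion
    regions always end up darker than the background (soft inclusions and
    FUS lesions show higher displacement than the background); then a
    nearest-neighbor resize to out_size x out_size stands in for the
    full pipeline (cv2.equalizeHist and bilinear resizing omitted)."""
    img = (img - img.min()) / (img.max() - img.min() + 1e-12)
    if invert:                       # flip contrast for high-displacement inclusions
        img = 1.0 - img
    r = np.linspace(0, img.shape[0] - 1, out_size).astype(int)
    c = np.linspace(0, img.shape[1] - 1, out_size).astype(int)
    return img[np.ix_(r, c)]

hmi = np.random.default_rng(0).random((200, 180)) * 30.0   # fake displacement map
x = preprocess_channel(hmi, invert=True)
```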
For clinical image analysis, to mitigate the domain shift between training and unseen clinical data, transfer learning was applied to fine-tune the pre-trained weights of the network, and the results were evaluated via 5-fold cross-validation. Reference boundaries of human breast tumors were delineated manually on research B-mode with the guidance of Butterfly B-mode and HMI images. All boundaries were drawn by an experienced researcher (3 years of relevant experience) and reviewed by a registered sonographer. Regions of interest (ROI) were generated automatically based on the HMINet-segmented boundaries for displacement ratio (DR) calculation. The tumor ROI covered 50% of the area of the bounding box of the segmented tumor, and the surrounding background ROI was at the same depth as the tumor ROI with a 2 mm lateral offset from the tumor boundary. DRs were calculated at all AM frequencies, and the median value was used to represent relative stiffness at that time point. In FUS lesion segmentation, to stabilize the output, boundaries were generated every four frames, given by the union of four masks derived from HMINet after thresholding. Both B-mode and gross pathology were used as references to evaluate the network's segmentation results.
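The automatic ROI generation can be sketched as follows. The exact geometry is not fully specified above, so the centering of the tumor ROI and the lateral width of the background ROI are assumptions of this sketch.

```python
import numpy as np

def make_rois(mask, px_mm, offset_mm=2.0):
    """From a binary tumor mask: tumor ROI = a centered box covering 50% of
    the bounding-box area (each side scaled by sqrt(0.5)); background ROI =
    same depth (row) range, laterally offset 2 mm from the tumor's right
    boundary. Returns (row_slice, col_slice) pairs. The background ROI
    width (same as the tumor ROI) is an assumption."""
    rows, cols = np.where(mask)
    r0, r1, c0, c1 = rows.min(), rows.max(), cols.min(), cols.max()
    h, w = r1 - r0 + 1, c1 - c0 + 1
    sh, sw = int(h * np.sqrt(0.5)), int(w * np.sqrt(0.5))
    rc, cc = (r0 + r1) // 2, (c0 + c1) // 2
    tumor_roi = (slice(rc - sh // 2, rc - sh // 2 + sh),
                 slice(cc - sw // 2, cc - sw // 2 + sw))
    off = int(round(offset_mm / px_mm))             # 2 mm in pixels
    bg_roi = (tumor_roi[0], slice(c1 + 1 + off, c1 + 1 + off + sw))
    return tumor_roi, bg_roi

mask = np.zeros((144, 144), bool)
mask[50:90, 40:80] = True                           # 40 x 40 px toy tumor
t_roi, b_roi = make_rois(mask, px_mm=0.1)
```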
To better understand the network prediction results, we applied saliency analysis to visualize the spatial significance of the input images. The saliency of a pixel can be estimated as the spatial derivative (gradient) of the loss function L with respect to the input image I at that pixel, as shown in (6). The magnitude in a saliency map indicates which pixels need to be changed least to affect the model output the most; in other words, which are most spatially important [32]. We computed saliency maps by backpropagating the Binary Cross Entropy (BCE) loss [33] between the model's predicted segmentation output and the ground-truth mask. For each pixel, the saliency was estimated as the maximum absolute gradient value across channels after normalization to [0, 1], as illustrated in (7). The saliency of each pixel was aggregated into a 2D parametric map, where high-valued regions correspond to locations that contributed more to segmentation. To validate whether multi-modality segmentation outperforms single modality, we trained and tested networks with the same architecture (HMINet) using multi-modality inputs and B-mode only, and compared the segmentation accuracy as well as the saliency representations.
Beyond visualization, we also investigated the role of different AM frequencies in network decision-making, where we resolved the number of salient pixels for each AM frequency across clinical cases. We first computed the per-pixel saliency values across all MF-HMI channels (AM frequencies) and set a threshold of the 95th percentile to preserve only the most impactful pixels. The number of salient pixels for each frequency is counted and displayed in a heatmap, providing an overview of relative frequency importance.
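The salient-pixel counting can be sketched as follows, assuming the per-channel gradient tensor has already been obtained by backpropagating the BCE loss (e.g., in PyTorch); attributing each surviving pixel to its maximum-gradient channel is an assumption of this sketch.

```python
import numpy as np

def salient_pixel_counts(grad, pct=95):
    """grad: (C, H, W) gradients of the BCE loss w.r.t. the MF-HMI input
    channels (backpropagation assumed done upstream). Each channel's
    absolute gradient is normalized to [0, 1]; the per-pixel channel
    maximum (eq. (7)) is thresholded at its 95th percentile; surviving
    pixels are counted per channel via their argmax channel."""
    g = np.abs(grad)
    g = (g - g.min(axis=(1, 2), keepdims=True)) / \
        (np.ptp(g, axis=(1, 2), keepdims=True) + 1e-12)
    per_pixel = g.max(axis=0)                  # max over channels per pixel
    thr = np.percentile(per_pixel, pct)
    keep = per_pixel >= thr                    # keep only the most impactful pixels
    winner = g.argmax(axis=0)                  # channel responsible for each pixel
    return np.bincount(winner[keep], minlength=g.shape[0])

rng = np.random.default_rng(1)
grad = rng.random((10, 64, 64))                # ten AM-frequency channels (toy data)
counts = salient_pixel_counts(grad)
```

The resulting per-channel counts are what the frequency-importance heatmap displays.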

III. RESULTS
Segmentation performances of different networks (U2NET, nnUNet, SWIN-UNETR, and HMINet using multi-modality and B-mode only) are summarized in Table II. Implementation details are described in Table III; all parameters were chosen based on the original literature to ensure optimal performance. The best-performing network was HMINet. We then fine-tuned the model on the patient data, achieving a Dice of 0.803±0.106 and demonstrating better adaptation to clinical data through transfer learning.
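The Dice similarity score behind these accuracy figures can be computed as in this minimal sketch of (5):

```python
import numpy as np

def dice(seg, ref):
    """Eq. (5): Dice = 2 * |seg AND ref| / (|seg| + |ref|) for binary masks."""
    seg, ref = seg.astype(bool), ref.astype(bool)
    return 2.0 * np.logical_and(seg, ref).sum() / (seg.sum() + ref.sum())

a = np.zeros((8, 8), bool); a[2:6, 2:6] = True   # 16-px square
b = np.zeros((8, 8), bool); b[3:7, 3:7] = True   # 16-px square, 9-px overlap
score = dice(a, b)                               # 2*9 / (16+16) = 0.5625
```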
A. Phantom and mouse studies
Fig. 3 shows representative B-mode and MF-HMI images from 100 to 1000 Hz, with manually segmented and HMINet-segmented boundaries of a 2.5-mm, 70-kPa inclusion and a 6.5-mm, 70-kPa inclusion. HMI images at 1000 and 700 Hz (the optimal frequencies with the highest contrast-to-noise ratio in each case) are overlaid on B-mode for visualization. The 2.5-mm, 70-kPa inclusion has better detectability at high AM frequencies, since a small inclusion in HMI is more prone to the axial shearing effect; the 6.5-mm, 70-kPa inclusion can be clearly seen across all AM frequencies. Reduced boundary effects are observed as AM frequency increases due to less shearing in the axial direction; however, SNR drops at 900 and 1000 Hz.
Fig. 4 shows a representative case of longitudinal monitoring of murine tumor progression using MF-HMI. HMI images with the highest CNR are overlaid on B-mode images for visualization. The tumor grew larger and stiffer (darker blue) over time. HMINet performed well at all time points, with Dice scores of 0.88, 0.85, and 0.82.

B. Assessment of human breast tumor response to NACT
Fig. 5 shows the HMI and B-mode images, HMINet-segmented boundaries, and saliency maps (from the multi-modality and B-mode-only networks). Patients 1–3 were breast cancer patients without NACT before surgery; patients 4–7 received NACT and were scanned at baseline and 3 weeks into treatment. Overall, the multi-modality segmentation outperformed B-mode-only segmentation in 7 out of 11 cases, and the average Dice scores across all clinical cases (n = 20) were 0.805 and 0.785, respectively. The B-mode-only network performed poorly in patients 1, 3, 5, and 7, where the tumors had fuzzy boundaries and mixed echogenic patterns, potentially due to multi-layered structures and calcification (appearing as bright clusters in B-mode, as shown in patient 5, 2nd time point). Similar findings have been reported for other automated B-mode-based breast tumor segmentation methods [34]. Calcification, i.e., the deposition of calcium, produces high echogenicity that can be problematic in B-mode-based segmentation, but it did not lead to substantial stiffness changes because of the small percentage of the tumor volume involved [35], and thus had a limited impact on elasticity imaging. These misleading fine structures were segmented correctly when mechanical information was added, as shown in the outputs of the multi-modal network.
Furthermore, we generated saliency maps for each network prediction, in which the brighter regions dominated the networks' attention. As indicated by the orange arrows, the B-mode-only maps had some high-saliency regions concentrated around the multi-layer structures (patients 1, 6, 7) and calcifications (patient 5) within tumors rather than around the actual boundaries. In addition, we calculated the average surface distance (ASD) between the HMINet-segmented and reference boundaries, quantifying the ability to capture shape irregularity. The average ASDs of these clinical cases were 1.22 and 1.62 for the multi-modality and B-mode-only networks, respectively. Notably, in patient 3, who had an irregular tumor margin, the B-mode-segmented contour exhibited pronounced misalignment with the reference boundaries compared to multimodality (76% increase in ASD) despite only a slightly lower Dice score (13% decrease). This suggests that the multi-modality network is more sensitive in detecting marginal features.
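A minimal sketch of the average surface distance between two boundary point sets; the symmetric formulation is an assumption, since the exact ASD variant is not specified above.

```python
import numpy as np

def average_surface_distance(a, b):
    """Symmetric ASD between boundary point sets a (N, 2) and b (M, 2):
    average nearest-neighbor distance from a to b and from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# Toy boundaries: concentric circles of radius 10 and 12 -> ASD = 2
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.c_[np.cos(theta), np.sin(theta)]
asd = average_surface_distance(10 * circle, 12 * circle)
```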
For NACT response prediction, the pretreatment DRs of patients 4 to 7 were 0.43, 0.11, 0.20, and 0.07, indicating that patients 4 and 6 had slightly softer tumors than patients 5 and 7. Three weeks later, the DRs of patients 4 and 7 had decreased, suggesting tumor stiffening and non-response. Conversely, significant increases (345% and 160%) in DR were observed in patients 5 and 6. Based on these DR changes, patients 4 and 7 were predicted to be non-pCR, and patients 5 and 6 were predicted to achieve pCR (the pCR result of patient 5 was unavailable due to transfer of care, and patient 7 is pending pathological confirmation). It is worth noting that only three weeks into NACT, when tumor size had changed little, if at all, the HMI-derived DR was sensitive enough to predict the pathological result at the treatment endpoint.

C. Contribution of AM Frequency
Fig. 6 presents the salient pixel counts for each AM frequency. Across patients, 800 and 200 Hz had the most salient pixels, suggesting more separable saliency distributions between tumor and non-tumor regions and thus a comparatively higher impact on the model’s segmentation decisions. In contrast, 1000 and 300 Hz had the fewest salient pixels. The higher significance at both the low- and high-frequency ends implies that the network learned comprehensively, jointly exploiting the high contrast provided by low frequencies and the better-defined (though noisier) boundaries at the high-frequency range.
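Counting salient pixels per AM-frequency channel, as in Fig. 6, can be sketched as follows; the threshold and the toy saliency values are assumptions for illustration:

```python
def salient_pixel_counts(saliency, threshold):
    """Count pixels whose saliency exceeds `threshold` in each
    AM-frequency channel.

    saliency: dict mapping AM frequency (Hz) to a flat list of
    per-pixel saliency values for that input channel.
    """
    return {f: sum(1 for v in vals if v > threshold)
            for f, vals in saliency.items()}

# Toy saliency maps for four channels (values are illustrative only)
maps = {
    200: [0.9, 0.8, 0.7, 0.1],
    300: [0.2, 0.1, 0.1, 0.1],
    800: [0.9, 0.9, 0.6, 0.6],
    1000: [0.1, 0.2, 0.1, 0.1],
}
counts = salient_pixel_counts(maps, threshold=0.5)
top = max(counts, key=counts.get)
print(counts, "most salient:", top, "Hz")  # 800 Hz dominates this toy case
```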

D. Monitoring of FUS Lesioning on Ex Vivo Tissue and In Vivo Patient
Fig. 7 shows the HMINet-segmented FUS-induced lesion on MF-HMI and B-mode images during ablation. Though the FUS lesion stiffens due to protein denaturation after ablation, tissue displacements around the FUS focus were higher than in the unablated region, potentially due to the liquefying process under heat absorption. At 550 Hz, the perceived lesion size was smaller than at 150 Hz, though the contrast was not as high. While B-mode before and after FUS ablation did show some differences, the lesioning process could not be observed during ablation; as shown in Fig. 7, MF-HMI was able to monitor the lesioning progress. Gross pathology was performed on the imaging plane immediately after ablation; a reference boundary was drawn and registered to the HMI/B-mode images, and the Dice score was calculated to be 0.85. The lesion boundary was also delineated from the difference between B-mode images before and after ablation, as denoted by the blue line.
Fig. 8 demonstrates the utility of real-time lesion segmentation in a clinical case. A 46-year-old female diagnosed with fibroadenoma was treated with the clinical HMIgFUS system at a derated pressure of 6 MPa for 120 s; stiffness changes were monitored by HMI during ablation at 100 Hz. Due to version-compatibility limitations between MATLAB and PyTorch, a simplified version of HMINet (based on U2NET) was imported into MATLAB as an ONNX file. Using our current computation resources (Intel® Xeon® Gold 6238R CPU @ 2.20 GHz), the average inference time was 2.04 s per batch (4 frames), providing a frame rate of 1.96 Hz. HMINet-segmented lesion boundaries were compared with the reference boundaries generated by the activecontour method with manually selected seeds and 200 iterations. The average Dice score was 0.762 ± 0.030. The segmented lesion was highly localized around the focal spot (depth = 28 mm), with a slight shift toward the pre-focal region, and its shape gradually became tadpole-like as the treatment proceeded. The lesion size increase rate aligned with the displacement decrease rate, as both reached plateaus at around 50 s into ablation, suggesting that the automated segmentation method precisely tracked the intervention progress of HMIgFUS in a real clinical setting.
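The reported frame rate follows directly from the per-batch latency; a quick arithmetic check:

```python
def frame_rate(batch_time_s, batch_size):
    """Inference frame rate (Hz) from per-batch latency."""
    return batch_size / batch_time_s

# Reported figures: 2.04 s per batch of 4 frames
rate = frame_rate(2.04, 4)
print(f"{rate:.2f} Hz")  # → 1.96 Hz
```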

IV. DISCUSSION
HMI is an elasticity imaging technique that induces localized, oscillatory tissue motion and estimates the resulting on-axis displacement during excitation. It is operator-independent and provides deep penetration, up to a focal depth of 30 mm [17]. Stiffness alterations in tumors can occur prior to any noticeable anatomical changes, which grants HMI the potential to make early predictions of NACT treatment response. At the same time, by design it offers real-time FUS lesioning monitoring with image quality that allows precise margin delineation to help ensure complete therapy.
In this study, we developed an automated tumor and FUS lesion assessment method for multi-frequency HMI systems. The goal was to fully utilize multi-frequency HMI (different tissue responses to AM frequencies) and the multi-modality information (acoustic and mechanical) from B-mode and MF-HMI. We explored this idea using neural networks and further improved its clinical performance through transfer learning. HMINet achieved high segmentation accuracy and predicted tumor response to NACT based on stiffness changes as early as three weeks into treatment. Moreover, the network using only B-mode inputs was compared with the one using multimodal inputs, and the benefits of combining imaging modalities were shown in both the Dice score and the average surface distance. Unlike B-mode images, where speckle noise limits edge and fine-structure detection [36], the high contrast of MF-HMI makes it more precise when depicting margin details. The enhanced robustness of multi-modality segmentation was further supported by saliency analysis: some highlighted regions in the saliency maps of B-mode-only segmentation correlated with mixed echogenicity/calcification, which were correctly classified using multi-modal inputs.
It is also worth emphasizing that the proposed method has actual real-time capability for FUS therapy monitoring. With our current computation resources, to obtain one frame of an HMI image, the data acquisition time is 2 s, the beamforming time 1.55 s, the displacement estimation 59.73 s, median filtering 18.00 s, bandpass filtering 5.58 s, and the inference time 0.51 s, which accounts for only 0.6% of the entire processing time. We will keep optimizing the processing speed to achieve automated lesion segmentation on HMI for real-time intervention guidance in the clinic.
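The 0.6% figure can be verified from the per-stage times listed above:

```python
# Per-frame HMI processing times (s), as reported in the text
stages = {
    "data acquisition": 2.00,
    "beamforming": 1.55,
    "displacement estimation": 59.73,
    "median filtering": 18.00,
    "bandpass filtering": 5.58,
    "inference": 0.51,
}
total = sum(stages.values())
share = 100.0 * stages["inference"] / total
print(f"total {total:.2f} s, inference {share:.1f}% of pipeline")
# → total 87.37 s, inference 0.6% of pipeline
```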
In MF-HMI, DR is affected not only by frequency-dependent boundary effects due to shearing in the axial direction after external acoustic radiation force excitation, but also by the internal viscoelasticity of the targeted tissue. As a result, median values of DR were used to assess the stiffness change over time, avoiding misleading analysis caused by poor image quality at any single AM frequency. In future work, the DR-AM frequency relationship will be investigated and correlated with the viscoelastic properties of tumors, which can potentially serve as another biomarker for tumor response prediction besides DR. Moving forward, to enrich the information in HMI measurements and increase reproducibility, 3D HMI will be performed using a row-column array (RCA) [37], followed by 3D automatic segmentation.
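Taking the median DR across AM frequencies, as described above, can be illustrated as follows (the per-frequency DR values are invented for the example):

```python
from statistics import median

def median_dr(dr_by_freq):
    """Median displacement ratio across AM frequencies, so that a
    single poor-quality frequency cannot skew the stiffness estimate."""
    return median(dr_by_freq.values())

# Toy per-frequency DRs with one outlier channel (values illustrative)
dr = {200: 0.21, 300: 0.19, 800: 0.22, 1000: 0.55}
print(round(median_dr(dr), 3))  # → 0.215 (outlier at 1000 Hz ignored)
```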
In the proposed network, the input order of frequencies was randomized. Though the effects of AM frequency on HMI image quality have been explored in phantoms, the relationship has not been fully established in mouse and patient data. By shuffling the frequency order, we forced the network to make decisions based on the most significant features in the multi-channel images rather than on a fixed trend across frequency channels. This reduces the risk of overfitting and promotes the transferability of the network across different AM frequency ranges.
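The channel-shuffling augmentation described here can be sketched as below; the function name and the string placeholders standing in for images are illustrative, not the paper's training code:

```python
import random

def shuffle_frequency_channels(volume, rng=random):
    """Randomly permute the AM-frequency channel order of one
    multi-channel HMI input.

    volume: list of per-frequency images (each image may be any
    object, e.g., a 2-D list of pixels); a new list is returned and
    the per-channel images themselves are untouched.
    """
    order = list(range(len(volume)))
    rng.shuffle(order)
    return [volume[i] for i in order]

# Toy input: four channels tagged by their AM frequency
channels = ["img_200Hz", "img_300Hz", "img_800Hz", "img_1000Hz"]
augmented = shuffle_frequency_channels(channels, random.Random(0))
print(augmented)  # same four channels, in a random order
```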
Limited clinical data size has long raised concerns in medical image analysis, especially regarding a network’s domain adaptation and poor generalizability in rare cases. In this study, we imaged phantoms covered by ex vivo tissues and wire mesh to introduce phase aberrations and mimic clinical data acquisition. The segmentation performance of HMINet on murine tumors, human tumors, and clinical FUS lesions demonstrated that, despite being trained on circular phantoms, the network was not overfitted to circular shapes and remained flexible in predicting irregular, less circular shapes. Furthermore, we applied transfer learning and cross-validation to reduce the effect of insufficient data size. From a clinical perspective, the main challenge to fully extending HMINet’s clinical utility is that not all malignant or benign breast tumor types can be included in the current dataset, owing to the limited sample size available and the low natural occurrence rate among the patient population. The underrepresentation of certain types and their associated image features [38] might lead to inaccurate learning and segmentation. One way to address this limitation is to expand the training dataset and generate synthetic data using diffusion models [39].
Determining reference boundaries in clinical and FUS lesion datasets can be challenging, as there is no real ground truth. To obtain accurate annotations, clinical reference boundaries were annotated by an experienced observer and later confirmed by a certified sonographer, with disagreements resolved; however, the evaluation method was still subject to inter-observer variability, a known limitation of ultrasound imaging. For clinical FUS lesions, reference boundaries were generated by the MATLAB built-in function activecontour after manual hyperparameter selection. Nonetheless, they should not be considered the actual lesion margins. In the future, histopathological analysis will be performed to compare the largest dimension of tumors and necrotic regions in the case of HMIgFUS with that derived from the predicted boundaries.
In addition to improving segmentation accuracy, we gained a deeper understanding of the impact of AM frequency on boundary segmentation through saliency analysis. We identified 800 and 200 Hz as the dominant AM frequencies, potentially providing key information, including clear margin delineation (high frequency) and robust image contrast (low frequency), consistent with our previous studies [40]. Tumor sizes did not vary widely in the current clinical dataset; in continued exploration, we will validate our results on tumors of a more extensive size range. It is also worth noting that identifying the most significant AM frequencies should not be interpreted as implying that mid-range frequencies are without merit; the synergistic effect of multiple AM frequencies on the network may be complicated. To rigorously study frequency interdependency, one can observe the performance change in the absence of one or different combinations of AM frequencies.
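Such a frequency-ablation study could enumerate the candidate channel subsets as follows (a sketch; the frequency set and the evaluation step are placeholders):

```python
from itertools import combinations

def frequency_subsets(freqs, drop=1):
    """All AM-frequency subsets obtained by removing `drop`
    frequencies, for a leave-frequencies-out ablation study."""
    keep = len(freqs) - drop
    return [list(c) for c in combinations(freqs, keep)]

freqs = [200, 300, 800, 1000]  # illustrative frequency set
for subset in frequency_subsets(freqs, drop=1):
    # here one would re-evaluate segmentation with only these channels
    print("evaluate with", subset, "Hz")
```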
From the perspective of data acquisition, the image quality of MF-HMI could be further improved. Liu et al. demonstrated that the displacement tracking sequence can be optimized in conventional HMI [41]. To improve image quality, especially in clinical data, imaging parameters such as the ARF pressure, parallel tracking, and F-number can be further investigated. We can also tailor the ARF sequence to contain specific AM frequencies informed by saliency analysis.
Moving toward the goal of predicting NACT efficacy and facilitating real-time FUS ablation monitoring, this automated segmentation network will serve as the initial stage for tumor and FUS lesion quantification. In the future, we will increase clinical data and establish other HMI-derived biomarkers, such as perilesional stiffening area, for more precise, subtype-specific NACT prediction. We will also systematically verify its robustness on FUS lesion segmentation by testing on more FUS lesions and in patients.

V. CONCLUSION
In this study, we developed an automated tumor and FUS lesion quantification method using a multimodality, transformer-based network, HMINet. Based on both the echogenicity and elasticity information of breast tumors, it exhibited superior performance on the phantom, in vivo mouse, breast cancer patient, and FUS-lesion datasets, achieving average Dice scores of 0.91, 0.83, 0.80, and 0.81, respectively. Enhanced clinical robustness of multi-modal over B-mode-only segmentation was shown both in quantitative results, in terms of average Dice and ASD, and through model interpretation using saliency analysis. Displacement ratios were calculated based on HMINet-segmented boundaries to predict tumor responses to NACT. Real-time inference (2 Hz) was implemented in MATLAB for a clinical case using the pretrained HMINet. Future work will focus on deepening our understanding of the impact of AM frequency on lesion detectability in more clinical cases and on developing a multi-task model capable of segmentation and radiomic extraction for NACT prediction and individualized FUS treatment outcome prediction.

Supplementary Material
HMI real-time monitoring with lesion segmentation

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article.
