Development of a robust corpus for automated evaluation of online health information in Chinese using the DISCERN scale.
APA
E T, Li X, et al. (2026). Development of a robust corpus for automated evaluation of online health information in Chinese using the DISCERN scale. Journal of the American Medical Informatics Association: JAMIA, 33(2), 316-325. https://doi.org/10.1093/jamia/ocaf175
MLA
E T, et al. "Development of a robust corpus for automated evaluation of online health information in Chinese using the DISCERN scale." Journal of the American Medical Informatics Association: JAMIA, vol. 33, no. 2, 2026, pp. 316-325.
PMID
41223037
Abstract
[OBJECTIVE] To develop the first comprehensive, standardized annotated corpus of Chinese online health information (OHI) using the full 16-item DISCERN instrument and to establish a reliable annotation process that supports automated quality assessment.
[MATERIALS AND METHODS] We assembled 510 web-sourced articles on breast cancer, arthritis, and depression. All articles were independently annotated by three trained raters using the DISCERN scale. Annotation followed a four-step workflow: data collection and preprocessing, rater training, iterative annotation, and quality control. Raters were calibrated through consensus sessions and designated calibration articles. The Dawid-Skene model aggregated individual annotations into final consensus scores. Original five-point ratings were retained and also binarized (scores 1-3 as low quality, 4-5 as high quality) to enable both fine-grained and coarse evaluation for machine learning.
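The dual-scale labeling described above can be sketched in a few lines. This is a minimal illustration of the stated binarization rule (1-3 low, 4-5 high); the function name and the item-to-score dictionary layout are illustrative assumptions, not part of the published corpus pipeline.

```python
def binarize_discern(scores):
    """Map five-point DISCERN item scores to binary quality labels.

    Per the paper's scheme: scores 1-3 -> "low", scores 4-5 -> "high".
    `scores` is a dict mapping a DISCERN item id to its consensus score.
    The original five-point values are kept alongside for fine-grained use.
    """
    binary = {}
    for item, s in scores.items():
        if not 1 <= s <= 5:
            raise ValueError(f"DISCERN scores must be in 1-5, got {s} for {item}")
        binary[item] = "high" if s >= 4 else "low"
    return binary


# Hypothetical example: two DISCERN items with consensus scores 5 and 3.
labels = binarize_discern({"q1": 5, "q16": 3})
```

Retaining both scales lets a downstream classifier be trained either as a binary filter or as a five-class ordinal model from the same annotations.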
[RESULTS] Initial annotation of a 60-article pilot produced low agreement (mean Krippendorff's α ≈ 0.022) due to subjective variability. Successive calibration exercises improved agreement markedly, culminating in a corpus-wide Krippendorff's α of 0.834. Consensus scores correlated strongly with individual rater scores, confirming annotation robustness. The dual-scale design yielded a relatively balanced distribution of labels across topics, with roughly equal representation of low- and high-quality articles, and preserved granularity for detailed DISCERN analysis.
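The reliability statistic reported above, Krippendorff's α, is computed from a coincidence matrix over all rater pairs within each unit. The sketch below is a minimal nominal-data version for illustration; the paper's five-point DISCERN ratings would ordinarily use an ordinal or interval distance function, which this simplification omits.

```python
from collections import Counter
from itertools import combinations


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data.

    `ratings` is a list of units; each unit is a list of category labels,
    one per rater. Units with fewer than two ratings are dropped.
    Returns alpha = 1 - D_o / D_e (observed vs. expected disagreement).
    """
    units = [u for u in ratings if len(u) >= 2]
    o = Counter()  # coincidence matrix: weighted pair counts, both directions
    for u in units:
        m = len(u)
        for a, b in combinations(u, 2):
            o[(a, b)] += 1 / (m - 1)
            o[(b, a)] += 1 / (m - 1)
    n_c = Counter()  # marginal totals per category
    for (a, _), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())
    d_o = sum(w for (a, b), w in o.items() if a != b)
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)
    return 1.0 if d_e == 0 else 1 - d_o / d_e


# Hypothetical toy data: three articles, two raters each.
alpha = krippendorff_alpha_nominal([[1, 1], [1, 2], [2, 2]])
```

Perfect agreement yields α = 1, while agreement no better than chance yields α ≈ 0, which is why the pilot value near 0.022 signaled essentially chance-level consistency before calibration.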
[DISCUSSION] Our iterative calibration approach and consensus modeling effectively addressed the subjective ambiguity inherent in quality assessment. The binary and five-class labeling strategies facilitate flexible downstream applications, allowing automated systems to perform both broad filtering and nuanced quality differentiation. The high inter-rater reliability demonstrates that rigorous training and consensus methods can overcome domain-specific annotation challenges.
[CONCLUSION] The resulting Chinese OHI corpus, annotated via a standardized DISCERN framework and refined through iterative calibration, provides a robust benchmark for training and evaluating machine learning models. This resource lays the foundation for scalable, reliable automated quality assessment of OHI in Chinese public health settings.
MeSH Terms
Humans; China; Internet; Machine Learning; Consumer Health Information; Breast Neoplasms; Data Curation; Depression