MedFact
Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

Jiayi He, Yangmin Huang*, Qianyun Du*, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai
Xunfei Healthcare Technology Co., Ltd.
*Corresponding Authors

🔔News

🔥 [09/18/2025]: We released MedFact, a benchmark for fact-checking on Chinese medical texts. 🥳

Abstract

The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.

Figure: an example MedFact instance.

Overview

MedFact is a challenging benchmark for Chinese medical fact-checking, built upon three core principles:

  • Rigorously Designed Pipeline: We construct MedFact through a human-in-the-loop pipeline that integrates large-scale AI filtering with fine-grained annotation by medical professionals. We also employ hard-case mining to systematically retain challenging instances designed to probe the limits of current LLMs.
  • Broad and Realistic Coverage: MedFact is curated from diverse real-world texts such as medical encyclopedias. It comprises 2,116 expert-annotated medical texts that span 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels.
  • Uncontaminated Data: MedFact is constructed from proprietary texts under copyright-compliant agreements, making it highly unlikely to be part of the common pre-training corpora for LLMs. This helps ensure a fairer assessment of their true fact-checking capabilities.
Figures: MedFact overview and data distribution.
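
For concreteness, the sketch below shows how the two evaluation tasks described above (veracity classification and error localization) might be scored against MedFact instances. The file layout and field names used here ("text", "is_erroneous", "error_span") are hypothetical stand-ins, not the released schema; treat this as a minimal outline under those assumptions rather than the official evaluation code.

# Illustrative sketch only: MedFact's actual file names, field names, and
# matching criteria are defined by the dataset release, not by this snippet.
import json

def load_instances(path):
    """Load expert-annotated instances from a JSON Lines file (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def evaluate(instances, predict):
    """Score a model on veracity classification and error localization.

    `predict` maps a medical text to a (is_erroneous, error_span) pair,
    where error_span is None for texts the model judges to be correct.
    """
    veracity_correct = 0
    localization_correct = 0
    erroneous_total = 0
    for ex in instances:
        pred_label, pred_span = predict(ex["text"])
        veracity_correct += int(pred_label == ex["is_erroneous"])
        if ex["is_erroneous"]:
            erroneous_total += 1
            # Count localization as correct only if the predicted span matches
            # the annotated erroneous span (exact match here; the benchmark
            # may use a looser matching criterion).
            localization_correct += int(pred_span == ex["error_span"])
    return {
        "veracity_accuracy": veracity_correct / len(instances),
        "localization_accuracy": localization_correct / max(erroneous_total, 1),
    }
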

Examples

BibTeX

@misc{he2025medfactbenchmarkingfactcheckingcapabilities,
      title={MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts}, 
      author={Jiayi He and Yangmin Huang and Qianyun Du and Xiangying Zhou and Zhiyang He and Jiaxue Hu and Xiaodong Tao and Lixian Lai},
      year={2025},
      eprint={2509.12440},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.12440}, 
}