Publications
  • 2023
  • Article
  • Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track

Benchmarking Large Language Models on CMExam—A Comprehensive Chinese Medical Exam Dataset

By: Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu and Michael Lingzhi Li
  • Pages: 22

Abstract

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluation, as well as solution explanations for open-ended evaluation of model reasoning. To enable in-depth analyses of LLMs, we invited medical professionals to label each question with five additional annotations: disease group, clinical department, medical discipline, area of competency, and difficulty level. Alongside the dataset, we conducted thorough experiments with representative LLMs and QA algorithms on CMExam. GPT-4 achieved the best results, with an accuracy of 61.6% and a weighted F1 score of 0.617, yet still fell well short of human accuracy, which stood at 71.6%. On the explanation task, LLMs generated relevant reasoning and improved after fine-tuning, but they still fell short of the desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of our LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines.
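
The abstract reports two headline metrics for the multiple-choice track: accuracy and weighted F1 over option labels. As a rough illustration only, the sketch below shows how such scores could be computed with scikit-learn. The record layout (field names such as "options", "answer", and the five annotation keys) is a hypothetical stand-in for the schema the abstract describes, not CMExam's actual field names.

```python
# Minimal scoring sketch for a CMExam-style multiple-choice evaluation.
# Assumption: the dataset's actual field names differ; this record is a
# hypothetical illustration of the abstract's description.
from sklearn.metrics import accuracy_score, f1_score

example_record = {
    "question": "...",        # exam question text (Chinese)
    "options": {"A": "...", "B": "...", "C": "...", "D": "...", "E": "..."},
    "answer": "C",            # gold option label
    "explanation": "...",     # solution explanation for open-ended evaluation
    # The five expert annotations described in the abstract:
    "disease_group": "...",
    "clinical_department": "...",
    "medical_discipline": "...",
    "competency_area": "...",
    "difficulty": "...",
}

def score_choices(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """Accuracy and weighted F1 over predicted option labels."""
    return {
        "accuracy": accuracy_score(gold, predicted),
        # Weighted F1 averages per-class F1 scores, weighted by class support.
        "weighted_f1": f1_score(gold, predicted, average="weighted"),
    }

print(score_choices(["C", "A", "B"], ["C", "B", "B"]))
```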

Keywords

Large Language Model; AI and Machine Learning; Analytics and Data Science; Health Industry

Citation

Liu, Junling, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, and Michael Lingzhi Li. "Benchmarking Large Language Models on CMExam—A Comprehensive Chinese Medical Exam Dataset." Conference on Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track 36 (2023).

About the Author

Michael Lingzhi Li

Technology and Operations Management

More from the Authors

  • 2025
  • Journal of Business & Economic Statistics

  Statistical Inference for Heterogeneous Treatment Effects Discovered by Generic Machine Learning in Randomized Experiments

  By: Kosuke Imai and Michael Lingzhi Li

  • 2024
  • Journal of Causal Inference

  Neyman Meets Causal Machine Learning: Experimental Evaluation of Individualized Treatment Rules

  By: Michael Lingzhi Li and Kosuke Imai

  • 2024
  • Faculty Research

  Learning to Cover: Online Learning and Optimization with Irreversible Decisions

  By: Alexander Jacquillat and Michael Lingzhi Li