Reasoning over Knowledge Graphs: Enhancing LLMs for Trustworthy Medical Question Answering

Jinghui Mao, Menglin Cui, Yuejia Dai

doi:10.25236/AJCIS.2026.090503

Academic Journal of Computing & Information Science, 2026, 9(5); doi: 10.25236/AJCIS.2026.090503.

Corresponding Author:

Menglin Cui

Affiliation(s)

Business School, University of Shanghai for Science and Technology, Shanghai, China

Abstract

Medical question-answering (QA) systems based on large language models (LLMs) show promise for clinical consultation and health information services. However, their use in high-stakes medical scenarios is limited by hallucinated responses, unreliable reasoning, and insufficient factual grounding. To address these challenges, this paper integrates a domain-specific medical knowledge graph into a retrieval-augmented generation (RAG) framework to constrain LLM outputs for trustworthy medical QA. In addition, a graph-validated Weighted Factuality Score is introduced to evaluate the factual reliability of generated responses by verifying their atomic facts against knowledge graph evidence. Experimental results on a medical dataset show that the proposed knowledge graph-enhanced RAG framework achieves a higher average factuality score than the baseline. These results indicate that incorporating knowledge graph constraints enhances both the factual reliability and the interpretability of LLM-based medical QA systems.
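As a rough illustration of graph-validated scoring, the Python sketch below treats a response as a list of atomic facts, each expressed as a (subject, relation, object) triple with an importance weight, and computes the weighted share of facts found in a knowledge graph. The exact formula used in the paper is not stated here, so the weighting scheme, the exact-match lookup, and all names (`AtomicFact`, `weighted_factuality_score`, the toy triples) are assumptions made for illustration only.

```python
from dataclasses import dataclass

# A knowledge-graph edge as (subject, relation, object). Hypothetical schema.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class AtomicFact:
    """One atomic claim extracted from a generated answer (assumed form)."""
    triple: Triple
    weight: float = 1.0  # assumed importance weight, e.g. higher for safety-critical claims


def weighted_factuality_score(facts: list[AtomicFact], kg: set[Triple]) -> float:
    """Weighted fraction of atomic facts supported by the knowledge graph.

    Assumption: a fact counts as supported only if its triple appears
    verbatim in the graph; a real system would need entity linking and
    relation normalization before this lookup.
    """
    total = sum(f.weight for f in facts)
    if total == 0:
        return 0.0  # no scorable facts
    supported = sum(f.weight for f in facts if f.triple in kg)
    return supported / total


# Toy knowledge graph and toy extracted facts (illustrative data only).
kg = {
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "contraindicated_in", "severe renal impairment"),
}
facts = [
    AtomicFact(("metformin", "treats", "type 2 diabetes"), weight=1.0),
    AtomicFact(("metformin", "treats", "hypertension"), weight=2.0),  # not in the graph
    AtomicFact(("metformin", "contraindicated_in", "severe renal impairment"), weight=2.0),
]
score = weighted_factuality_score(facts, kg)
print(f"weighted factuality score: {score:.2f}")
```

In this toy case the supported facts carry weight 3.0 out of a total of 5.0, so the score is 0.60. Weighting lets an unsupported high-importance claim lower the score more than an unsupported minor one, which is one plausible reason to prefer a weighted score over a plain supported-fact ratio.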

Keywords

Knowledge graph, Large language models, Medical question answering

Cite This Paper

Jinghui Mao, Menglin Cui, Yuejia Dai. Reasoning over Knowledge Graphs: Enhancing LLMs for Trustworthy Medical Question Answering. Academic Journal of Computing & Information Science (2026), Vol. 9, Issue 5: 20-27. https://doi.org/10.25236/AJCIS.2026.090503.
