MCeT: Behavioral Model Correctness Evaluation using Large Language Models
Behavioral model diagrams, e.g., sequence diagrams, are an essential form of documentation, typically designed by system engineers from requirements documentation, either fully manually or with the help of design tools. With the growing use of Large Language Models (LLMs) as AI modeling assistants, more automation will be involved in generating diagrams, which necessitates the development of automatic model correctness evaluation tools. Such a tool can be used to evaluate both manually created and AI-generated models: it provides feedback to system engineers and enables AI assistants to self-evaluate and improve the models they generate.
In this paper, we propose MCeT, the first fully automated tool to evaluate the correctness of a behavioral model, in particular a sequence diagram, against its corresponding requirements text and produce a list of the issues the model has. We utilize LLMs for the correctness evaluation tasks, as they have shown outstanding natural language understanding ability. However, we show that directly asking an LLM to compare a diagram to its requirements finds fewer than 35% of the issues that experienced engineers can find. We therefore supplement the direct check with a fine-grained, multi-perspective approach: we split the diagram into atomic, indivisible interactions, and split the requirements text into atomic, self-contained items. We then compare the whole diagram with each atomic requirement, and each diagram atom with the full requirements text. We also propose a self-consistency checking approach that combines the perspectives to mitigate LLM-hallucinated issues. On a dataset of real requirements, the combined approach improves precision over the direct approach from 0.58 to 0.81. Moreover, it finds 90% more of the issues that experienced engineers found than the direct approach does, and reports an average of 6 new issues per diagram.
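To make the pipeline concrete, here is a minimal sketch of how such a multi-perspective check could be wired up. It is an illustration only, not the authors' implementation: `call_llm`, `split_into_interactions`, the prompt wording, and the two-vote consistency threshold are all assumptions.

```python
# Hypothetical sketch of a multi-perspective correctness check in the spirit
# of MCeT. All names, prompts, and thresholds here are assumptions.
from collections import Counter

def call_llm(prompt: str) -> list[str]:
    # Stub: replace with a real LLM client that returns a list of issue strings.
    return []

def split_into_interactions(diagram: str) -> list[str]:
    # Naive splitter: treat each "A -> B: message" line as one atomic interaction.
    return [line for line in diagram.splitlines() if "->" in line]

def find_issues(diagram: str, requirements: list[str]) -> list[str]:
    reqs_text = "\n".join(requirements)
    issues: list[str] = []

    # Perspective 1: direct check of the whole diagram against all requirements.
    issues += call_llm(
        "Compare this sequence diagram to the requirements and list issues.\n"
        f"Diagram:\n{diagram}\nRequirements:\n{reqs_text}"
    )

    # Perspective 2: the whole diagram against each atomic requirement item.
    for req in requirements:
        issues += call_llm(
            "Is this requirement correctly reflected in the diagram? List issues.\n"
            f"Requirement: {req}\nDiagram:\n{diagram}"
        )

    # Perspective 3: each atomic diagram interaction against all requirements.
    for interaction in split_into_interactions(diagram):
        issues += call_llm(
            "Is this interaction justified by the requirements? List issues.\n"
            f"Interaction: {interaction}\nRequirements:\n{reqs_text}"
        )

    # Self-consistency: keep only issues reported from more than one perspective,
    # discarding likely hallucinations flagged a single time. Exact-string
    # counting is a simplification; real issues would need semantic matching.
    counts = Counter(issues)
    return [issue for issue, n in counts.items() if n >= 2]
```

The key design point the abstract describes is that no single perspective is trusted on its own: the atomic views boost recall, while the cross-perspective vote restores precision by filtering hallucinated findings.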
Wed 8 Oct (displayed time zone: Eastern Time, US & Canada)
14:00 - 15:30 | Session 3: Large Language Models and Modeling | Research Papers / New Ideas and Emerging Results (NIER) at DCIH 102 | Chair(s): Bentley Oakes (Polytechnique Montréal) | Hybrid
14:00 | 18m | Talk | MCeT: Behavioral Model Correctness Evaluation using Large Language Models | Research Papers | Khaled Ahmed (Huawei Research Canada, University of British Columbia (UBC)), Jialing Song (Huawei Technologies Canada), Boqi Chen (McGill University), Ou Wei (Huawei Technologies Canada), Bingzhou Zheng (Huawei Technologies Canada) | Pre-print
14:18 | 18m | Talk | Model-Driven Quantum Code Generation Using Large Language Models and Retrieval-Augmented Generation | New Ideas and Emerging Results (NIER) | Nazanin Siavash (University of Colorado Colorado Springs (UCCS)), Armin Moin (University of Colorado Colorado Springs)
14:36 | 18m | Talk | Towards LLM-enhanced Conflict Detection and Resolution in Model Versioning | New Ideas and Emerging Results (NIER) | Martin Eisenberg (Johannes Kepler University, Linz), Stefan Klikovits (Johannes Kepler University, Linz), Manuel Wimmer (JKU Linz), Konrad Wieland (LieberLieber Software GmbH)
14:54 | 18m | Talk | SHERPA: A Model-Driven Framework for Large Language Model Execution | Research Papers | Boqi Chen (McGill University), Kua Chen (McGill University), José Antonio Hernández López (Department of Computer Science and Systems, University of Murcia), Gunter Mussbacher (McGill University), Daniel Varro (Linköping University / McGill University), Amir Feizpour (Aggregate Intellect) | Pre-print
15:12 | 18m | Talk | Accurate and Consistent Graph Model Generation from Text with Large Language Models | Research Papers | Boqi Chen (McGill University), Ou Wei (Huawei Technologies Canada), Bingzhou Zheng (Huawei Technologies Canada), Gunter Mussbacher (McGill University) | Pre-print