A Model Cleansing Pipeline for Model-driven Engineering: Mitigating the Garbage In, Garbage Out Problem for Open Model Repositories
FT
In data-driven research within Model-Driven Engineering (MDE), extracting conceptual models, such as UML diagrams, from software repositories is a crucial step for analyzing software design, evolution, and quality. However, because most model repositories are not curated, the extracted models often contain inconsistencies, redundancies, and noise. Without effective data cleansing, the reliability of empirical and machine learning (ML)-based MDE studies that rely on these repositories is seriously threatened. This paper proposes MCP4CM, a data cleansing pipeline for model repositories. Our approach systematically addresses common data quality issues through a sequence of automated pre-processing, validation, and filtering steps based on rule-based heuristics and ML techniques. By integrating conceptual modeling-specific data cleansing techniques into an automated pipeline, the approach reduces manual intervention, enhances reproducibility, and supports scalable analysis of model repositories. In an experimental evaluation on open-source UML diagram repositories, we demonstrate the effectiveness of our method in cleansing models. In two reproducibility studies, we further show that using the MCP4CM pipeline has a statistically significant effect on downstream tasks.
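To make the idea of a staged cleansing pipeline concrete, the following is a minimal sketch of how pre-processing, validation, and filtering steps could be chained. It is not the MCP4CM implementation: the `Model` record, the placeholder-name list, the minimum-size threshold, and the duplicate criterion are all assumptions chosen for illustration, and the ML-based steps mentioned in the abstract are omitted.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Model:
    """Hypothetical stand-in for an extracted UML model: an id plus element names."""
    model_id: str
    element_names: Tuple[str, ...]


# Assumed list of placeholder names; a real pipeline would need a curated list.
DUMMY_NAMES = frozenset({"class1", "class2", "foo", "bar", "test"})


def preprocess(model: Model) -> Model:
    """Pre-processing: collapse whitespace, lowercase, and drop empty names."""
    names = tuple(" ".join(n.split()).lower() for n in model.element_names)
    return Model(model.model_id, tuple(n for n in names if n))


def is_valid(model: Model, min_elements: int = 2) -> bool:
    """Validation: require a minimum number of named elements (assumed threshold)."""
    return len(model.element_names) >= min_elements


def is_not_dummy(model: Model) -> bool:
    """Rule-based heuristic: reject models whose names are all placeholders."""
    return not all(n in DUMMY_NAMES for n in model.element_names)


def cleanse(models: Iterable[Model]) -> List[Model]:
    """Run pre-processing, validation, and filtering, keeping first occurrences."""
    seen = set()
    kept = []
    for raw in models:
        model = preprocess(raw)
        if not is_valid(model) or not is_not_dummy(model):
            continue
        # Filtering: treat models with the same set of element names as duplicates.
        key = frozenset(model.element_names)
        if key in seen:
            continue
        seen.add(key)
        kept.append(model)
    return kept


if __name__ == "__main__":
    repository = [
        Model("a", ("Order ", "Customer")),
        Model("b", ("order", "customer")),   # duplicate of "a" after normalization
        Model("c", ("Class1", "Class2")),    # placeholder names only
        Model("d", ("Invoice",)),            # too small
    ]
    for model in cleanse(repository):
        print(model.model_id, model.element_names)
```

Keeping each step as a small, pure function means individual rules can be swapped or reordered without touching the others, which is one way such a pipeline could support the reproducibility the abstract emphasizes.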