The price of precision: the cost of preprocessing for automated code revision in code review
Additional information
Authors
Pirouzkhah S.,
Rani P.,
Sovrano F.,
Hellendoorn V.,
Bacchelli A.
Type
Article published in a scientific journal
Year
2025
Language
English
Abstract
Code review is a widespread practice in software engineering during which developers examine each other’s source code changes to identify potential issues and improve code quality. Among the automated techniques proposed by researchers to reduce the manual workload of code review, Automated Code Revision (ACR) aims to automatically address reviewers’ feedback by producing a revised version of the code. Transformer-based language models have demonstrated state-of-the-art results in ACR. The performance of these models, however, is significantly influenced by the quality and preparation of the training and evaluation data. We present several systematic analyses of prevalent preprocessing steps, examined both cumulatively and in isolation, across three established preprocessing pipelines and two dataset splitting strategies (time-level vs. project-level). Our study spans models of different scales: OpenNMT (small), T5 and CodeReviewer (mid-sized), LoRA-tuned CodeLLaMA-7B (large), and GPT-3.5-Turbo (large, black-box). Using datasets of up to 496k training records, we evaluate and statistically compare the models’ performance using exact match ratio (EXM), CodeBLEU, and Levenshtein ratio. Our findings show that preprocessing may be a significant component in the success of the different techniques: OpenNMT relies on heavy preprocessing; T5 benefits from light filtering (selective removal of records); CodeReviewer performs best when trained on larger, less aggressively filtered data; CodeLLaMA-7B and GPT-3.5-Turbo are largely indifferent to preprocessing. Overall, the effectiveness of ACR tools depends on aligning preprocessing with model scale and training setup. In general, small models need abstraction, mid-sized ones benefit from light filtering, and large-scale models perform best when trained on the original, unprocessed form of the code.
Journal
Empirical Software Engineering
Volume
31
Issue (Month)
2
Pages (or article number)
47
ISSN
1382-3256, 1573-7616
Distribution
License
License not defined
Visibility
Private