Binary token-level classification with DeBERTa for all-type MWE identification
a lightweight approach with linguistic enhancement

Additional information

Type
Conference proceedings paper
Year
2026
Language
English
Abstract
We present a comprehensive approach to multiword expression (MWE) identification that combines binary token-level classification, linguistic feature integration, and data augmentation. Our DeBERTa-v3-large model achieves 69.8% F1 on the CoAM dataset, surpassing the previous best result on this dataset (Qwen-72B, 57.8% F1) by 12 points while using 165x fewer parameters. We achieve this performance by (1) reformulating detection as binary token-level START/END/INSIDE classification rather than span-based prediction, (2) incorporating NP chunking and dependency features that help identify discontinuous and NOUN-type MWEs, and (3) applying oversampling to address severe class imbalance in the training data. We confirm that the method generalizes on the STREUSLE dataset, where it achieves 78.9% F1. These results demonstrate that carefully designed smaller models can substantially outperform LLMs on structured NLP tasks, with important implications for resource-constrained deployments.
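The START/END/INSIDE reformulation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mwe_to_binary_labels`, the input format (each MWE given as a list of token indices), and the choice to mark every member token as INSIDE are assumptions made here purely for illustration.

```python
from typing import Dict, List, Sequence


def mwe_to_binary_labels(
    n_tokens: int, mwes: Sequence[Sequence[int]]
) -> Dict[str, List[int]]:
    """Encode MWEs as three independent binary label vectors.

    Hypothetical scheme (an assumption, not taken from the paper):
    START marks the first token of each MWE, END marks the last,
    and INSIDE marks every token that belongs to the MWE.
    """
    labels = {name: [0] * n_tokens for name in ("START", "END", "INSIDE")}
    for mwe in mwes:
        if not mwe:
            continue  # skip empty annotations
        idx = sorted(mwe)
        labels["START"][idx[0]] = 1
        labels["END"][idx[-1]] = 1
        for i in idx:
            labels["INSIDE"][i] = 1
    return labels


# Discontinuous MWE "looked ... up" at token positions 1 and 4.
tokens = ["She", "looked", "the", "word", "up", "quickly"]
labels = mwe_to_binary_labels(len(tokens), [[1, 4]])
```

Under this assumed encoding, the gap tokens "the" and "word" stay at 0 in all three vectors, so a discontinuous expression remains distinguishable from a contiguous one.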
Keywords
MWE, NLP, Token Classification
Proceedings title
Findings of the Association for Computational Linguistics (EACL 2026)
Conference name
EACL 2026 – 17th Conference of the European Chapter of the Association for Computational Linguistics
Conference location
Rabat, Morocco
Conference date
March 24–29, 2026

Dissemination

License
CC BY
Visibility
Public
Open access status
Gold