Binary token-level classification with DeBERTa for all-type MWE identification
a lightweight approach with linguistic enhancement
Additional information
Authors
Type
Article in conference proceedings
Year
2026
Language
English
Abstract
We present a comprehensive approach for multiword expression (MWE) identification that combines binary token-level classification, linguistic feature integration, and data augmentation. Our DeBERTa-v3-large model achieves 69.8% F1 on the CoAM dataset, surpassing the best result on this dataset (Qwen-72B, 57.8% F1) by 12 points while using 165x fewer parameters. We achieve this performance by (1) reformulating detection as binary token-level START/END/INSIDE classification rather than span-based prediction, (2) incorporating NP chunking and dependency features that help identify discontinuous and NOUN-type MWEs, and (3) applying oversampling to address severe class imbalance in the training data. We confirm that our method generalizes to the STREUSLE dataset, where it achieves 78.9% F1. These results demonstrate that carefully designed smaller models can substantially outperform LLMs on structured NLP tasks, with important implications for resource-constrained deployments.
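As an illustration of the labeling idea in the abstract, the following minimal Python sketch turns MWE annotations into three independent binary label vectors per sentence and oversamples MWE-bearing sentences. The abstract does not specify the exact encoding or the oversampling unit, so everything here is an assumption: the function names `mwes_to_binary_labels` and `oversample` are hypothetical, INSIDE is assumed to mark every member token of an MWE, and oversampling is assumed to duplicate whole sentences that contain at least one MWE.

```python
from typing import Dict, List, Sequence, Tuple

# One example = (tokens, list of MWEs), each MWE given as token indices.
Example = Tuple[List[str], List[List[int]]]


def mwes_to_binary_labels(n_tokens: int,
                          mwes: Sequence[Sequence[int]]) -> Dict[str, List[int]]:
    """Build three independent 0/1 vectors (START, END, INSIDE), one entry per token.

    Assumed encoding (not taken from the paper): START marks the first token
    of an MWE, END marks its last token, INSIDE marks every member token.
    Tokens in the gap of a discontinuous MWE stay 0 in all three vectors.
    """
    labels = {name: [0] * n_tokens for name in ("START", "END", "INSIDE")}
    for mwe in mwes:
        idx = sorted(set(mwe))
        if not idx:
            continue  # skip empty annotations
        if idx[0] < 0 or idx[-1] >= n_tokens:
            raise ValueError(f"MWE indices {idx} fall outside the sentence")
        labels["START"][idx[0]] = 1
        labels["END"][idx[-1]] = 1
        for i in idx:
            labels["INSIDE"][i] = 1
    return labels


def oversample(examples: Sequence[Example], factor: int) -> List[Example]:
    """Repeat each sentence that contains at least one MWE `factor` times.

    Sentences without MWEs are kept once. This is one simple way to reduce
    class imbalance; the paper's actual strategy may differ.
    """
    out: List[Example] = []
    for tokens, mwes in examples:
        out.extend([(tokens, mwes)] * (factor if mwes else 1))
    return out


# Discontinuous MWE "looked ... up" covers tokens 1 and 4.
tokens = ["She", "looked", "the", "word", "up"]
labels = mwes_to_binary_labels(len(tokens), [[1, 4]])
for name, vector in labels.items():
    print(name, vector)
```

Because the three labels are predicted independently for each token, a discontinuous expression needs no special span representation under this assumed encoding: the intervening tokens simply receive 0 for all three labels.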
Keywords
MWE, NLP, Token Classification
Conference proceedings
Findings of the Association for Computational Linguistics (EACL 2026)
Meeting name
EACL 2026 – 17th Conference of the European Chapter of the Association for Computational Linguistics
Meeting place
Rabat, Morocco
Meeting date
March 24–29, 2026
Diffusion
License
CC BY
Visibility
Public
Status open access
Gold