In-Person Poster presentation / poster accept

CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

Nadezhda Chirkova · Sergei Troshin

MH1-2-3-4 #54

Keywords: [ Applications ] [ tokenization ] [ byte-pair encoding ] [ source code processing ]


Abstract:

Recent works have widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures to source code. This work investigates another important aspect of such models, the effect of different subtokenization options, and aims at identifying the most effective and length-efficient subtokenizations, taking into account source code specifics. We propose a subtokenization that reduces average sequence length by 17--40% without a downstream performance drop, and show that a carefully chosen subtokenization may improve quality by 0.5--2%, possibly with some length increase.
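As background for the abstract, the sketch below illustrates how a byte-pair encoding (BPE) subtokenizer can be trained on code and how average subtoken count per snippet, the length-efficiency measure referenced above, can be compared across vocabulary sizes. It uses the HuggingFace `tokenizers` library; the toy corpus, vocabulary sizes, and whitespace pre-tokenization are illustrative assumptions, not the configuration used in the paper.

```python
# Illustrative sketch only: trains a BPE subtokenizer on a tiny code corpus and
# reports the average number of subtokens per snippet. The corpus, vocabulary
# sizes, and pre-tokenization choices are assumptions, not the paper's setup.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

code_corpus = [
    "def add(a, b):\n    return a + b",
    "for i in range(10):\n    print(i * i)",
    "class Stack:\n    def __init__(self):\n        self.items = []",
]

def train_bpe(vocab_size):
    """Train a BPE subtokenizer on the toy corpus."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation first
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(code_corpus, trainer=trainer)
    return tokenizer

def avg_length(tokenizer):
    """Average subtoken count per snippet: a proxy for length efficiency."""
    lengths = [len(tokenizer.encode(snippet).tokens) for snippet in code_corpus]
    return sum(lengths) / len(lengths)

# Larger vocabularies merge more character pairs, so sequences get shorter.
for vocab_size in (100, 300):
    print(vocab_size, avg_length(train_bpe(vocab_size)))
```

Comparing average encoded length across tokenizer variants in this way is one simple means of quantifying the length-efficiency trade-off that the paper studies at scale.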
