arXiv:2502.04465 [cs.LG]

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Luca Della Libera, Francesco Paissan, Cem Subakan, Mirco Ravanelli

Published 2025-02-06 (Version 1)

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations: high bitrates, the loss of either semantic or acoustic information, and a reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that uses a single binary codebook to compress speech at bitrates between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code, and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.
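To see how a single binary codebook can yield the stated bitrate range: with D bits per token, the implicit codebook has 2^D entries, and the bitrate is simply bits per token times tokens per second. The sketch below is a minimal illustration, not the authors' implementation; the 13-bit code width and the 12.5-50 Hz token rates are hypothetical values, chosen only because they reproduce the 0.16-0.65 kbps range quoted in the abstract, and sign-based quantization stands in for whatever quantizer the paper actually uses.

```python
# Illustrative sketch (assumptions, not the paper's spec): a binary codebook
# via sign quantization, plus the bitrate arithmetic behind 0.16-0.65 kbps.
import numpy as np

def binary_quantize(latents: np.ndarray) -> np.ndarray:
    """Map each D-dimensional latent frame to a D-bit binary code by sign.

    With D bits per frame, the implicit codebook has 2**D entries, yet no
    explicit codebook lookup is needed: every corner of the {0,1}^D
    hypercube is a code word.
    """
    return (latents > 0).astype(np.uint8)  # shape: (num_frames, D)

def bitrate_kbps(bits_per_token: int, token_rate_hz: float) -> float:
    """Bitrate of a token stream: bits per token times tokens per second."""
    return bits_per_token * token_rate_hz / 1000.0

# A hypothetical 13-bit code (codebook size 2**13 = 8192) matches the
# abstract's bitrate range at these assumed token rates:
for rate in (12.5, 25.0, 50.0):
    print(f"{rate:>5} Hz -> {bitrate_kbps(13, rate):.3f} kbps")
# 12.5 Hz -> 0.162 kbps, 25.0 Hz -> 0.325 kbps, 50.0 Hz -> 0.650 kbps

# Toy usage: quantize random latent frames and pack bits into token ids.
frames = np.random.randn(4, 13)            # 4 frames, 13-dim latents
codes = binary_quantize(frames)            # 13 bits per frame
indices = codes.dot(2 ** np.arange(13))    # integer ids in [0, 8191]
print(indices)
```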
