UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception
:::info Stub: Full Engineering Breakdown Coming This paper was auto-fetched from arXiv on 2026-06-01. A full breakdown with production viability rating, implementation notes, and honest limitations is being written. :::
| Authors | Yuhan Song et al. |
| Year | 2026 |
| Field | NLP |
| arXiv | 2605.31521 |
| Categories | cs.CL, cs.SD |
Abstract
Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.
Engineering Breakdown
The Problem
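Semantic speech tokenizers are a widely used interface for Audio-LLMs because of their compact single-codebook design and strong linguistic alignment. Their focus on linguistic abstraction, however, induces what the authors call acoustic blindness: non-linguistic information is lost, which limits their applicability beyond speech-centric tasks.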
The Approach
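UniAudio-Token is a framework that gives semantic tokenizers general audio perception without compromising speech ability. Rather than altering the semantic paradigm, it mitigates the paradigm's information loss in two ways: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers.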
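The abstract describes SAE only as a content-aware gate that restores acoustic detail from shallow layers, so the exact formulation is not known from this page. As a rough illustration of what a gate of that kind could look like, the sketch below computes a scalar gate from the deep (semantic) features of one frame and uses it to scale how much of the shallow (acoustic) features is added back. The function name `gated_fusion`, the additive form `deep + g * shallow`, the single linear gate, and all numbers are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping a real number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(deep, shallow, gate_weights, gate_bias):
    """Fuse one frame of deep (semantic) and shallow (acoustic) features.

    Hypothetical form, NOT the paper's definition:
        g   = sigmoid(w . deep + b)
        out = deep + g * shallow
    The gate depends only on the deep features, i.e. on the content.
    """
    if not (len(deep) == len(shallow) == len(gate_weights)):
        raise ValueError("deep, shallow and gate_weights must have equal length")
    score = sum(w * d for w, d in zip(gate_weights, deep)) + gate_bias
    gate = sigmoid(score)
    fused = [d + gate * s for d, s in zip(deep, shallow)]
    return fused, gate


# Toy frame with made-up numbers, purely for illustration.
deep_frame = [0.5, -1.0, 2.0]
shallow_frame = [0.1, 0.3, -0.2]
weights = [1.0, 0.0, 0.5]

fused, gate = gated_fusion(deep_frame, shallow_frame, weights, gate_bias=0.0)
print(f"gate={gate:.3f}", [round(v, 3) for v in fused])
```

With a strongly negative bias the gate closes and the output stays close to the deep features; with a strongly positive bias nearly all of the shallow signal is added back. A trained system would presumably learn the gate parameters and apply the gate per frame or per channel, but those details are not stated in the abstract.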
Key Results
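According to the abstract, UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks. The abstract quotes no specific figures. The code, including training and inference scripts, and the model checkpoints are publicly released at https://github.com/Tencent/Universal_Audio_Tokenizer.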
Research Areas
This paper contributes to the following areas of AI/ML engineering:
- Large language models
- Transformers
- Text generation
- Natural language processing
- Language understanding
- UniAudio-Token
