UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

:::info Stub: Full Engineering Breakdown Coming. This paper was auto-fetched from arXiv on 2026-06-01. A full breakdown with production viability rating, implementation notes, and honest limitations is being written. :::


Authors	Yuhan Song et al.
Year	2026
Field	NLP
arXiv	2605.31521
Categories	cs.CL, cs.SD

Abstract

Semantic speech tokenizers have become a widely used interface for Audio-LLMs, owing to their compact single-codebook design and strong linguistic alignment. However, their focus on linguistic abstraction induces acoustic blindness, limiting their applicability beyond speech-centric tasks. We propose UniAudio-Token, a framework that empowers semantic tokenizers with general audio perception without compromising speech ability. Instead of altering the semantic paradigm, UniAudio-Token mitigates its information loss through two key innovations: (1) Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives; and (2) Semantic-Acoustic Equilibrium (SAE) introduces a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers. Extensive evaluations show that UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks, effectively serving as a unified audio interface. We publicly release all our code, including training and inference scripts, together with the model checkpoints at https://github.com/Tencent/Universal_Audio_Tokenizer.

Engineering Breakdown

The Problem

Semantic speech tokenizers focus on linguistic abstraction, which induces acoustic blindness and limits their applicability beyond speech-centric tasks.

The Approach

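UniAudio-Token is a framework that gives semantic tokenizers general audio perception without compromising speech ability. Rather than altering the semantic paradigm, it mitigates the paradigm's information loss in two ways: Semantic-Acoustic Primitives (SAP) provide structured supervision by decomposing audio into linguistic content, vocal attributes, and auditory-scene primitives, and Semantic-Acoustic Equilibrium (SAE) adds a content-aware gating mechanism that adaptively restores fine-grained acoustic details from shallow layers.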
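
The content-aware gating that the abstract attributes to SAE can be illustrated with a minimal sketch. The abstract does not give the formulation, so everything here is an assumption: the function name `gated_fusion`, the scalar per-frame gate computed from the deep (semantic) features, and the residual form `deep + gate * shallow` are hypothetical choices, not the authors' method. The sketch only shows the general idea of letting content features decide how much shallow-layer acoustic detail is restored.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(
    deep: Sequence[Vector],
    shallow: Sequence[Vector],
    gate_weights: Vector,
    gate_bias: float = 0.0,
) -> List[Vector]:
    """Fuse deep (semantic) and shallow (acoustic) frame features.

    Hypothetical formulation, not taken from the paper:
        g_t   = sigmoid(w . deep_t + b)    # scalar gate from content features
        out_t = deep_t + g_t * shallow_t   # residual injection of acoustic detail
    """
    if len(deep) != len(shallow):
        raise ValueError("deep and shallow must have the same number of frames")
    fused: List[Vector] = []
    for d, s in zip(deep, shallow):
        if not (len(d) == len(s) == len(gate_weights)):
            raise ValueError("feature dimensions must match")
        # The gate depends only on the deep features, i.e. on content.
        gate = sigmoid(sum(w * x for w, x in zip(gate_weights, d)) + gate_bias)
        fused.append([di + gate * si for di, si in zip(d, s)])
    return fused


if __name__ == "__main__":
    deep_frames = [[1.0, 2.0]]
    shallow_frames = [[2.0, 4.0]]
    # Zero weights and zero bias give a gate of exactly 0.5.
    print(gated_fusion(deep_frames, shallow_frames, [0.0, 0.0]))
```

In this sketch a strongly negative gate leaves the semantic features nearly untouched, while a gate near 1 adds the shallow features in full. In a trained model the gate parameters would be learned, which is omitted here.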

Key Results

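According to the abstract, UniAudio-Token learns comprehensive universal representations while preserving high-fidelity speech generation. When integrated with downstream LLMs, it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks. The abstract gives no numeric results. The code, including training and inference scripts, and the model checkpoints are released at https://github.com/Tencent/Universal_Audio_Tokenizer.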

Research Areas

This paper contributes to the following areas of AI/ML engineering:

Large language models
Transformers
Text generation
Natural language processing
Language understanding
UniAudio-Token
