Add VibeVoice Acoustic Tokenizer #43400

Merged

ebezzam merged 41 commits into huggingface:main from ebezzam:vibevoice_acoustic_tokenizer
Feb 6, 2026

Conversation

ebezzam (Contributor) commented Jan 22, 2026

What does this PR do?

Splits off the acoustic tokenizer from #40546, so that VibeVoice ASR can be done in a separate, independent PR.

Model card: https://huggingface.co/bezzam/VibeVoice-AcousticTokenizer

cc @eustlb

ebezzam (Author) commented Jan 22, 2026

run-slow: vibevoice_acoustic_tokenizer

github-actions bot

This comment contains run-slow, running the specified jobs:

models: ["models/vibevoice_acoustic_tokenizer"]
quantizations: []

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions bot

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • vibevoice_acoustic_tokenizer:
    tests/models/vibevoice_acoustic_tokenizer/test_modeling_vibevoice_acoustic_tokenizer.py::VibeVoiceAcousticTokenizerIntegrationTest::test_batch_integration

ebezzam (Author) commented Jan 22, 2026

run-slow: vibevoice_acoustic_tokenizer

github-actions bot

This comment contains run-slow, running the specified jobs:

models: ["models/vibevoice_acoustic_tokenizer"]
quantizations: []

github-actions bot

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

ebezzam (Author) commented Jan 22, 2026

run-slow: vibevoice_acoustic_tokenizer

github-actions bot

This comment contains run-slow, running the specified jobs:

models: ["models/vibevoice_acoustic_tokenizer"]
quantizations: []

github-actions bot

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • vibevoice_acoustic_tokenizer:
    tests/models/vibevoice_acoustic_tokenizer/test_modeling_vibevoice_acoustic_tokenizer.py::VibeVoiceAcousticTokenizerIntegrationTest::test_batch_integration

ebezzam (Author) commented Jan 22, 2026

run-slow: vibevoice_acoustic_tokenizer

github-actions bot

This comment contains run-slow, running the specified jobs:

models: ["models/vibevoice_acoustic_tokenizer"]
quantizations: []

github-actions bot

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

ebezzam (Author) left a comment

@eustlb a self-review with pointers to related discussions on the tokenizer from the (original) main model PR!


One key feature of VibeVoice is the use of two continuous speech tokenizers, one for extracting acoustic features and another for semantic features.

A model checkpoint is available at [bezzam/VibeVoice-AcousticTokenizer](https://huggingface.co/bezzam/VibeVoice-AcousticTokenizer)
ebezzam (Author), Jan 22, 2026:
TODO update to official, current draft: https://huggingface.co/bezzam/VibeVoice-AcousticTokenizer

from transformers.audio_utils import load_audio_librosa


model_id = "bezzam/VibeVoice-AcousticTokenizer"
ebezzam (Author), Jan 22, 2026:

TODO update to official


@can_return_tuple
@auto_docstring
def sample(self, latents):
ebezzam (Author):

Wdyt of this new method? Related to discussion here
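The `sample` method under discussion appears to be VAE-style latent sampling (the commit history mentions "vae sampling"). As a rough, hypothetical sketch of the reparameterization trick such a method typically uses (the actual signature, scale handling, and framework in the PR differ — this is a numpy stand-in):

```python
import numpy as np

def sample(latents: np.ndarray, rng: np.random.Generator, std: float = 1.0) -> np.ndarray:
    """Reparameterized sampling around predicted latents.

    Hypothetical sketch: treats `latents` as the mean of a Gaussian with a
    fixed scale `std`, and draws z = mean + std * eps with eps ~ N(0, I).
    """
    eps = rng.standard_normal(latents.shape).astype(latents.dtype)
    return latents + std * eps

latents = np.zeros((2, 4), dtype=np.float32)
z = sample(latents, np.random.default_rng(0), std=0.5)
print(z.shape)  # (2, 4)
```

With `std=0.0` this degenerates to returning the mean, which is why implementations often gate sampling behind a flag for deterministic inference.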

return hidden_states


class VibeVoiceAcousticTokenizerDecoder(nn.Module):
ebezzam (Author):

Wdyt of refactoring of Decoder (and Encoder)? Related to this discussion


# Ensure torch tensors and mono
for idx, example in enumerate(audio):
    example = torch.tensor(example, dtype=torch.float32)
ebezzam (Author):

Direct casting to torch tensors. Related to this discussion

FYI I moved the feature extractor to the tokenizer, as it actually makes more sense there (it's needed by the tokenizer rather than by the main model, which only needs it because of the tokenizer)
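The normalization being discussed — casting each raw example to float32 and ensuring mono — can be sketched as follows (a numpy stand-in for the torch-based casting in the PR; the channel-averaging downmix rule is an assumption for illustration, not taken from the PR):

```python
import numpy as np

def to_mono_float32(example) -> np.ndarray:
    """Cast one audio example to float32 and downmix multi-channel to mono.

    Illustrative stand-in for the torch casting in the PR; accepts a list or
    array shaped (samples,) or (channels, samples).
    """
    x = np.asarray(example, dtype=np.float32)
    if x.ndim == 2:  # (channels, samples) -> average channels to mono
        x = x.mean(axis=0)
    return x

stereo = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
print(to_mono_float32(stereo))  # [0.5 0.5 0. ]
```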

updated_state_dict = {}

for key, value in state_dict.items():
    new_key = key
Contributor:

Note that ideally we would prefer a key mapping, which would be much cleaner and clearer, but that's OK for now.

ebezzam (Author):

OK, I'll do that for the next models!

ebezzam and others added 6 commits January 22, 2026 19:59
…ibevoice_acoustic_tokenizer.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
…ibevoice_acoustic_tokenizer.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
…ibevoice_acoustic_tokenizer.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
…ibevoice_acoustic_tokenizer.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
@ebezzam ebezzam mentioned this pull request Jan 30, 2026
6 tasks
ebezzam (Author) commented Feb 3, 2026

run-slow: vibevoice_acoustic_tokenizer

github-actions bot commented Feb 3, 2026

This comment contains run-slow, running the specified jobs:

models: ["models/vibevoice_acoustic_tokenizer"]
quantizations: []

github-actions bot commented Feb 3, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 9b208eeb merge commit
PR 9eb54f3c branch commit
main b6a202f8 base commit

✅ No failing test specific to this PR 🎉 👏 !

ebezzam (Author) commented Feb 3, 2026

run-slow: vibevoice_acoustic_tokenizer

github-actions bot commented Feb 3, 2026

This comment contains run-slow, running the specified jobs:

models: ["models/vibevoice_acoustic_tokenizer"]
quantizations: []

Comment on lines +43 to +45
audio: torch.FloatTensor | None = None
latents: torch.FloatTensor | None = None
padding_cache: Optional["VibeVoiceAcousticTokenizerConv1dPaddingCache"] = None
ebezzam (Author):

do we compute a loss for training?

  • Encodec doesn't have one
  • DAC "has" one, but not on the decoder output
  • Xcodec doesn't have one

Contributor:

if there is no obvious way to do it, we can leave it for later :)

github-actions bot commented Feb 3, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN a40bccef merge commit
PR 1465b606 branch commit
main 36ec3bfa base commit

✅ No failing test specific to this PR 🎉 👏 !

eustlb (Contributor) left a comment

Nice work! Thanks a lot for iterating and being patient with my reviews 🤗
ready for a final review @ArthurZucker

Comment on lines +470 to +476
if use_cache and padding_cache is None:
    padding_cache = VibeVoiceAcousticTokenizerConv1dPaddingCache(
        num_layers=self.encoder.num_conv_layers,
        per_layer_padding=self.encoder.per_conv_layer_padding,
        per_layer_padding_mode=self.encoder.per_conv_layer_padding_mode,
        per_layer_in_channels=self.encoder.per_conv_layer_in_channels,
    )
Contributor:

agree!
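The pattern in this hunk — build the padding cache lazily when `use_cache` is set and no cache was passed in — is generic; a minimal hypothetical version (class name, layer count, and signature are illustrative, not the PR's API):

```python
class PaddingCache:
    """Hypothetical stand-in for the per-layer padding cache in the PR."""
    def __init__(self, num_layers: int):
        self.states = [None] * num_layers  # one cached context per conv layer

def forward(x, use_cache: bool = False, cache: "PaddingCache | None" = None):
    # Create the cache lazily, only on the first call that needs it;
    # a caller-supplied cache is reused as-is.
    if use_cache and cache is None:
        cache = PaddingCache(num_layers=3)
    # ... run the layers, reading/updating `cache.states` per layer ...
    return x, cache

_, cache = forward([1.0], use_cache=True)
print(cache is not None)  # True
```

The upside over eager construction is that callers who never stream pay nothing, and repeated calls can thread the same cache object through.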

Comment on lines +66 to +71
self,
audio: AudioInput,
sampling_rate: int | None = None,
padding: bool | str | PaddingStrategy | None = True,
pad_to_multiple_of: int | None = None,
return_attention_mask: bool | None = True,
Contributor:

yes ok for me

ArthurZucker (Collaborator) left a comment

Great work! LGTM. I am a bit bothered by the way the cache is handled, but this is something that will be refactored later on!

Main comment is whether we can find a way to put the stuff that the cache requires with the cache itself!

Otherwise LGTM 🤗

Comment on lines +298 to +313
# Parameters for cache creation
self.num_conv_layers = sum(depth + 1 for depth in config.depths) + 1
self.per_conv_layer_padding = [self.stem.conv.causal_padding]
self.per_conv_layer_in_channels = [self.stem.conv.conv.in_channels]
self.per_conv_layer_padding.extend([block.mixer.causal_padding for block in self.stem.stage])
self.per_conv_layer_in_channels.extend([block.mixer.conv.in_channels for block in self.stem.stage])

for layer in self.conv_layers:
    self.per_conv_layer_padding.append(layer.conv.causal_padding)
    self.per_conv_layer_in_channels.append(layer.conv.conv.in_channels)
    self.per_conv_layer_padding.extend([block.mixer.causal_padding for block in layer.stage])
    self.per_conv_layer_in_channels.extend([block.mixer.conv.in_channels for block in layer.stage])

self.per_conv_layer_padding.append(self.head.causal_padding)
self.per_conv_layer_in_channels.append(self.head.conv.in_channels)
self.per_conv_layer_padding_mode = ["constant" for _ in self.per_conv_layer_padding]
Collaborator:

this is TBH a bit weird to have!

Collaborator:

especially because it's completely unused

ebezzam (Author):

Good point, I've moved the creation of these variables to when the cache is created, like in Mimi: https://github.com/huggingface/transformers/blob/main/src/transformers/models/mimi/modeling_mimi.py#L1589-L1599

however, as discussed offline, we should think of a way to refactor MimiConv1dPaddingCache so it doesn't require so much overhead when creating it, cc @eustlb
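For intuition on what such a Conv1d padding cache buys you: a causal convolution left-pads each chunk with the tail of the previous chunk, so processing chunk-by-chunk reproduces a full-sequence pass. A single-layer, single-channel numpy sketch (an assumption-level illustration of the idea, not the PR's implementation):

```python
import numpy as np

def causal_conv1d(x, kernel, context=None):
    """Causal 1D convolution over one channel with a streaming context cache.

    `context` holds the last (len(kernel) - 1) samples of the previous chunk;
    when None (first chunk), zeros are used, i.e. constant padding mode.
    """
    pad = len(kernel) - 1
    left = np.zeros(pad, dtype=x.dtype) if context is None else context
    padded = np.concatenate([left, x])
    # np.convolve flips its second argument, so reversing the kernel gives
    # cross-correlation, matching conv-layer semantics.
    out = np.convolve(padded, kernel[::-1], mode="valid")
    new_context = padded[-pad:]  # cache the tail for the next chunk
    return out, new_context

kernel = np.array([0.25, 0.5, 0.25])
x = np.arange(8, dtype=np.float64)
full, _ = causal_conv1d(x, kernel)        # one-shot pass
a, ctx = causal_conv1d(x[:4], kernel)     # streaming, chunk 1
b, _ = causal_conv1d(x[4:], kernel, ctx)  # streaming, chunk 2
print(np.allclose(np.concatenate([a, b]), full))  # True
```

The per-layer padding, padding-mode, and in-channel lists in the PR generalize exactly this bookkeeping across a stack of conv layers.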

github-actions bot commented Feb 5, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, vibevoice_acoustic_tokenizer

@ebezzam ebezzam merged commit 281eeef into huggingface:main Feb 6, 2026
25 checks passed
jiosephlee pushed a commit to jiosephlee/transformers_latest that referenced this pull request Feb 11, 2026
* Add vibevoice tokenizer files.

* Address style tests.

* Revert to expected outputs previously computed on runner.

* Enable encoder output test.

* Update expected output from runner

* Add note on expected outputs

* remove code link and better init

* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

* modular

* Same changes to decoder layers.

* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>

* doc nits

* Use decoder_depths for decoder!

* Doc nits

* Nits

* Trim feature extraction for tensor only usage.

* Add cache logic to encoder.

* Nit

* Revert to previous sampling approach.

* Nits

* Better logic for vae sampling?

* More standard conversion script.

* Revert to sample flag

* Nits

* Docs, cleanup, nits.

* Nit

* Nit

* Skip parallelism

* Shift cache creation to when it's used.

* Updated checkpoint path

---------

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>