Add VibeVoice Acoustic Tokenizer (#43400)
Conversation
run-slow: vibevoice_acoustic_tokenizer
This comment contains models: ["models/vibevoice_acoustic_tokenizer"]
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
CI Results — Model CI Report: ❌ Failed tests
CI Results — ✅ No failing test specific to this PR 🎉!
One key feature of VibeVoice is the use of two continuous speech tokenizers: one for extracting acoustic features and another for semantic features.
A model checkpoint is available at [bezzam/VibeVoice-AcousticTokenizer](https://huggingface.co/bezzam/VibeVoice-AcousticTokenizer).
TODO update to official, current draft: https://huggingface.co/bezzam/VibeVoice-AcousticTokenizer
from transformers.audio_utils import load_audio_librosa
model_id = "bezzam/VibeVoice-AcousticTokenizer"
TODO update to official
@can_return_tuple
@auto_docstring
def sample(self, latents):
Wdyt of this new method? Related to discussion here
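For context, a VAE-style `sample` step typically splits the latents into means and log-variances and applies the reparameterization trick. The sketch below is a hypothetical, torch-free illustration of that idea; the function signature, the half/half split, and the toy list representation are assumptions, not the PR's actual implementation:

```python
import math
import random

def sample(latents, rng=None):
    """VAE-style reparameterization on a flat list: the first half of
    `latents` is treated as means, the second half as log-variances.
    Returns mean + std * eps with eps ~ N(0, 1), so sampling stays
    stochastic while the mean/variance remain differentiable in a real
    tensor implementation.
    """
    rng = rng or random.Random(0)
    half = len(latents) // 2
    means, log_vars = latents[:half], latents[half:]
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(means, log_vars)
    ]

# With log-variance -> -inf (std underflows to 0), sampling collapses
# to the mean, which makes the behavior easy to check deterministically.
out = sample([1.0, 2.0, -1e9, -1e9])  # -> [1.0, 2.0]
```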
return hidden_states

class VibeVoiceAcousticTokenizerDecoder(nn.Module):
Wdyt of refactoring of Decoder (and Encoder)? Related to this discussion
# Ensure torch tensors and mono
for idx, example in enumerate(audio):
    example = torch.tensor(example, dtype=torch.float32)
Direct casting to torch tensors. Related to this discussion
FYI I moved the feature extractor to the tokenizer, as it actually makes more sense here (needed by the tokenizer rather than the main model, which needs it because of the tokenizer)
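As a rough illustration of what "ensure torch tensors and mono" means, here is a hypothetical, torch-free sketch of the same preprocessing idea (the helper name and list-based audio representation are made up for illustration; the PR itself casts directly with `torch.tensor(example, dtype=torch.float32)`):

```python
def to_mono_float32(example):
    """Accept 1-D [samples] or 2-D [channels, samples] input, cast
    values to float, and average channels down to mono.

    This mirrors the intent of the preprocessing loop in the diff,
    without the torch dependency.
    """
    if example and isinstance(example[0], (list, tuple)):  # [channels, samples]
        n_channels = len(example)
        return [sum(ch[i] for ch in example) / n_channels
                for i in range(len(example[0]))]
    return [float(x) for x in example]

stereo = [[1, 3, 5], [3, 5, 7]]    # two channels, three samples
mono = to_mono_float32(stereo)     # -> [2.0, 4.0, 6.0]
```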
updated_state_dict = {}

for key, value in state_dict.items():
    new_key = key
Note: ideally we would prefer a key mapping, which is much cleaner and clearer, but that's OK for now.
OK, I'll do that for the next models!
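The key-mapping approach the reviewer prefers could look roughly like the following sketch (the regex patterns here are invented for illustration and do not correspond to the PR's real checkpoint keys):

```python
import re

# Declarative old-pattern -> new-pattern rules, applied in order.
# These patterns are hypothetical examples, not the real key names.
KEY_MAPPING = {
    r"^encoder\.stem\.": "encoder.conv_in.",
    r"\.mixer\.conv\.": ".conv.",
}

def convert_state_dict(state_dict):
    """Rename checkpoint keys by applying each regex rule in turn,
    instead of renaming keys imperatively inside the conversion loop.
    """
    updated = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in KEY_MAPPING.items():
            new_key = re.sub(pattern, replacement, new_key)
        updated[new_key] = value
    return updated

out = convert_state_dict({"encoder.stem.block.0.mixer.conv.weight": 1})
# -> {"encoder.conv_in.block.0.conv.weight": 1}
```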
audio: torch.FloatTensor | None = None
latents: torch.FloatTensor | None = None
padding_cache: Optional["VibeVoiceAcousticTokenizerConv1dPaddingCache"] = None

If there is no obvious way to do it, we can leave it for later :)
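For context, fields like these usually live in a `ModelOutput` subclass. Here is a dependency-free sketch of such a container using a plain dataclass (the class name is hypothetical, and `Any` stands in for `torch.FloatTensor` so the sketch stays torch-free):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class AcousticTokenizerOutput:
    """Sketch of a ModelOutput-style container for the fields above.

    In transformers this would subclass ModelOutput (adding tuple-like
    and dict-like access); a plain dataclass is used here so the sketch
    has no dependencies.
    """
    audio: Optional[Any] = None
    latents: Optional[Any] = None
    padding_cache: Optional[Any] = None

out = AcousticTokenizerOutput(latents=[0.1, 0.2])
```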
eustlb left a comment:
Nice work! Thanks a lot for iterating and being patient with my reviews 🤗
Ready for a final review @ArthurZucker
if use_cache and padding_cache is None:
    padding_cache = VibeVoiceAcousticTokenizerConv1dPaddingCache(
        num_layers=self.encoder.num_conv_layers,
        per_layer_padding=self.encoder.per_conv_layer_padding,
        per_layer_padding_mode=self.encoder.per_conv_layer_padding_mode,
        per_layer_in_channels=self.encoder.per_conv_layer_in_channels,
    )
self,
audio: AudioInput,
sampling_rate: int | None = None,
padding: bool | str | PaddingStrategy | None = True,
pad_to_multiple_of: int | None = None,
return_attention_mask: bool | None = True,
ArthurZucker left a comment:
Great work! LGTM. I am a bit bothered by the way the cache is handled, but this is something that will be refactored later on!
Main comment: can we find a way to put the stuff that the cache requires with the cache itself?
Otherwise LGTM 🤗
# Parameters for cache creation
self.num_conv_layers = sum(depth + 1 for depth in config.depths) + 1
self.per_conv_layer_padding = [self.stem.conv.causal_padding]
self.per_conv_layer_in_channels = [self.stem.conv.conv.in_channels]
self.per_conv_layer_padding.extend([block.mixer.causal_padding for block in self.stem.stage])
self.per_conv_layer_in_channels.extend([block.mixer.conv.in_channels for block in self.stem.stage])

for layer in self.conv_layers:
    self.per_conv_layer_padding.append(layer.conv.causal_padding)
    self.per_conv_layer_in_channels.append(layer.conv.conv.in_channels)
    self.per_conv_layer_padding.extend([block.mixer.causal_padding for block in layer.stage])
    self.per_conv_layer_in_channels.extend([block.mixer.conv.in_channels for block in layer.stage])

self.per_conv_layer_padding.append(self.head.causal_padding)
self.per_conv_layer_in_channels.append(self.head.conv.in_channels)
self.per_conv_layer_padding_mode = ["constant" for _ in self.per_conv_layer_padding]
This is, TBH, a bit weird to have!
Especially because it's completely unused.
Good point. I've moved the creation of these variables to when the cache is created, like in Mimi: https://github.com/huggingface/transformers/blob/main/src%2Ftransformers%2Fmodels%2Fmimi%2Fmodeling_mimi.py#L1589-L1599
However, as discussed offline, we should think of a way to refactor MimiConv1dPaddingCache so it doesn't require so much overhead when creating it, cc @eustlb
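As a side note, the `num_conv_layers` expression in the snippet above can be sanity-checked in isolation. The helper below just mirrors that arithmetic (which conv accounts for each `+1` — stage transition vs. stem/head — is my reading of the surrounding code, not something stated in the diff):

```python
def count_conv_layers(depths):
    """Mirror of `sum(depth + 1 for depth in config.depths) + 1`:
    each stage contributes `depth` block convs plus one extra conv,
    and there is one more conv on top of all stages.
    """
    return sum(depth + 1 for depth in depths) + 1

n = count_conv_layers([2, 2, 4])  # -> (3 + 3 + 5) + 1 = 12
```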
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, vibevoice_acoustic_tokenizer
* Add vibevoice tokenizer files.
* Address style tests.
* Revert to expected outputs previously computed on runner.
* Enable encoder output test.
* Update expected output from runner
* Add note on expected outputs
* remove code link and better init
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py (Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>)
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py (Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>)
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py (Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>)
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py (Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>)
* modular
* Same changes to decoder layers.
* Update src/transformers/models/vibevoice_acoustic_tokenizer/modular_vibevoice_acoustic_tokenizer.py (Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>)
* doc nits
* Use decoder_depths for decoder!
* Doc nits
* Nits
* Trim feature extraction for tensor only usage.
* Add cache logic to encoder.
* Nit
* Revert to previous sampling approach.
* Nits
* Better logic for vae sampling?
* More standard conversion script.
* Revert to sample flag
* Nits
* Docs, cleanup, nits.
* Nit
* Nit
* Skip parallelism
* Shift cache creation to when it's used.
* Updated checkpoint path

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
What does this PR do?
Splitting off the acoustic tokenizer from #40546, so that VibeVoice ASR can be done in a separate, independent PR.
Model card: https://huggingface.co/bezzam/VibeVoice-AcousticTokenizer
cc @eustlb