04 Mar 17:42

aad13b8

v5.3.0: EuroBERT, VibeVoice ASR, TimesFM2.5, PP-DocLayoutV2, OlmoHybrid, ModernVBert, Higgs Audio V2 Latest

Latest

New Model additions

EuroBERT

EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It supports a mixture of European and widely spoken languages, with sequences of up to 8192 tokens.

Links: Documentation | Paper | Blog Post

Add eurobert (#39455) by @ArthurZucker in #39455

VibeVoice ASR

VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.

Links: Documentation | Paper

Add VibeVoice ASR (#43625) by @ebezzam in #43625

TimesFM2.5

TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.

Links: Documentation | Paper

Timesfm 2.5 (#41763) by @kashif in #41763

PP-DocLayoutV2

PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.

Links: Documentation

[Model] Add PP-DocLayoutV2 Model Support (#43018) by @zhang-prog in #43018

OlmoHybrid

OLMo Hybrid is a hybrid architecture model from Ai2 that combines standard transformer attention layers with linear attention layers using the Gated Deltanet. This hybrid approach aims to improve efficiency while maintaining model quality by interleaving full attention layers with linear attention layers. The model uses a custom cache system that handles both KV cache for attention layers and recurrent state for linear attention layers.

Links: Documentation

Add OLMo Hybrid model (#43358) by @yanhong-lbh in #43358

ModernVBert

ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.

Links: Documentation | Paper

Add ModernVBERT models (#42504) by @paultltc in #42504

ColModernVBert

ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.

Links: Documentation | Paper

Add ModernVBERT models (#42504) by @paultltc in #42504

Higgs Audio V2

Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.

Links: Documentation

Add Higgs Audio V2 Model (#40294) by @szhengac in #40294

Higgs Audio V2 Tokenizer

The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.

Links: Documentation

Add Higgs Audio V2 Model (#40294) by @szhengac in #40294

Breaking changes

Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.

🚨 fix + tests dense & MoE TP all reduce (decoder only) (#43722) by @3outeille

The Ernie4.5 VL MoE model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.

🚨 [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299) by @vasqu

Several pipeline tasks have been removed or updated in the V5 cleanup (including question-answering, visual-question-answering, and image-to-image), requiring users to migrate to the replacement pipelines or updated task names.

🚨 More V5 pipeline cleanup (#43325) by @Rocketknight1

3D position IDs for vision-language models have been unified under a common interface (sourced from qwen2-vl), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.

🚨 Unify 3D position ids (#43972) by @zucchini-nlp

🚨 Tokenizer x vLLM fixes 🚨 :

Unigram tokenizers were missing the spm precompiled charsmap support. We ran an overall v4 vs v5 regression test and fixed what we had missed.

This was done in:

[vllm + v5 fix] handle TokenizersBackend fallback properly for v5 (#44255) by @itazap

Generation

Generation input preparation was significantly refactored to stop relying on cache_position and instead pass pre-sliced input_ids/inputs_embeds directly to prepare_inputs_for_generation, simplifying the generation loop and laying groundwork for broader cache_position removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.

[higgs-audio-v2] fix sampling (#44386) by @eustlb in [#44386]
fix(flaky): idefics generate cache flake (#44180) by @tarekziade in [#44180]
Fix generation integration tests (#44225) by @zucchini-nlp in [#44225]
[generate] Always pass full input_ids in prepare_inputs_for_generation (#44226) by @Cyrilvallez in [#44226]
fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201) by @tarekziade in [#44201]
[generate] Completely stop relying on cache_position to prepare inputs (#44130) by @Cyrilvallez in [#44130]
Simplify input preparation in generate (#44126) by @Cyrilvallez in [#44126]

Tokenization

Several tokenization bugs were fixed in this release, including resolving an AttributeError in `MLukeToken...

Contributors

kashif, tarekziade, and 58 other contributors

Assets 2

16 Feb 18:55

LysandreJik

v5.2.0

7d9754a

v5.2.0: GLM-5, Qwen3.5, Voxtral Realtime, VibeVoice Acoustic Tokenizer

New Model additions

VoxtralRealtime

VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription by processing audio in chunks as they arrive.

The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.

Add Voxtral Realtime (#43769) by @eustlb

GLM-5 - GlmMoeDsa

The zAI team launches GLM-5, and introduces it as such:

GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.

Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.

Add GlmMoeDsa (#43858) by @Cyrilvallez

Qwen3.5, Qwen3.5 Moe

The Qwen team launches Qwen 3.5, and introduces it as such:

We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.

Adding Support for Qwen3.5 (#43830) by @bozheng-hit

VibeVoice Acoustic Tokenizer

VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It's designed to capture the authentic conversational "vibe" and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.

One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.

Add VibeVoice Acoustic Tokenizer (#43400) by @ebezzam

Breaking changes

🚨 [Attn] New attn mask interface everywhere (#42848)
🚨 Modify ModernBERT's default attention implementation to stop using FA (#43764)

🚨 This one is quite breaking for super super super old modles: 🚨 🚨

fix: Prevent AutoTokenizer type mismatch from directory name substrin… (#43791)
If the config does not have a model-type field, we no longer check the name of the folder like for https://huggingface.co/prajjwal1/bert-tiny/blob/main/config.json

Bugfixes and improvements

[docs] deploying (#43241) by @stevhliu
[Trainer] Move NEFTune impl to standalone functions (#43714) by @SunMarc
Fix convert_rope_params_to_dict so it uses rope_theta from the config (#43766) by @hmellor
Bump dev version (#43777) by @qgallouedec
Improved AGENTS.md (#43763) by @tarekziade
Fix-release-ubild (#43773) by @ArthurZucker
unpin torch for CircleCI (#43790) by @ydshieh
[Modular Dependencies] Fixup qwen rms norms (#43772) by @vasqu
fix(testing): Fix BLOOM tokenizer, CLAP audio features, and CLVP text tester usage in tests (#43798) by @harshaljanjani
Remove unconditional train_batch_size assignment (#43770) by @lordaarush
[Repo Consistency] Fix rms norm (#43803) by @vasqu
fix: Prevent AutoTokenizer type mismatch from directory name substrin… (#43791) by @tarekziade
Refactor trainer data_collator and callbacks tests (#43776) by @SunMarc
[core] Faster and thread-safe check_model_inputs implementation (#43765) by @Cyrilvallez
[Trainer] use deepspeed SP process group when Accelerate doesn’t build a mesh (#43799) by @kashif
fix(flaky): enforce manual seed to reduce flakiness (#43794) by @tarekziade
Add TRL CI bot workflow to trigger tests on PR comments (#43809) by @qgallouedec
Fix DeepSpeed model preparation logic in Trainer class (#43780) by @qgallouedec
[docs] reveal more in toctree (#43808) by @stevhliu
Fix markdown documentation (#43076) by @cyyever
Fix slack-report workflow file (#43851) by @ydshieh
add do_sample=False to qwen2_5_vl model tests to stablize the output (#43728) by @kaixuanliu
Fix incorrect timestamp calculation in Qwen3VL Processor (#43659) by @jonathan-fulton
Remove GPU tracking from TrackioCallback and remove env var support (#43371) by @qgallouedec
Add id and resume support to SwanLab integration (#43719) by @i-pj
fix gptoss crash in tp (#43853) by @sywangyi
Delete batch_split from EncoderDecoderCache (#43814) by @cyyever
delete unnecessary code to make moe compatible to full graph compile (#43855) by @kaixuanliu
Update ModelType for Unigram tokenizer (#43860) by @pavel-esir
[docs] Remove pipeline() examples from summarization/translation tasks (#43831) by @Mr-Neutr0n
Fix video interpolation in pe_audio_video (#43811) by @Rocketknight1
Look for the pad_token_id in the right place for Llama4 (#43539) by @Rocketknight1
Fix cardinality error for DETR models without explicit background class (#43513) by @heathdutton
docs: Add Switch Transformers docstring notes and update spectrogram comment (#43336) by @harshaljanjani
[xLSTM] Fix bugs preventing small model training (#43209) by @Anri-Lombard
docs: correct typo 'neccessary' to 'necessary' (#43868) by @thecaptain789
Improve PR comment CI feedback (#43852) by @ydshieh
Fix init weights in remote code (#43768) by @zucchini-nlp
Fix GlmMoeDsaConfig default mlp_layer_types in modular conversion (#43876) by @OiPunk
[MistralCommonBackend] fix loading proc (#43887) by @eustlb
[Jamba] Fallback to slow path and warn instead of error out (#43889) by @vasqu
Fix SwanLab callback to forward resume init args (#43848) by @OiPunk
Fix old tech stack in doc (#43879) by @cyyever
Update TrainingArguments (#43806) by @SunMarc
Remove unnecessary code or checks for PT 2.4+ (#43787) by @cyyever
Make it possible to evaluate when using sequence parallel in HF Trainer (#43517) by @jp1924
[Trainer] Move optimizer cls init to trainer_optimizer.py (#43738) by @SunMarc
fix the error of tests/quantization/fbgemm_fp8/test_fbgemm_fp8.py::Fb… (#43547) by @sywangyi
fix fbgemm fp8 multi-device load failure. (#43581) by @sywangyi
Refactor trainer init (#43807) by @SunMarc
[fix] Use last_hidden_state key from get_image_features for llama4 (#43882) by @tomaarsen
[Docs] Add docs for GLM-OCR and fix EomT-DINOv3 (#43710) by @NielsRogge
Update hub metadata (#43892) by @zucchini-nlp
[fix] DAC model: Apply STE in Dac.from_latents to match the forward pass (#43820) by @harshaljanjani
Separate check_model_inputs into capture_outputs and merge_with_config_defaults + ensure correctness (#43862) by @Cyrilvallez
Remove mask slicing in all eager attentions (#42186) by @Cyrilvallez
Fix expected DAC outputs due to (old) change in CI settings. (#43896) by @ebezzam
Minor changes trainer (#43744) by @SunMarc
adding BC for custom toks accessing slow tok attrs deprecated in v5 (#43898) by @itazap
Fix typo in quantization_operations in PEFT integrations (#43821) by @redpanda1995
Update KERNELS_MIN_VERSION to 0.10.2 to be the same as setup.py (#43753) by @cyyever
Decorate cache updates with no_grad, just in case (#43897) by @Rocketknight1
revert place_model_on_device to property (#43895) by @SunMarc
Train sampler unification (#43138) by @jiosephlee
fix(moe): Handle dtype mismatch in torch._grouped_mm with autocast (#43839) by @Mr-Neutr0n
Fix missing fast image patch counter in Glm46V (#43877) by @OiPunk
Fix old tech stack in doc (#43902) by @cyyever
Mov...

Contributors

kashif, tarekziade, and 46 other contributors

Assets 2

05 Feb 15:44

LysandreJik

v5.1.0

3fa4da7

v5.1.0: EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM, GLM-OCR

New Model additions

EXAONE-MoE

K-EXAONE is a large-scale multilingual language model developed by LG AI Research. Built using a Mixture-of-Experts architecture, K-EXAONE features 236 billion total parameters, with 23 billion active during inference. Performance evaluations across various benchmarks demonstrate that K-EXAONE excels in reasoning, agentic capabilities, general knowledge, multilingual understanding, and long-context processing.

Add EXAONE-MoE implementations (#43080) by @nuxlear

PP-DocLayoutV3

PP-DocLayoutV3 is a unified and high-efficiency model designed for comprehensive layout analysis. It addresses the challenges of complex physical distortions—such as skewing, curving, and adverse lighting—by integrating instance segmentation and reading order prediction into a single, end-to-end framework.

[Model] Add PP-DocLayoutV3 Model Support (#43098) by @zhang-prog

Youtu-LLM

Youtu-LLM is a new, small, yet powerful LLM, contains only 1.96B parameters, supports 128k long context, and has native agentic talents. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in terms of Commonsense, STEM, Coding and Long Context capabilities; in agent-related testing, Youtu-LLM surpasses larger-sized leaders and is truly capable of completing multiple end2end agent tasks.

Add Youtu-LLM model (#43166) by @LuJunru

GlmOcr

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.

[GLM-OCR] GLM-OCR Support (#43391)by @zRzRzRzRzRzRzR

Breaking changes

🚨 T5Gemma2 model structure (#43633) - Makes sure that the attn implementation is set to all sub-configs. The config.encoder.text_config was not getting its attn set because we aren't passing it to PreTrainedModel.init. We can't change the model structure without breaking so I manually re-added a call to self.adjust_attn_implemetation in modeling code
🚨 Generation cache preparation (#43679) - Refactors cache initialization in generation to ensure sliding window configurations are now properly respected. Previously, some models (like Afmoe) created caches without passing the model config, causing sliding window limits to be ignored. This is breaking because models with sliding window attention will now enforce their window size limits during generation, which may change generation behavior or require adjusting sequence lengths in existing code.
🚨 Delete duplicate code in backbone utils (#43323) - This PR cleans up backbone utilities. Specifically, we have currently 5 different config attr to decide which backbone to load, most of which can be merged into one and seem redundant
After this PR, we'll have only one config.backbone_config as a single source of truth. The models will load the backbone from_config and load pretrained weights only if the checkpoint has any weights saved. The overall idea is same as in other composite models. A few config arguments are removed as a result.
🚨 Refactor DETR to updated standards (#41549) - standardizes the DETR model to be closer to other vision models in the library.
🚨Fix floating-point precision in JanusImageProcessor resize (#43187) - replaces an int() with round(), expect light numerical differences
🚨 Remove deprecated AnnotionFormat (#42983) - removes a missnamed class in favour of AnnotationFormat.

Bugfixes and improvements

fix(models): Migrate legacy segmentation_indices to out_indices in BeitConfig (#43505) by @harshaljanjani
[docs] Update torch version (#42135) by @stevhliu
Remove SDPA workarounds for torch 2.4+ (#43754) by @cyyever
add use_deterministic to guarantee the consistency for youtu-llm model (#43759) by @kaixuanliu
fix: add compatible_model_types to suppress model type mismatch warnings (#43495) by @leoneperdigao
Fix T5 v1.1 detection (#43681) by @githubnemo
Add moonshine streaming (#43702) by @eustlb
Allow bi-directional attention for all models (#43705) by @Cyrilvallez
Docs: fix Training step by removing tokenizer from trainer initialization (#43733) by @nesjett
Fix scheduler initialization order (#43711) by @SunMarc
Fix accelerate integration import (#43732) by @SunMarc
Update torch minimum version to 2.4 (#41307) by @cyyever
Fix dtype in image-text-to-text pipe (#43731) by @zucchini-nlp
Preventing initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3 (#43574) by @jp1924
fix: AttributeError for Qwen3_omni_moe (#43593) by @Vallabh-1504
Improve typing/explanations for general model properties (#43712) by @Cyrilvallez
[Kernels] kernel migration updates for activation kernels (#43518) by @ariG23498
[feat] Allow loading T5Gemma2Encoder with AutoModel (#43559) by @tomaarsen
Added S110 - try-except-pass rule (#43687) by @tarekziade
[docs] benchmarks (#43694) by @stevhliu
fix norm_eps dtype (#43669) by @fschlatt
Llava onevision: output align for tests and add image_sizes input param (#43678) by @kaixuanliu
Fix CLIPOutput attentions not being returned (#43657) by @jonathan-fulton
[Attn] Fixup interface usage after refactor (#43706) by @vasqu
Fix model/processor mismatch in SigLIP2 quantization example (#43652) by @jonathan-fulton
Fix crash of custom models in Notebook or Repl (#43690) by @Cyrilvallez
Simplify TrainingArguments docstring (#43568) by @SunMarc
Composite model inherit automatically all important properties from their children (#43691) by @Cyrilvallez
Update configuration_qwen3.py (#43703) by @francesco-bertolotti
fix gptoss tp crash (#43695) by @sywangyi
[CB] Keep order of incoming requests (#43626) by @remi-or
Fix Apertus model loading (NotImplementedError: Cannot copy out of meta tensor; no data!) (#43473) by @xenova
Remove num_frames in ASR pipeline (#43546) by @jiqing-feng
remove ipex and ccl for xpu and cpu (#42852) by @yao-matrix
update guide with new attr name for toks (#43689) by @itazap
Docs: fix typos in Get started (index, quicktour) (#43666) by @CodeByKodi
the cache class is deprecated by @vasqu (direct commit on main)
custom tok init fix (#43591) by @itazap
More export friendly rewrites and skipping the failing ones (#43436) by @IlyasMoutawwakil
Cast byte_count to int in caching_allocator_warmup for MPS compatibility (#43608) by @tobyliu2004
[Docs] Complete missing Llama4 configuration docs (#43460) by @udaymehta
Fix t5 failures (#43374) by @Abdennacer-Badaoui
Add EoMT with DINOv3 backbone (#41212) by @NielsRogge
Update DBRX docs to reference re-uploaded checkpoint (#43196) by @qgallouedec
[loading] Fix forced upcasting to fp32 (#43683) by @Cyrilvallez
Fix FP8Expert for Qwen (#43670) by @yiliu30
Simplify loading structure (#43589) by @Cyrilvallez
[CB] Refactor logic for inputs and outputs outside of the main API (#43569) by @remi-or
Make sure hub errors are surfaced in PreTrainedTokenizerBase (#43675) by @tarekziade
Fix FP8Expert for DeepSeek R1 (#43616) by @yiliu30
Use correct sampling rate in chat template (#43674) by @zucchini-nlp
[HunYuan] Fix RoPE init (#43411) by @vasqu
XPU now supports MoE kernel(MegaBlocks) implementation (#43435) by @YangKai0616
[Sam] Fixup training flags (#43567) by @vasqu
remove torchao.autoquant from transformers (#43561) by @vkuzo
[DeepSpeed] properly handle MoE weight conversion (#43524) by @kashif
Tie zamba weights correctly (#43623) by @zucchini-nlp
[kernels] Centralize kernels tests (#42819) by @MekkCyber
Fix process_bad_commit_report.py: avoid items to appear in null author in the report (#43662) by @ydshieh
Fix KeyError in check_bad_commit.py (#43655) by @ydshieh
[Benchmark] Minor fix for benchmark: kernel is not correctly called (#43428) by @sywangyi
Add explicit commit info to PR comment CI feedback (#43635) by @ydshieh
Better new failures reporting for PR comment CI (#43629) by @ydshieh
[docs] serving (#42853) by @stevhliu
add XPU expected output for MixedInt8GPT2Test (#43615) by @kaixuanliu
Don't modify mappings in tests (#43634) by @Rocketknight1
Allow Attention and Experts to be used as standalone modules (#43622) by @Cyrilvallez
Don't modify tied_weight_keys in-place (#43619) by @zucchini-nlp
[Rope] Revert #43410 and make inheritance implicit again (#43620) by @vasqu
[vllm compat] Separate renaming from conversion ops (#43621) by @Cyrilvallez
refactor + robusts tests for Tensor Parallel (#42809) by @3outeille
add contiguous operation for diffllama model for xpu to enable compile mode. (#43614) by @kaixuanliu
add xpu expectation for lw_detr model (#43339) by @kaixuanliu
minimax_m2: fix failed test case for XPU (#43324) by @kaixuanliu
Improve new failures reporting (#43628) by @ydshieh
Fix...

Contributors

kashif, tarekziade, and 56 other contributors

Assets 2

26 Jan 10:17

LysandreJik

v5.0.0

08810b1

Transformers v5

Transformers v5 release notes

Highlights
Significant API changes: dynamic weight loading, tokenization
Backwards Incompatible Changes
Bugfixes and improvements

We have a migration guide that will be continuously updated available on the main branch, please check it out in case you're facing issues: migration guide.

Highlights

We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and the release is significant: 1200 commits have been pushed to main since the latest minor release. This release removes a lot of long-due deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.

We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.

This release is the full V5 release. It sets in motion something bigger: going forward, starting with v5, we'll now release minor releases every week, rather than every 5 weeks. Expect v5.1 to follow next week, then v5.2 the week that follows, etc.

We're moving forward with this change to ensure you have access to models as soon as they're supported in the library, rather than a few weeks after.

In order to install this release, please do so with the following:

pip install transformers

For us to deliver the best package possible, it is imperative that we have feedback on how the toolkit is currently working for you. Please try it out, and open an issue in case you're facing something inconsistent/a bug.

Transformers version 5 is a community endeavor, and we couldn't have shipped such a massive release without the help of the entire community.

Significant API changes

Dynamic weight loading

We introduce a new weight loading API in transformers, which significantly improves on the previous API. This
weight loading API is designed to apply operations to the checkpoints loaded by transformers.

Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge,
and split the layers according to how they're defined in this new API. These operations are often a necessity when
working with quantization or parallelism algorithms.

This new API is centered around the new WeightConverter class:

class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]

The weight converter is designed to apply a list of operations on the source keys, resulting in target keys. A common
operation done on the attention layers is to fuse the query, key, values layers. Doing so with this API would amount
to defining the following conversion:

conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)

In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single
layer.

This allows us to define a mapping from architecture to a list of weight conversions. Applying those weight conversions
can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method
and helped us remove a lot of technical debt that we accumulated over the past few years.

This results in several improvements:

Much cleaner definition of transformations applied to the checkpoint
Reversible transformations, so loading and saving a checkpoint should result in the same checkpoint
Faster model loading thanks to scheduling of tensor materialization
Enables complex mix of transformations that wouldn't otherwise be possible (such as quantization + MoEs, or TP + MoEs)

Linked PR: #41580

Tokenization

Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object to be a lot more intuitive. With v5, tokenizer definition is much simpler; one can now initialize an empty LlamaTokenizer and train it directly on your corpus.

Defining a new tokenizer object should be as simple as this:

from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.model import BPE

class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>",bos_token="<s>", eos_token="</s>", vocab=None, merges=None ):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }

        else:
            self._vocab = vocab

            self._merges = merges

        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )

Once the tokenizer is defined as above, you can load it with the following: Llama5Tokenizer(). Doing this returns you an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet 😉).

The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.

Backend Architecture Changes: moving away from the slow/fast tokenizer separation

Up to now, transformers maintained two parallel implementations for many tokenizers:

"Slow" tokenizers (tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend.
"Fast" tokenizers (tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.

In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:

TokenizersBackend (preferred): Rust-based tokenizers from the 🤗 tokenizers library. In general it provides optimal performance, but it also offers a lot more features that are commonly adopted across the ecosystem:

handling additional tokens
a full python API for setting and updating
automatic parallelization,
automatic offsets
customization
training

SentencePieceBackend: for tokenizers requiring the sentencepiece library. It inherits from PythonBackend.
PythonBackend: a Python implementations of the features provided by tokenizers. Basically allows adding tokens.
MistralCommonBackend: relies on MistralCommon's tokenization library. (Previously known as the MistralCommonTokenizer)

The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent, you continue to use AutoTokenizer.from_pretrained() as before. This allows transformers to be future-proof and modular to easily support future backends.

Defining a tokenizers outside of the existing backends

We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher-level, without relying on those backends.

To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.

If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:

encode
decode
vocab_size
get_vocab
convert_tokens_to_ids
convert_ids_to_tokens
from_pretrained
save_pretrained
among a few others

API Changes

1. Direct tokenizer initialization with vocab and merges

Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()

This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.

These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:

from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)

This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab; therefore enabling the comp...

Contributors

kashif, nihui, and 151 other contributors

Assets 2

16 Jan 10:40

vasqu

v4.57.6

753d611

Patch release v4.57.6

What's Changed

Another fix for qwen vl models that prevented correctly loading the associated model type - this works together with #41808 of the previous patch release.

Fixed incorrect model_type for qwen2vl and qwen2.5vl when config is saved and loaded again by @i3hz in #41758

Full Changelog: v4.57.5...v4.57.6

Contributors

i3hz

Assets 2

26 Jan 10:02

ArthurZucker

v5.0.0rc3

cb5079f

Release candidate v5.0.0rc3 Pre-release

Pre-release

Release candidate v5.0.0rc3

New models:

[GLM-4.7] GLM-Lite Supoort by @zRzRzRzRzRzRzR in #43031
[GLM-Image] AR Model Support for GLM-Image by @zRzRzRzRzRzRzR in #43100
Add LWDetr model by @sbucaille in #40991
Add LightOnOCR model implementation by @baptiste-aubertin in #41621

What's Changed

We are getting closer and closer to the official release!
This RC is focused on removing more of the deprecated stuff, fixing some minors issues, doc updates.

Update Japanese README to match English version by @lilin-1 in #43069
[docs] Deploying by @stevhliu in #42263
[docs] inference engines by @stevhliu in #42932
Fix typos: Remove duplicate duplicate words words by @efeecllk in #43040
[style] Rework ruff rules and update all files by @Cyrilvallez in #43144
[CB] Minor fix in kwargs by @remi-or in #43147
[Bug] qwen2_5_omni: cap generation length to be less than the max_position_embedding in DiT by @sniper35 in #43068
Fix some deprecated practices in torch 2.9 by @Cyrilvallez in #43167
Fix Fuyu processor width dimension bug in _get_num_multimodal_tokens by @Abhinavexists in #43137
Inherit from PreTrainedTokenizerBase by @juliendenize in #43143
Generation config boolean defaults by @zucchini-nlp in #43000
Fix failing BartModelIntegrationTest by @Sai-Suraj-27 in #43160
fix failure of llava/pixtral by @sywangyi in #42985
GemmaTokenizer: remove redundant whitespace pre-tokenizer by @vaibhav-research in #43106
Support auto_doctring in Processors by @yonigozlan in #42101
Fix failing BitModelIntegrationTest by @Sai-Suraj-27 in #43164
[Fp8] Fix experts by @vasqu in #43154
Docs: improve wording for documentation build instructions by @Sailnagale in #43007
[makefile] Cleanup and improve the rules by @Cyrilvallez in #43171
Some new models added stuff that was already removed by @Cyrilvallez in #43179
Fixes and compilation warning in torchao docs by @merveenoyan in #42909
[cache] Remove all deprecated classes by @Cyrilvallez in #43168
Bump huggingface_hub minimal version by @Wauplin in #43188
Rework check_config_attributes.py by @Cyrilvallez in #43191
Fix generation config validation by @zucchini-nlp in #43175
[style] Use 'x | y' syntax for processors as well by @Wauplin in #43189
Remove deprecated objects by @Cyrilvallez in #43170
fix chunked prefill implementation issue-43082 by @marcndo in #43132
Reduce add_dates verbosity by @yonigozlan in #43184
Add support for MiniMax-M2 by @rogeryoungh in #42028
Fix failing salesforce-ctrl, xlm & gpt-neo model generation tests by @Sai-Suraj-27 in #43180
Less verbose library helpers by @Cyrilvallez in #43197
run all test files on CircleCI by @ydshieh in #43146
Clamp temperature to >=1.0 for Dia generation by @Haseebasif7 in #43029
Fix spelling typos in comments and code by @raimbekovm in #43046
[docs] llama.cpp by @stevhliu in #43185
[docs] gptq formatting fix by @victorywwong in #43216
Grouped beam search from config params by @zucchini-nlp in #42472
[Generate] Allow custom config values in generate config by @vasqu in #43181
Fix failing Pix2StructIntegrationTest by @Sai-Suraj-27 in #43229
Fix missing UTF-8 encoding in check_repo.py for Windows compatibility by @aarushisingh04 in #43123
[Tokenizer] Change default value of return_dict to True in doc string for apply_chat_template by @kashif in #43223
Fix failing PhiIntegrationTests by @Sai-Suraj-27 in #43214
Use HF_TOKEN directly and remove require_read_token by @ydshieh in #43233
Fix failing Owlv2ModelIntegrationTest & OwlViTModelIntegrationTest by @Sai-Suraj-27 in #43182
Fix flashattn wrt quantized models by @SunMarc in #43145
Remove unused imports by @cyyever in #43078
Fix unsafe torch.load() in _load_rng_state allowing arbitrary code execution by @ColeMurray in #43140
Reapply modular to examples by @Cyrilvallez in #43234
More robust diff checks in add_dates by @yonigozlan in #43199
docs: fix grammatical error in README.md by @davidfertube in #43236
Fix typo: seperately → separately in lw_detr converter by @skyvanguard in #43235
Qwen-VL video processor accepts min/max pixels by @zucchini-nlp in #43228
Deprecate dtype per sub config by @zucchini-nlp in #42990
Remove more deprecated objects/args by @Cyrilvallez in #43195
[CB] Soft-reset offloading by @remi-or in #43150
Make benchmark-v2 to be device agnostic, to support more torch built-in devices like xpu by @yao-matrix in #43153
Fix benchmark script by @Cyrilvallez in #43253
Adding to run slow by @IlyasMoutawwakil in #43250
Fix failing Vip-llava model integration test by @Sai-Suraj-27 in #43252
Remove deprecated and unused position_ids in all apply_rotary_pos_emb by @Cyrilvallez in #43255
fix _get_test_info in testing_utils.py by @ydshieh in #43259
Fix failing Hiera, SwiftFormer & LED Model integration tests by @Sai-Suraj-27 in #43225
[style] Fix init isort and align makefile and CI by @Cyrilvallez in #43260
[docs] tensorrt-llm by @stevhliu in #43176
[consistency] Ensure models are added to the _toctree.yml by @Cyrilvallez in #43264
Fix failing PegasusX, Mvp & LED model integration tests by @Sai-Suraj-27 in #43245
[CB] Ensure parallel decoding test passes using FA by @remi-or in #43277
fix crash in when running FSDP2+TP by @sywangyi in #43226
[ci] Fixing some failing tests for important models by @Abdennacer-Badaoui in #43231

New Contributors

@efeecllk made their first contribution in #43040
@sniper35 made their first contribution in #43068
@Abhinavexists made their first contribution in #43137
@vaibhav-research made their first contribution in #43106
@Sailnagale made their first contribution in #43007
@rogeryoungh made their first contribution in #42028
@Haseebasif7 made their first contribution in #43029
@victorywwong made their first contribution in #43216
@aarushisingh04 made their first contributi...

Contributors

kashif, ColeMurray, and 34 other contributors

Assets 2

13 Jan 13:29

vasqu

v4.57.5

542e65f

Patch release v4.57.5

What's Changed

Should not have said last patch 😉 These should be the last remaining fixes that got lost in between patches and the transition to v5.

QwenVL: add skipped keys in setattr as well by @zucchini-nlp in #41808
Fix lr_scheduler_parsing by @SunMarc in #41322

Full Changelog: v4.57.4...v4.57.5

Contributors

SunMarc and zucchini-nlp

Assets 2

13 Jan 11:07

vasqu

v4.57.4

bfd87b9

Patch release v4.57.4

What's Changed

Last patch release for v4: We have a few small fixes for remote generation methods (e.g. group beam search), vLLM, and an offline tokenizer fix (if it's already been cached).

Grouped beam search from config params by @zucchini-nlp in #42472
Handle decorator with optional arguments better @hmellor in #42512
fix: make mistral base check conditional to fix offline loading by @Killusions in #42880

New Contributors

@Killusions made their first contribution in #42880

Full Changelog: v4.57.3...v4.57.4

Contributors

hmellor, Killusions, and zucchini-nlp

Assets 2

08 Jan 10:33

ArthurZucker

v5.0.0rc2

57278c9

Release candidate 5.0.0rc2 Pre-release

Pre-release

What's Changed

This release candidate is focused on fixing AutoTokenizer, expanding the dynamic weight loading support, and improving performances with MoEs!

MoEs and performances:

batched and grouped experts implementations by @IlyasMoutawwakil in #42697
Optimize MoEs for decoding using batched_mm by @IlyasMoutawwakil in #43126

Tokenization:

The main issue with the tokenization refactor is that tokenizer_class are now "enforced" when in most cases they are wrong. This took a while to properly isolate and now we try to use TokenizersBackend whenever we can. #42894 has a much more detailed description of the big changes!

use TokenizersBackend by @ArthurZucker in #42894
Fix convert_tekken_tokenizer by @juliendenize in #42592
refactor more tokenizers - v5 guide update by @itazap in #42768
[Tokenizers] Change treatment of special tokens by @vasqu in #42903

Core

Here we focused on boosting the performances of loading weights on device!

[saving] Simplify general logic by @Cyrilvallez in #42766
Do not rely on config for inferring model dtype by @Cyrilvallez in #42838
Improve BatchFeature: stack list and lists of torch tensors by @yonigozlan in #42750
Remove tied weights from internal attribute if they are not tied by @Cyrilvallez in #42871
Enforce call to post_init and fix all of them by @Cyrilvallez in #42873
Simplify tie weights logic by @Cyrilvallez in #42895
Add buffers to _init_weights for ALL models by @Cyrilvallez in #42309
[loading] Really initialize on meta device for huge perf gains by @Cyrilvallez in #42941
Do not use accelerate hooks if the device_map has only 1 device by @Cyrilvallez in #43019
Move missing weights and non-persistent buffers to correct device earlier by @Cyrilvallez in #43021

New models

Sam: Perception Encoder Audiovisual by @eustlb in #42905
adds jais2 model support by @sarathc-cerebras in #42684
Add Pixio pre-trained models by @LiheYoung in #42795
[Ernie 4.5] Ernie VL models by @vasqu in #39585
[loading][TP] Fix device placement at loading-time, and simplify sharding primitives by @Cyrilvallez in #43003
GLM-ASR Support by @zRzRzRzRzRzRzR in #42875

Quantization

[Devstral] Make sure FP8 conversion works correctly by @patrickvonplaten in #42715
Fp8 dq by @SunMarc in #42926
[Quantization] Removing misleading int8 quantization in Finegrained FP8 by @MekkCyber in #42945
Fix deepspeed + quantization by @SunMarc in #43006

Breaking changes

Mostly around processors!

🚨 Fix ConvNeXt image processor default interpolation to BICUBIC by @lukepayyapilli in #42934
🚨 Fix EfficientNet image processor default interpolation to BICUBIC by @lukepayyapilli in #42956
Add fast version of convert_segmentation_map_to_binary_masks to EoMT by @simonreise in #43073
🚨Fix MobileViT image processor default interpolation to BICUBIC by @lukepayyapilli in #43024

Thanks again to everyone !

New Contributors

@ZX-ModelCloud made their first contribution in #42833
@AYou0207 made their first contribution in #42863
@wasertech made their first contribution in #42864
@preetam1407 made their first contribution in #42685
@Taise228 made their first contribution in #41416
@CandiedCode made their first contribution in #42885
@sarathc-cerebras made their first contribution in #42684
@nandan2003 made their first contribution in #42318
@LiheYoung made their first contribution in #42795
@majiayu000 made their first contribution in #42928
@lukepayyapilli made their first contribution in #42934
@leaderofARS made their first contribution in #42966
@qianyue76 made their first contribution in #43095
@stefgina made their first contribution in #43033
@HuiyingLi made their first contribution in #43084
@raimbekovm made their first contribution in #43038
@PredictiveManish made their first contribution in #43053
@pushkar-hue made their first contribution in #42736
@vykhovanets made their first contribution in #43042
@tanmay2004 made their first contribution in #42737
@atultw made their first contribution in #43061

Full Changelog: v5.0.0rc1...v5.0.0rc2

Contributors

HuiyingLi, CandiedCode, and 32 other contributors

Assets 2

08 Jan 10:15

ArthurZucker

v5.0.0rc1

bdc85cb

Release candidate 5.0.0rc1 Pre-release

Pre-release

What's Changed

This release candidate was focused mostly on quantization support with the new dynamic weight loader, and a few notable 🚨 breaking changes🚨:

Default dtype for any model when using from_pretrained is now auto!

Default auto 🚨 🚨 by @ArthurZucker in #42805

Default shard size when saving a model is now 50GB:

🚨🚨 [saving] Default to 50GB shards, and remove non-safe serialization by @Cyrilvallez in #42734
This is now as fast as before thanks to xet, and is just more convenient on the hub.

Kwargs. They are fundamental to enable integration with vllm and other toosl:

Every model forward() should have **kwargs by @Rocketknight1 in #42603

Dynamic weight loader updates:

Mostly QOL and fixed + support back CPU offloading.

mark params as _is_hf_initialized with DS Zero3 from weight conversion by @winglian in #42626
[loading] Allow loading to happen without threading by @Cyrilvallez in #42619
[loading] Correctly load params during offloading & careful memory considerations by @Cyrilvallez in #42632
allow registration of custom checkpoint conversion mappings by @winglian in #42634

New models:

Add FastVLM by @camilla-deckard in #41112
Lasr model by @eustlb in #42648
[Model] Add PaddleOCR-VL Model Support by @zhang-prog in #42178

Some notable quantization fixes:

Mostly added support for fbgemme , quanto,

Fix fp8 + some enhancement by @SunMarc in #42455
Fix eetq quanto quant methods by @SunMarc in #42557
[Quantization] per tensor quantization kernel by @MekkCyber in #42560
[Quantization] fix fbgemm by @MekkCyber in #42561
[Quantization] Fix FP8 experts replacing by @MekkCyber in #42654
[Quantization] Fix Static FP8 Quantization by @MekkCyber in #42775
[core] fix fp-quant by @MekkCyber in #42613

Peft:

The dynamic weight loader broke small things, this adds glue for all models but MoEs.

FIX Error when trying to load non-LoRA PEFT by @BenjaminBossan in #42663
Fix PEFT integration with new weight loader by @Cyrilvallez in #42701

Misc

Tokenization needed more refactoring, this time its a lot cleaner!

Refactor-tokenization-more by @ArthurZucker in #42563
Only default rope_parameters to empty dict if there is something to put in it by @hmellor in #42651

We omitted a lot of other commits for clarity, but thanks to everyone and the new contributors!