Fix fp8 + some enhancement by SunMarc · Pull Request #42455 · huggingface/transformers

SunMarc · 2025-11-27T14:18:59Z

What does this PR do?

This PR fixes several fp8-related bugs and adds a few enhancements that make the code simpler to maintain.
Related issue #42442
Thanks to @YangKai0616 for spotting those.

Co-authored-by: Yang Kai <kai.yang@intel.com>

HuggingFaceDocBuilderDev · 2025-11-27T14:31:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

YangKai0616 · 2025-11-28T05:51:35Z

There's one more question:
When running the tests/quantization/finegrained_fp8/test_fp8.py::FP8QuantizerTest::test_quantized_model_multi_accelerator test, we get AssertionError: False is not true. The root cause is that when setUpClass loads the FP8 quantized model, PyTorch's caching allocator reserves significantly more memory than ends up allocated. As a result, Accelerate computes a larger unused_memory and places the entire model on GPU0. This behavior actually seems reasonable, though.
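To illustrate why a larger per-device budget can leave the second device empty, here is a toy sketch of greedy placement. It is not Accelerate's actual algorithm; the function name `greedy_device_map`, the layer sizes, and the budgets are all hypothetical and chosen only to show the effect.

```python
# Toy model of "fill device 0 first, spill to the next device when full".
# Hypothetical sizes in MiB; this is NOT how Accelerate computes its device map.

def greedy_device_map(layer_sizes, budgets):
    """Assign each layer to the first device that still has room."""
    placement = {}
    remaining = list(budgets)
    device = 0
    for name, size in layer_sizes.items():
        # Move on to the next device once the current one cannot hold the layer.
        while device < len(remaining) and remaining[device] < size:
            device += 1
        if device == len(remaining):
            raise ValueError(f"no device has room for {name}")
        remaining[device] -= size
        placement[name] = device
    return placement


# Eight hypothetical layers of 256 MiB each (2048 MiB in total).
layers = {f"layer_{i}": 256 for i in range(8)}

# A generous budget for device 0: every layer lands there.
inflated = greedy_device_map(layers, [8192, 8192])

# A budget smaller than the model: layers spill over to device 1.
capped = greedy_device_map(layers, [1024, 8192])

print(sorted(set(inflated.values())))
print(sorted(set(capped.values())))
```

With the generous budget the set of used devices contains only device 0, which is the situation a multi-accelerator assertion would reject.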

For this test, should we explicitly add:

self.__class__.quantized_model = None
backend_empty_cache(torch_device)

Or, considering the case of test_save_pretrained_multi_accelerators, should we add max_memory={0: "3GB", 1: "3GB"} to both of them?

Could I have your thoughts on this? Thanks!
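A minimal sketch of the second option, assuming the test loads the model with device_map="auto". The helper names `build_max_memory` and `load_fp8_sharded` are hypothetical, and whether a 3GB cap actually spreads the model over both devices depends on how the budget is computed, so treat this as an illustration rather than the fix.

```python
def build_max_memory(num_devices, per_device="3GB"):
    """Build a max_memory mapping such as {0: "3GB", 1: "3GB"}."""
    return {index: per_device for index in range(num_devices)}


def load_fp8_sharded(model_id="meta-llama/Llama-3.2-1B", num_devices=2):
    """Load an FP8-quantized model with an explicit per-device memory cap.

    Needs transformers and the accelerators to be available, so the import
    is done lazily and nothing is loaded when this file is merely imported.
    """
    from transformers import AutoModelForCausalLM, FineGrainedFP8Config

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        max_memory=build_max_memory(num_devices),
        quantization_config=FineGrainedFP8Config(),
    )
    # hf_device_map records which device each module was placed on.
    return model, set(model.hf_device_map.values())


print(build_max_memory(2))
```

A test could then compare the returned set of devices with the devices it expects to be used.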

I wrote a simple reproduction script to observe this situation:

import torch
from transformers import FineGrainedFP8Config, AutoModelForCausalLM
from transformers.testing_utils import (
    backend_empty_cache,
    torch_device,
)

if __name__ == "__main__":
    print(f"torch.cuda.memory_reserved(0) before loading model: {torch.cuda.memory_reserved(0)}, torch.cuda.memory_allocated(0): {torch.cuda.memory_allocated(0)}")
    print(f"torch.cuda.memory_reserved(1) before loading model: {torch.cuda.memory_reserved(1)}, torch.cuda.memory_allocated(1): {torch.cuda.memory_allocated(1)}")
    quantization_config = FineGrainedFP8Config()
    quantized_model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-3.2-1B", device_map=torch_device, quantization_config=quantization_config
        )
    print(f"torch.cuda.memory_reserved(0) after loading model: {torch.cuda.memory_reserved(0)}, torch.cuda.memory_allocated(0): {torch.cuda.memory_allocated(0)}")
    print(f"torch.cuda.memory_reserved(1) after loading model: {torch.cuda.memory_reserved(1)}, torch.cuda.memory_allocated(1): {torch.cuda.memory_allocated(1)}")

    print("##################################################################################################################################################################")
    quantized_model = None
    backend_empty_cache(torch_device)


    print(f"torch.cuda.memory_reserved(0) before loading model: {torch.cuda.memory_reserved(0)}, torch.cuda.memory_allocated(0): {torch.cuda.memory_allocated(0)}")
    print(f"torch.cuda.memory_reserved(1) before loading model: {torch.cuda.memory_reserved(1)}, torch.cuda.memory_allocated(1): {torch.cuda.memory_allocated(1)}")
    model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-3.2-1B", device_map=torch_device, dtype=torch.float32
        )
    print(f"torch.cuda.memory_reserved(0) after loading model: {torch.cuda.memory_reserved(0)}, torch.cuda.memory_allocated(0): {torch.cuda.memory_allocated(0)}")
    print(f"torch.cuda.memory_reserved(1) after loading model: {torch.cuda.memory_reserved(1)}, torch.cuda.memory_allocated(1): {torch.cuda.memory_allocated(1)}")

The script output is:

torch.cuda.memory_reserved(0) before loading model: 0, torch.cuda.memory_allocated(0): 0
torch.cuda.memory_reserved(1) before loading model: 0, torch.cuda.memory_allocated(1): 0
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 218.13it/s, Materializing param=model.norm.weight]
torch.cuda.memory_reserved(0) after loading model: 6054477824, torch.cuda.memory_allocated(0): 2024268288
torch.cuda.memory_reserved(1) after loading model: 0, torch.cuda.memory_allocated(1): 0
##################################################################################################################################################################
torch.cuda.memory_reserved(0) before loading model: 0, torch.cuda.memory_allocated(0): 0
torch.cuda.memory_reserved(1) before loading model: 0, torch.cuda.memory_allocated(1): 0
Loading weights: 100%|██████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 226.73it/s, Materializing param=model.norm.weight]
torch.cuda.memory_reserved(0) after loading model: 4947181568, torch.cuda.memory_allocated(0): 4943258112
torch.cuda.memory_reserved(1) after loading model: 0, torch.cuda.memory_allocated(1): 0
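The gap between reserved and allocated memory can be read off the output above with a little arithmetic. The numbers below are copied from the log; the names `runs` and `slack` are only illustrative.

```python
GIB = 1024**3
MIB = 1024**2

# Values for device 0 after loading, copied from the script output above.
runs = {
    "fp8": {"reserved": 6054477824, "allocated": 2024268288},
    "fp32": {"reserved": 4947181568, "allocated": 4943258112},
}


def slack(stats):
    """Bytes held by the caching allocator but not backing any live tensor."""
    return stats["reserved"] - stats["allocated"]


for name, stats in runs.items():
    gap = slack(stats)
    ratio = stats["reserved"] / stats["allocated"]
    print(f"{name}: gap = {gap} bytes ({gap / MIB:.1f} MiB), reserved/allocated = {ratio:.2f}")
```

For the FP8 load the allocator holds roughly 3.75 GiB more than is allocated (about three times the allocated size), while for the float32 load the gap is under 4 MiB.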

SunMarc · 2025-11-28T10:54:47Z

Any idea where the PyTorch caching allocator kicks in? We have our own caching allocator, but it runs after _get_device_map. By the way, our caching allocator needs some fixes, as we changed the modeling of the fp8 method a bit.

YangKai0616 · 2025-11-28T11:53:54Z

Sorry for the confusion. Regarding torch, I was referring to here. Understood, I'll wait for the fix. Thanks!

SunMarc · 2025-11-28T15:09:44Z

btw @YangKai0616, even when setting _dtype = torch.float32, I don't get the expected output. Can you try this PR to see what results you get? Even with an older version of transformers, I get:
Once upon a time, there was a little girl who loved to play

YangKai0616 · 2025-11-28T16:41:46Z

Using this PR, I can get the expected output as follows:

================================================================================ FAILURES ================================================================================
_________________________________________________________________ FP8QuantizerTest.test_quantized_model __________________________________________________________________

self = <finegrained_fp8.test_fp8.FP8QuantizerTest testMethod=test_quantized_model>

    def test_quantized_model(self):
        """
        Simple test that checks if the quantized model is working properly
        """
        input_ids = self.tokenizer(self.input_text, return_tensors="pt").to(self.device_map)
    
        output = self.quantized_model.generate(**input_ids, max_new_tokens=self.max_new_tokens, do_sample=False)
        output_tokens = self.tokenizer.decode(output[0], skip_special_tokens=True)
>       self.assertEqual(output_tokens, self.EXPECTED_OUTPUT)

tests/quantization/finegrained_fp8/test_fp8.py:159: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/transformers/testing_utils.py:651: in wrapper
    return test_case(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
src/transformers/testing_utils.py:651: in wrapper
    return test_case(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
src/transformers/testing_utils.py:651: in wrapper
    return test_case(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
E   AssertionError: 'Once upon a time, there was a man who was very rich.' != 'Once upon a time, there was a little girl who loved to play'
E   - Once upon a time, there was a man who was very rich.
E   + Once upon a time, there was a little girl who loved to play

My testing environment is:

transformers 5.0.0.dev0 # branch fix-fp8
torch 2.9.1+xpu
2 cards Intel(R) Data Center GPU Max 1550

But I don't have a 4090 or H100, so I can't test the CUDA performance...

SunMarc · 2025-11-28T16:56:07Z

Thanks for confirming that it works on your hardware! I will update it so that it doesn't fail on your side too.
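One way to keep such a test from failing across hardware is to accept a set of known-good generations instead of a single string. The sketch below assumes that approach; the class name and the `generated` stand-in are hypothetical, and the two strings are the ones from the failure log above.

```python
import unittest


class GenerationOutputTest(unittest.TestCase):
    # Both generations seen in this thread, depending on the hardware used.
    EXPECTED_OUTPUTS = {
        "Once upon a time, there was a man who was very rich.",
        "Once upon a time, there was a little girl who loved to play",
    }

    # Stand-in for the decoded model output; a real test would call generate().
    generated = "Once upon a time, there was a man who was very rich."

    def test_output_is_accepted(self):
        # assertIn passes if the output matches any of the known-good strings.
        self.assertIn(self.generated, self.EXPECTED_OUTPUTS)


def run():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(GenerationOutputTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run()
```

The trade-off is that a set of accepted outputs is less strict than an exact match, so it should only contain generations that were verified on some supported device.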

SunMarc · 2025-11-28T16:57:38Z

For the multi-GPU tests, I will probably fix this in a follow-up PR, as I will need to update a lot of methods.

ArthurZucker

Many thanks for the cleanup!

MekkCyber

Very nice cleanup, LGTM! Just a few typos in comments.

src/transformers/integrations/finegrained_fp8.py

src/transformers/core_model_loading.py

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>

github-actions · 2025-12-02T14:25:11Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: finegrained_fp8, mxfp4

* Fix fp8 + some enhancement
* style
* Add coauthor
  Co-authored-by: Yang Kai <kai.yang@intel.com>
* fix
* style
* fix tests
* style
* assertin
* style
* fix
* fix
* Apply suggestions from code review
  Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
---------
Co-authored-by: Yang Kai <kai.yang@intel.com>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>

YangKai0616 · 2025-12-09T08:49:15Z

Hey @SunMarc, would you mind letting me know if there's a plan to fix the multi-GPU testing issue anytime soon? Thank you very much for your help!

SunMarc · 2025-12-16T11:04:48Z

Hey @YangKai0616, I had time to investigate, and this will be fixed in #42818!

Fix fp8 + some enhancement

2388fe7

SunMarc requested a review from MekkCyber November 27, 2025 14:19

SunMarc and others added 2 commits November 27, 2025 14:20

style

260f7c5

Add coauthor

88f9788

Co-authored-by: Yang Kai <kai.yang@intel.com>

fix

e2c8e5b

SunMarc added 3 commits November 28, 2025 15:27

style

1ee8e5e

fix tests

13620e0

style

3ac9276

SunMarc added 2 commits November 28, 2025 16:50

Merge remote-tracking branch 'origin/HEAD' into fix-fp8

2123ed4

assertin

f30a51e

style

aea1af6

ArthurZucker approved these changes Dec 1, 2025

SunMarc added 4 commits December 2, 2025 13:41

Merge remote-tracking branch 'origin/main' into fix-fp8

116bc3a

Merge remote-tracking branch 'origin/main' into fix-fp8

1088d40

fix

72af386

fix

e3f53d2

MekkCyber approved these changes Dec 2, 2025

src/transformers/integrations/finegrained_fp8.py (outdated, resolved)

src/transformers/core_model_loading.py (outdated, resolved)

Apply suggestions from code review

a5e57cc

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>

ArthurZucker merged commit bc7a268 into main Dec 2, 2025
21 of 24 checks passed

ArthurZucker deleted the fix-fp8 branch December 2, 2025 14:42

Aratako mentioned this pull request Dec 2, 2025

[Bug] Ministral 3 model loading fails with TypeError after PR #42455 #42565

Closed

4 tasks

evanqhuang mentioned this pull request Feb 24, 2026

Fix vision tower dtype mismatch with BitsAndBytes quantization 1038lab/ComfyUI-JoyCaption#53

Open

5 participants