Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
# Candle
@ivarflakstad, would you mind reviewing the candle integration doc, please? The goal is to explain and demonstrate how candle uses Transformers.
Looks good to me!
The docs (https://huggingface.github.io/candle) are very outdated at this point, though.
Much of the information is still correct, but I'd rather we directed users to the candle README.
# MLX
@pcuenca, would you mind reviewing the MLX integration doc, please? The goal is to explain and demonstrate how MLX uses Transformers.
pcuenca left a comment
Took a quick look at the MLX section, made a few comments, and suggested adding the MLX -> transformers integration, but found some problems while testing; will take a deeper look.
I'll review the rest of the sections later.
| ) | ||
| print(output) | ||
| ``` | ||
|
|
There was a problem hiding this comment.
Conversely, you can also load and run MLX-converted weights in Transformers, potentially on different platforms:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pcuenq/tiny-llama-chat-mlx"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
]
# Render the chat messages into a prompt string using the model's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs["input_ids"].to(model.device), do_sample=False, max_new_tokens=100)
print(tokenizer.decode(outputs[0].to("cpu")))
```
Heads up: we may want to skip this for now, as I'm running into friction finding checkpoints that work (incompatible quants or weight shapes). Will look into it a bit.
GitHub got confused with the nested quotes.
hey @pcuenca, lmk if it's ok to skip this example for now so we can merge this. Happy to follow up on it in a future PR :)
Hi @stevhliu, sorry I dropped the ball here! Yes, let's skip for now and get this out!
LysandreJik left a comment
Cool thank you! cc @ArthurZucker on the Executorch part
* fix
* feedback
* fix
Adds ecosystem integration docs for deploying with Candle, ExecuTorch, and MLX.