Command-a-vision fix#42642

Merged
zucchini-nlp merged 4 commits into huggingface:main from dongluw:command_a_vision_fix_1
Dec 10, 2025

Conversation

@dongluw
Contributor

@dongluw dongluw commented Dec 5, 2025

What does this PR do?

  • fix a bug in the image resize code, which affects performance

    before fix:
    (image: patch_01)

    after fix:
    (image: patch_1)

  • add test cases

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@dongluw dongluw force-pushed the command_a_vision_fix_1 branch from 4c93a19 to 2450d3d on December 5, 2025, 04:58
@dongluw dongluw force-pushed the command_a_vision_fix_1 branch from 2450d3d to 3066589 on December 5, 2025, 05:02
@Rocketknight1
Member

cc @molbap @yonigozlan @zucchini-nlp

@molbap
Contributor

molbap commented Dec 5, 2025

Hi @dongluw, thanks for the contribution! Indeed there seems to be a flip issue. However, how did you obtain the test image above? I'm surprised our tests haven't caught this, since the output should be all wrong, so a reproducer would help.

@dongluw
Contributor Author

dongluw commented Dec 5, 2025

hey @molbap, I saved the stacked_images to image files patch by patch: https://github.com/dongluw/transformers/blob/9ef13ef775fe5a05c634fb2705a500ef59f28763/src/transformers/models/cohere2_vision/image_processing_cohere2_vision_fast.py#L228

This issue only affects generation quality when images have a very high or low aspect ratio.

The full image is:
(image: output_image17)

Member

@zucchini-nlp zucchini-nlp left a comment


hey, nice catch! I was adapting the processing from GotOCR and misplaced the sizes for (h, w); I commented on it below. I think we need to fix where the aspect ratios are computed.

Member


The issue is actually here: the grids come in (w, h) format but the original size is in (h, w) format. We need to swap the original size format.
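To make the mismatch concrete, here is a minimal, hypothetical sketch (names and grid set are illustrative, not the actual transformers code) of how feeding an (h, w) size into aspect-ratio selection that expects (w, h) transposes the chosen grid for non-square images, while square images are unaffected, which would explain why existing tests didn't catch it:

```python
import numpy as np

# Candidate tiling grids in (num_cols, num_rows) order, i.e. (w, h).
grids = np.array([(1, 1), (2, 1), (1, 2), (2, 2)])

def best_grid(original_size_wh):
    """Pick the grid whose aspect ratio is closest to the image's (w / h)."""
    image_ratio = original_size_wh[0] / original_size_wh[1]
    grid_ratios = grids[:, 0] / grids[:, 1]
    return tuple(grids[np.argmin(np.abs(grid_ratios - image_ratio))])

# A tall image, height 1024 x width 512:
print(best_grid(np.array([512, 1024])))   # correct (w, h) input -> (1, 2): one column, two rows
print(best_grid(np.array([1024, 512])))   # buggy (h, w) input   -> (2, 1): grid is transposed
print(best_grid(np.array([512, 512])))    # square image         -> (1, 1) either way
```

For square (or near-square) inputs both orderings give the same ratio, so only extreme aspect ratios surface the bug, matching the report above.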

Contributor Author


The (h, w) order of original_image_size is derived from the input: https://github.com/dongluw/transformers/blob/cfef59b0d012002cea6ee16e7b68d2e9af0a4f44/src/transformers/models/cohere2_vision/image_processing_cohere2_vision_fast.py#L167-L169

IIUC it would make more sense to change the function call above, since the input has order (original_height, original_width) while the output expects (num_columns, num_rows), which is flipped.

If you can point me to where this part is generated from, I can try to make the change there instead.

Contributor Author


It seems the function call is generated from GotOcr2ImageProcessorFast:
https://github.com/dongluw/transformers/blob/cfef59b0d012002cea6ee16e7b68d2e9af0a4f44/src/transformers/models/got_ocr2/image_processing_got_ocr2_fast.py#L93-L95

However, modifying this class would probably affect other models, so I think we can keep the (h, w) order in this PR.

Btw, the grids are a list of symmetric tuples that doesn't assume a specific dim order, so the fix of flipping the dim order at the return statement still needs to be there IMO.

Member


Yeah, the grids do not assume a specific order. The issue is that we need to choose one layout and follow it for consistency, and since the naming suggests that the (w, h) layout is used in grids, I prefer to keep it that way. Currently the grids assume (w, h), and the number of columns also assumes that layout. The problem is that the original size does not follow the same format, which messes up the aspect ratios.

So original_size = np.stack([image_width, image_height]) looks like the easier approach to me, instead of having to rename more variables for general consistency.

Contributor Author


Okay, improved the naming.

Comment on lines +194 to +201
def test_crop_to_patches_aspect_ratio(self):
"""Test that row/column ordering is correct when cropping non-square images to patches.

This test verifies that patches can be stitched back to reconstruct the original image,
which validates that the row/column ordering in get_optimal_tiled_canvas is correct.
If row/column are swapped, the image would be resized to wrong dimensions and patches
would not match the original content.
"""
Member


thanks for adding a test!
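The stitch-back check described in the test docstring can be sketched roughly as follows. This is a hypothetical, self-contained illustration of the idea (the helper names `crop_to_patches` and `stitch` are made up here, not the actual test's API): crop an image into a row-major grid of equal patches, reassemble them, and verify the result matches the original. If rows and columns were swapped anywhere, the reassembled image would differ for non-square inputs.

```python
import numpy as np

def crop_to_patches(image, num_rows, num_cols):
    """Split a 2D array into num_rows x num_cols equal patches, row-major."""
    h, w = image.shape[:2]
    ph, pw = h // num_rows, w // num_cols
    return [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(num_rows)
        for c in range(num_cols)
    ]

def stitch(patches, num_rows, num_cols):
    """Reassemble row-major patches back into a single array."""
    rows = [np.concatenate(patches[r * num_cols:(r + 1) * num_cols], axis=1)
            for r in range(num_rows)]
    return np.concatenate(rows, axis=0)

image = np.arange(6 * 4).reshape(6, 4)          # tall 6x4 "image"
patches = crop_to_patches(image, num_rows=3, num_cols=2)
assert np.array_equal(stitch(patches, 3, 2), image)
```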

@github-actions
Contributor

github-actions bot commented Dec 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: cohere2_vision

Member

@zucchini-nlp zucchini-nlp left a comment


LGTM, thanks for iterating

Comment on lines +298 to +300
# tiles following (width, height) order to align with aspect ratio convention
tile_size = np.stack([image_width, image_height])
required_scales = candidate_resolutions / tile_size
Member


Great, thanks for explicitly commenting it!
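A minimal sketch of what the quoted lines compute, under the assumption (from the discussion above) that candidate_resolutions is also stored in (width, height) order; the candidate set and sizes here are made up for illustration. With both operands in the same (w, h) layout, the element-wise division yields per-axis scale factors in a consistent layout:

```python
import numpy as np

patch = 512
# Hypothetical candidate canvas resolutions in (width, height) order.
candidate_resolutions = np.array([(c * patch, r * patch)
                                  for c in range(1, 3)
                                  for r in range(1, 3)])
image_width, image_height = 512, 1024                  # a tall image

# tiles following (width, height) order to align with aspect ratio convention
tile_size = np.stack([image_width, image_height])
required_scales = candidate_resolutions / tile_size
print(required_scales)
# The (512, 1024) candidate needs scale (1.0, 1.0) on both axes,
# i.e. it already matches the image exactly.
```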

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp merged commit 1b8ccf1 into huggingface:main Dec 10, 2025
16 checks passed
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* Add test case and update image processing

* Apply suggestions from code review

* improve naming

5 participants