Command-a-vision fix#42642

Merged
zucchini-nlp merged 4 commits into huggingface:main from dongluw:command_a_vision_fix_1
Dec 10, 2025

Conversation

@dongluw
Contributor

@dongluw dongluw commented Dec 5, 2025

What does this PR do?

  • fix a bug in the image resize code, which affects performance

    before fix:
    (image: patch_01)

    after fix:
    (image: patch_1)

  • add test cases

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@dongluw dongluw force-pushed the command_a_vision_fix_1 branch from 4c93a19 to 2450d3d on December 5, 2025, 04:58
@dongluw dongluw force-pushed the command_a_vision_fix_1 branch from 2450d3d to 3066589 on December 5, 2025, 05:02
@Rocketknight1
Member

cc @molbap @yonigozlan @zucchini-nlp

@molbap
Contributor

molbap commented Dec 5, 2025

Hi @dongluw, thanks for the contribution! Indeed there seems to be a flip issue. However, how did you obtain the test image above? I'm surprised our tests haven't caught this, since the output should be all wrong, so a reproducer would help.

@dongluw
Contributor Author

dongluw commented Dec 5, 2025

hey @molbap, I saved the stacked_images to image files patch by patch: https://github.com/dongluw/transformers/blob/9ef13ef775fe5a05c634fb2705a500ef59f28763/src/transformers/models/cohere2_vision/image_processing_cohere2_vision_fast.py#L228

This issue only affects generation quality when images have a very high or low aspect ratio.

The full image is:
(image: output_image17)

Member

@zucchini-nlp zucchini-nlp left a comment


hey, nice catch! I was adapting the processing from GotOCR and misplaced the sizes for (h, w); I commented on it below. I think we need to fix where the aspect ratios are computed.

Member


The issue is actually here: the grids come in (w, h) format but the original size is in (h, w) format. We need to swap the original size format.
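To make the mismatch concrete, here is a minimal, hypothetical sketch (names and grid set are illustrative, not the actual transformers code) of how feeding an (h, w) size into aspect-ratio selection that expects (w, h) transposes the chosen grid for non-square images, while square images are unaffected, which would explain why existing tests didn't catch it:

```python
import numpy as np

# Candidate tiling grids in (num_cols, num_rows) order, i.e. (w, h).
grids = np.array([(1, 1), (2, 1), (1, 2), (2, 2)])

def best_grid(original_size_wh):
    """Pick the grid whose aspect ratio is closest to the image's (w / h)."""
    image_ratio = original_size_wh[0] / original_size_wh[1]
    grid_ratios = grids[:, 0] / grids[:, 1]
    return tuple(grids[np.argmin(np.abs(grid_ratios - image_ratio))])

# A tall image, height 1024 x width 512:
print(best_grid(np.array([512, 1024])))   # correct (w, h) input -> (1, 2): one column, two rows
print(best_grid(np.array([1024, 512])))   # buggy (h, w) input   -> (2, 1): grid is transposed
print(best_grid(np.array([512, 512])))    # square image         -> (1, 1) either way
```

For square (or near-square) inputs both orderings give the same ratio, so only extreme aspect ratios surface the bug, matching the report above.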

Contributor Author


The (h, w) order of original_image_size is derived from the input: https://github.com/dongluw/transformers/blob/cfef59b0d012002cea6ee16e7b68d2e9af0a4f44/src/transformers/models/cohere2_vision/image_processing_cohere2_vision_fast.py#L167-L169

IIUC it would make more sense to change the function call above, since the input has order (original_height, original_width) while the output expects (num_columns, num_rows), which is flipped.

If you can point me to where this part is generated from, I can try to make the change there instead.

Contributor Author


It seems the function call is generated from GotOcr2ImageProcessorFast:
https://github.com/dongluw/transformers/blob/cfef59b0d012002cea6ee16e7b68d2e9af0a4f44/src/transformers/models/got_ocr2/image_processing_got_ocr2_fast.py#L93-L95

However, modifying this class would probably affect other models, so I think we can keep the (h, w) order in this PR.

Btw, the grids are a list of symmetric tuples that doesn't assume a specific dim order, so the fix of flipping the dim order at the return statement still needs to be there IMO.

Member


Yeah, the grids do not assume a specific order. The issue is that we need to choose one layout and follow it for consistency, and since the naming suggests that the (w, h) layout is used in grids, I prefer to keep it that way. Currently the grids assume (w, h), and the number of columns also assumes that layout. The problem is that the original size does not follow the same format, which messes up the aspect ratios.

So original_size = np.stack([image_width, image_height]) looks like the easier approach to me, instead of having to rename more variables for general consistency.

Contributor Author


Okay, improved the naming.

Comment on lines +194 to +201
def test_crop_to_patches_aspect_ratio(self):
"""Test that row/column ordering is correct when cropping non-square images to patches.

This test verifies that patches can be stitched back to reconstruct the original image,
which validates that the row/column ordering in get_optimal_tiled_canvas is correct.
If row/column are swapped, the image would be resized to wrong dimensions and patches
would not match the original content.
"""
Member


thanks for adding a test!
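The stitch-back check described in the test docstring can be sketched roughly as follows. This is a hypothetical, self-contained illustration of the idea (the helper names `crop_to_patches` and `stitch` are made up here, not the actual test's API): crop an image into a row-major grid of equal patches, reassemble them, and verify the result matches the original. If rows and columns were swapped anywhere, the reassembled image would differ for non-square inputs.

```python
import numpy as np

def crop_to_patches(image, num_rows, num_cols):
    """Split a 2D array into num_rows x num_cols equal patches, row-major."""
    h, w = image.shape[:2]
    ph, pw = h // num_rows, w // num_cols
    return [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(num_rows)
        for c in range(num_cols)
    ]

def stitch(patches, num_rows, num_cols):
    """Reassemble row-major patches back into a single array."""
    rows = [np.concatenate(patches[r * num_cols:(r + 1) * num_cols], axis=1)
            for r in range(num_rows)]
    return np.concatenate(rows, axis=0)

image = np.arange(6 * 4).reshape(6, 4)          # tall 6x4 "image"
patches = crop_to_patches(image, num_rows=3, num_cols=2)
assert np.array_equal(stitch(patches, 3, 2), image)
```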

@github-actions
Contributor

github-actions bot commented Dec 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: cohere2_vision

Member

@zucchini-nlp zucchini-nlp left a comment


LGTM, thanks for iterating

Comment on lines +298 to +300
# tiles following (width, height) order to align with aspect ratio convention
tile_size = np.stack([image_width, image_height])
required_scales = candidate_resolutions / tile_size
Member


Great, thanks for explicitly commenting it!
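A minimal sketch of what the quoted lines compute, under the assumption (from the discussion above) that candidate_resolutions is also stored in (width, height) order; the candidate set and sizes here are made up for illustration. With both operands in the same (w, h) layout, the element-wise division yields per-axis scale factors in a consistent layout:

```python
import numpy as np

patch = 512
# Hypothetical candidate canvas resolutions in (width, height) order.
candidate_resolutions = np.array([(c * patch, r * patch)
                                  for c in range(1, 3)
                                  for r in range(1, 3)])
image_width, image_height = 512, 1024                  # a tall image

# tiles following (width, height) order to align with aspect ratio convention
tile_size = np.stack([image_width, image_height])
required_scales = candidate_resolutions / tile_size
print(required_scales)
# The (512, 1024) candidate needs scale (1.0, 1.0) on both axes,
# i.e. it already matches the image exactly.
```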

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp merged commit 1b8ccf1 into huggingface:main Dec 10, 2025
16 checks passed
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* Add test case and update image processing

* Apply suggestions from code review

* improve naming

5 participants