gh-146192: Add base32 support to binascii by kangtastic · Pull Request #146193 · python/cpython

kangtastic · 2026-03-20T08:00:51Z

Synopsis

Add base32 encoder and decoder functions implemented in C to binascii and use them to greatly improve the performance and reduce the memory usage of the existing base32 codec functions in base64.

No API or documentation changes are necessary with respect to any functions in base64, and all existing unit tests for those functions continue to pass without modification.

Resolves: gh-146192

Discussion

The base32-related functions in base64 are now wrappers for the new functions in binascii, as envisioned in the docs:

The binascii module contains a number of methods to convert between binary and various ASCII-encoded binary representations. Normally, you will not use these functions directly but use wrapper modules like uu or base64 instead. The binascii module contains low-level functions written in C for greater speed that are used by the higher-level modules.
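As a rough illustration of the wrapper relationship described above, the sketch below round-trips a short payload through the public `base64` functions, whose signatures this PR leaves unchanged. The `hasattr` check is only an assumption about how one might detect an interpreter that includes the new low-level function; the name `b2a_base32` is taken from the docs diff in this PR and will be absent on older builds.

```python
import base64
import binascii

payload = b"hello"

# Public API: unchanged by this PR, so this runs with or without the C accelerator.
encoded = base64.b32encode(payload)
decoded = base64.b32decode(encoded)

# Assumption: the low-level encoder added by this PR is exposed under this name.
accelerated = hasattr(binascii, "b2a_base32")

print(encoded, decoded, "C accelerator present:", accelerated)
```

Code that sticks to `base64.b32encode()` and `base64.b32decode()` should not need to change either way.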

Comments and questions are welcome.

Benchmarks

Benchmark script

# bench_b32.py

# Note: Can be EXTREMELY SLOW on unmodified mainline CPython.

import base64
import sys
import timeit
import tracemalloc

funcs = [(base64.b64encode, base64.b64decode), # sanity check/comparison
         (base64.b32encode, base64.b32decode),
         (base64.b32hexencode, base64.b32hexdecode)]

def mb(n):
    return f"{n / 1024 / 1024:.3f}"

def stats(func, data, t, m):
    name, n, bps = func.__qualname__, len(data), len(data) / t
    print(f"{name:<16}{n:<16}{t:<11.3f}{mb(bps):<13}{mb(m)}")

if __name__ == "__main__":
    print(f"Python {sys.version}\n")
    print(f"function        processed (b)   time (s)   avg (MB/s)   mem (MB)\n")
    data = b"a" * int(sys.argv[1]) * 1024 * 1024
    for fenc, fdec in funcs:
        tracemalloc.start()
        enc = fenc(data)
        menc = tracemalloc.get_traced_memory()[1] - len(enc)
        tracemalloc.stop()
        tenc = timeit.timeit("fenc(data)", number=1, globals=globals())
        stats(fenc, data, tenc, menc)

        tracemalloc.start()
        dec = fdec(enc)  # decode here, so the memory figure is for the decoder
        mdec = tracemalloc.get_traced_memory()[1] - len(dec)
        tracemalloc.stop()
        tdec = timeit.timeit("fdec(enc)", number=1, globals=globals())
        stats(fdec, enc, tdec, mdec)

Unmodified mainline CPython

$ ./python bench_b32.py 16
Python 3.15.0a7+ (heads/main:d357a7dbf38, Mar 19 2026, 23:22:25) [GCC 15.2.0]

function        processed (b)   time (s)   avg (MB/s)   mem (MB)

b64encode       16777216        0.015      1088.370     0.000
b64decode       22369624        0.017      1264.389     0.000
b32encode       16777216        2.308      6.933        17.382
b32decode       26843552        3.389      7.553        27.787
b32hexencode    16777216        2.338      6.843        17.379
b32hexdecode    26843552        3.388      7.557        27.787

With this PR

$ ./python bench_b32.py 16
Python 3.15.0a7+ (heads/base32-accel:72fd0f0302a, Mar 20 2026, 00:04:23) [GCC 15.2.0]

function        processed (b)   time (s)   avg (MB/s)   mem (MB)

b64encode       16777216        0.015      1084.957     0.000
b64decode       22369624        0.016      1363.524     0.000
b32encode       16777216        0.017      967.528      0.000
b32decode       26843552        0.016      1581.002     0.000
b32hexencode    16777216        0.016      995.277      0.000
b32hexdecode    26843552        0.016      1588.353     0.000

Encoding performance is improved by ~150x, decoding performance is improved by ~200x,
and no auxiliary memory is used.
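For anyone cross-checking the tables, here is a small sketch that recomputes the encoded sizes shown in the "processed (b)" column for the decode rows, and the throughput ratios behind the approximate speedup figures. The throughput numbers are copied from the two runs above; the variable names are just for illustration.

```python
MIB = 1024 * 1024
raw_len = 16 * MIB  # bench_b32.py was run with the argument 16

# Padded output sizes: base64 maps 3 bytes to 4 chars, base32 maps 5 bytes to 8 chars.
b64_len = (raw_len + 2) // 3 * 4
b32_len = (raw_len + 4) // 5 * 8

# Throughput in MB/s, copied from the tables above.
before = {"b32encode": 6.933, "b32decode": 7.553}
after = {"b32encode": 967.528, "b32decode": 1581.002}
speedup = {name: after[name] / before[name] for name in before}

print(b64_len, b32_len)
for name, factor in speedup.items():
    print(f"{name}: {factor:.0f}x")
```

These are single-run timings (`number=1`), so the ratios are indicative rather than precise.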

📚 Documentation preview 📚: https://cpython-previews--146193.org.readthedocs.build/

serhiy-storchaka · 2026-03-20T15:15:33Z

You can now update your PR, @kangtastic.

kangtastic · 2026-03-20T15:50:58Z

@serhiy-storchaka Already on it 😄

- Use the new `alphabet` parameter in `binascii`
- Remove `binascii.a2b_base32hex()` and `binascii.b2a_base32hex()`
- Change value for `.. versionadded::` ReST directive in docs for new `binascii` functions to "next" instead of "3.15"

serhiy-storchaka

I added some suggestions, but the core LGTM.

Please add assertions for new alphabets in test_constants.

serhiy-storchaka · 2026-03-21T09:29:36Z

Doc/library/binascii.rst

+
+.. function:: b2a_base32(data, /, *, alphabet=BASE32_ALPHABET)
+
+   Convert binary data to a line(s) of ASCII characters in base32 coding,


It is a single line.

I will add wrapcol in a separate issue.

serhiy-storchaka · 2026-03-21T09:37:23Z

Doc/library/binascii.rst

+
+   Convert base32 data back to binary and return the binary data.
+
+   Valid base32 data:


This list is incomplete and redundant. I think it is better to follow the example of ascii85 and base85 (with a reference to the RFC). Mention that the mapping is case-sensitive and that no optional mapping of the digits "0" and "1" to the letters "O", "I" or "l" is used.
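To make the distinction concrete, the following sketch contrasts strict decoding with the optional `casefold` and `map01` handling that the existing `base64.b32decode()` wrapper provides. It uses only the public `base64` API; `try_decode` is a hypothetical helper written for this illustration.

```python
import base64
import binascii

def try_decode(data, **kwargs):
    """Return the decoded bytes, or None if the input is rejected."""
    try:
        return base64.b32decode(data, **kwargs)
    except binascii.Error:
        return None

strict_lower = try_decode(b"nbswy3dp")                 # lowercase is not in the alphabet
folded_lower = try_decode(b"nbswy3dp", casefold=True)  # wrapper accepts lowercase input
strict_zero = try_decode(b"0M======")                  # "0" is not in the alphabet
mapped_zero = try_decode(b"0M======", map01=b"L")      # wrapper maps 0 to O and 1 to L

print(strict_lower, folded_lower, strict_zero, mapped_zero)
```

The low-level decoder only needs to implement the strict rows; the lenient ones stay in the Python wrapper.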

serhiy-storchaka · 2026-03-21T09:51:12Z

Doc/library/binascii.rst


+.. data:: BASE32_ALPHABET
+
+   The base32 alphabet according to :rfc:`4648`.


Suggested change

The base32 alphabet according to :rfc:`4648`.

The Base 32 alphabet according to :rfc:`4648`.

serhiy-storchaka · 2026-03-21T09:54:19Z

Doc/library/binascii.rst

+
+.. data:: BASE32HEX_ALPHABET
+
+   The "Extended Hex" base32hex alphabet according to :rfc:`4648`.


Suggested change

The "Extended Hex" base32hex alphabet according to :rfc:`4648`.

The "Extended Hex" Base 32 alphabet according to :rfc:`4648`.

These are the names used in the captions of Tables 3 and 4 in RFC 4648.

Oh, we can even refer directly to the table:

Suggested change

The "Extended Hex" base32hex alphabet according to :rfc:`4648`.

The "Extended Hex" Base 32 alphabet according to :rfc:`4648`, table 4.

Add this also for Base 64 alphabets if you choose this variant.

I was wondering if this would come up. RFC 4648 uses all four of the terms "Base 32", "Base32", "base 32", and "base32" to refer to this encoding at various points, but it also states e.g.:

This encoding may be referred to as "base32hex". This encoding should not be regarded as the same as the "base32" encoding and should not be referred to as only "base32".

and e.g.:

One property with this alphabet, which the base64 and base32 alphabets lack...

thus implying that "base32" and "base32hex" are preferred, even if the rest of the document doesn't adhere to the implication.

Anyway, I'll refer to it as "Base 32" in docs for now to fit what's already there, and not reference the table number or touch any Base64 stuff so as to keep the scope of this PR limited.

serhiy-storchaka · 2026-03-21T10:06:42Z

Lib/base64.py

    if len(s) % 8:
        raise binascii.Error('Incorrect padding')


Shouldn't this be handled in the C code?
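The observable behaviour in question is sketched below using the public wrapper: input whose length is not a multiple of 8 is rejected with `binascii.Error`, regardless of which layer performs the check. `padding_error` is a hypothetical helper for this illustration, and the exact error message is deliberately not relied upon.

```python
import base64
import binascii

def padding_error(data):
    """Return the error message if decoding fails, else None."""
    try:
        base64.b32decode(data)
    except binascii.Error as exc:
        return str(exc)
    return None

ok = padding_error(b"NBSWY3DP")  # 8 characters: one complete group
bad = padding_error(b"NBSWY3D")  # 7 characters: not a multiple of 8

print(ok, bad)
```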

serhiy-storchaka · 2026-03-21T10:18:44Z

Lib/base64.py

-        _b32rev[alphabet] = {v: k for k, v in enumerate(alphabet)}
+
+def _b32decode_prepare(s, casefold=False, map01=None):
    s = _bytes_from_decode_data(s)


This is only needed if map01 is not None.

Correction: it is also needed if casefold is true, for input like 'ß' or 'ﬃ'.
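A short sketch of why the conversion matters when `casefold` is true: uppercasing a `str` can expand some non-ASCII characters into plain ASCII letters, so such input has to be rejected before any case folding. `rejects_non_ascii` is a hypothetical helper; it catches `ValueError`, which also covers `binascii.Error` because that is a subclass.

```python
import base64

# str.upper() expands these characters into ASCII letters.
expanded = ["\u00df".upper(), "\ufb03".upper()]  # sharp s, "ffi" ligature

def rejects_non_ascii(text):
    """Return True if b32decode refuses the given str input."""
    try:
        base64.b32decode(text, casefold=True)
    except ValueError:
        return True
    return False

print(expanded, rejects_non_ascii("\u00df"), rejects_non_ascii("\ufb03"))
```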

serhiy-storchaka · 2026-03-21T10:20:00Z

Lib/base64.py

-    if alphabet not in _b32rev:
-        _b32rev[alphabet] = {v: k for k, v in enumerate(alphabet)}
+
+def _b32decode_prepare(s, casefold=False, map01=None):


I suggest inlining this function. map01 handling is only needed for the standard alphabet, and the code for casefold is trivial.

serhiy-storchaka · 2026-03-21T10:33:06Z

Modules/binascii.c

+    *
+    alphabet: Py_buffer(c_default="{NULL, NULL}") = BASE32_ALPHABET
+
+base32-code line of data.


Suggested change

base32-code line of data.

Base32-code line of data.

- Update docs to refer to "Base 32" and "Base32"
- Update docs to better explain `binascii.a2b_base32()`
- Inline helper function in `base64`
- Add forgotten tests for presence of alphabet module globals

gpshead · 2026-03-22T05:25:38Z

Doc/library/binascii.rst

+
+.. data:: BASE32HEX_ALPHABET
+
+   The "Extended Hex" Base 32 alphabet according to :rfc:`4648`.


Maybe mention that this one maintains sort order when encoded.

gpshead · 2026-03-22T05:28:45Z

Doc/library/binascii.rst

+   Optional *alphabet* must be a :class:`bytes` object of length 32 which
+   specifies an alternative alphabet.
+
+   Invalid base32 data will raise :exc:`binascii.Error`.


How invalid? ex: I think our implementation has always ignored excess bits in the final character that don't map to a byte. We should be explicit about this (and cover the behavior in a test). I think it is reasonable to change our behavior to conform strictly to the "MAY choose to" in https://datatracker.ietf.org/doc/html/rfc4648#section-3.5 and raise the Error when there are excess bits in the input.

We should also have base64's strict_mode=True do the same.

(I do not think we need strict_mode for a2b_base32, we're always being strict here)

I think this is a separate issue. This PR does not change the behavior of existing decoder.

There are more important issues than checking for canonicity of input. They should be fixed first.
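For context on what "excess bits" means here, the sketch below checks canonicity by re-encoding. It is not part of this PR: `is_canonical_b32` is a hypothetical helper, written so that it gives the same answer whether the decoder ignores the spare bits or rejects them.

```python
import base64
import binascii

def is_canonical_b32(encoded):
    """Return True if decoding and re-encoding reproduces the input exactly."""
    try:
        decoded = base64.b32decode(encoded)
    except binascii.Error:
        return False
    return base64.b32encode(decoded) == encoded

# One byte occupies 8 of the 10 bits in two characters, leaving 2 spare bits.
canonical = is_canonical_b32(b"OM======")      # "M" = 01100: spare bits are zero
non_canonical = is_canonical_b32(b"ON======")  # "N" = 01101: a spare bit is set

print(canonical, non_canonical)
```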

serhiy-storchaka

Please also add a What's New entry.

serhiy-storchaka · 2026-03-22T07:55:20Z

Doc/library/binascii.rst

+   Optional *alphabet* must be a :class:`bytes` object of length 32 which
+   specifies an alternative alphabet.
+
+   Invalid base32 data will raise :exc:`binascii.Error`.


I think this is a separate issue. This PR does not change the behavior of existing decoder.

There are more important issues than checking for canonicity of input. They should be fixed first.

serhiy-storchaka · 2026-03-22T08:04:57Z

Doc/library/binascii.rst

-   * Contains no excess data after padding (including excess padding, newlines, etc.).
-   * Does not start with padding.
+   .. note::
+      By default, this function does not map lowercase characters (which are


Remove "by default". There are no options for non-default behavior.

Wouldn't the user specifying an alternative alphabet be a non-default behavior?

It does not allow mapping several characters to the same code.

It would work for folding 100% lowercase input. Maybe "By itself" instead of "By default"? But I don't mind removing it since it's such a rare hypothetical.

serhiy-storchaka · 2026-03-22T08:07:31Z

Lib/base64.py

-        _b32rev[alphabet] = {v: k for k, v in enumerate(alphabet)}
+
+def _b32decode_prepare(s, casefold=False, map01=None):
    s = _bytes_from_decode_data(s)


Correction: it is also needed if casefold is true, for input like 'ß' or 'ﬃ'.

Lib/test/test_binascii.py

- Revise docs
- Add whatsnew entry
- Minor whitespace change in tests

Referring to a group of 8 bytes as an "octet" may cause confusion, because the term is already commonly used in some languages to refer to a group of 8 bits (i.e. a byte). "Octa" is a suitable preexisting alternative for a group of 64 bits [1] (used by Knuth himself, at that). "Octad" was considered, but it, too, historically refers to a byte.

Also rename "quintet" to "quint". "Pentad" was considered, but it historically refers to a group of 5 bits.

[1] https://en.wikipedia.org/wiki/Units_of_information

serhiy-storchaka

LGTM. 👍

bedevere-app bot mentioned this pull request Mar 20, 2026

C accelerator for Base32 character encoding #146192

Open

serhiy-storchaka requested review from gpshead and serhiy-storchaka March 20, 2026 09:00

Update PR for python#145981

bf1308f

kangtastic force-pushed the base32-accel branch from db96a3f to bf1308f Compare March 20, 2026 16:01

kangtastic marked this pull request as ready for review March 20, 2026 16:03

bedevere-app bot added the awaiting review label Mar 20, 2026

serhiy-storchaka reviewed Mar 21, 2026

kangtastic added 2 commits March 21, 2026 07:56

Address reviewer feedback

a9a7d26

Update generated files

6f80c54

gpshead reviewed Mar 22, 2026

serhiy-storchaka reviewed Mar 22, 2026

kangtastic added 2 commits March 22, 2026 02:20

Address more reviewer feedback

4c82070

kangtastic requested a review from AA-Turner as a code owner March 22, 2026 09:43

serhiy-storchaka approved these changes Mar 22, 2026

bedevere-app bot added awaiting merge and removed awaiting review labels Mar 22, 2026

