ft_utils

Fine-Grained Parallelism with Free-Threaded Python and ft_utils

Legacy Python, the GIL, and Structural Limitations

The Global Interpreter Lock (GIL) in CPython has long been a pragmatic compromise. It protects memory management internals from concurrent access at the cost of serializing all Python bytecode execution. This single lock simplifies reference counting and garbage collection, but it prevents Python bytecode from using more than one CPU core at a time within a single process.

Even in C extensions, while Py_BEGIN_ALLOW_THREADS enables the GIL to be dropped, any interaction with Python objects—reference counts, attribute access, even isinstance checks—requires reacquiring it. This fragments performance and forces programmers into a mental model where concurrent control and Python data manipulation are mutually exclusive.

The GIL also has non-obvious side effects:

The result is that GIL-based Python gives the appearance of concurrency while hiding deep systemic serialization. Most high-performance developers route around it—via multiprocessing, C++, or GPU offload—each with its own overhead and disconnect from Python ergonomics.

Free-Threaded Python (FTP): Removing the GIL

FTP (introduced experimentally in CPython 3.13 and maturing in 3.14) removes the GIL and introduces per-object locking semantics to enable real concurrency.

Key concepts

Rather than reintroduce coarse locks, the FTP model shifts toward critical sections, atomic operations, and lock elision strategies. This opens the door for performance models much closer to what C++ and Rust developers expect.
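Before relying on any of this, it helps to confirm that the interpreter in use is actually running without the GIL. The following is a minimal sketch, assuming CPython 3.13 or later for sys._is_gil_enabled() (on older interpreters the sketch simply reports the GIL as enabled). The helper name gil_status is illustrative and not part of any library.

```python
import sys
import sysconfig


def gil_status() -> tuple[bool, bool]:
    """Return (free_threaded_build, gil_enabled_now) for this interpreter."""
    # Py_GIL_DISABLED is set to 1 only in free-threaded builds of CPython.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists from CPython 3.13; assume the GIL is on otherwise.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = bool(probe()) if probe is not None else True
    return free_threaded_build, gil_enabled_now


if __name__ == "__main__":
    build, enabled = gil_status()
    print(f"free-threaded build: {build}, GIL currently enabled: {enabled}")
```

The two values are reported separately because a free-threaded build can still run with the GIL re-enabled, for example when an extension module that does not declare free-threading support is imported.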

ft_utils: Fine-Grained Control for the Free-Threaded World

ft_utils provides infrastructure for working with FTP in production contexts. It includes:

This is not abstraction for abstraction’s sake: these are tools for scalable development, giving exact, fine-grained, and easy-to-use control over thread-based parallel execution architectures.

AtomicInt64 and AtomicReference

AtomicInt64 provides lock-free atomic manipulation of integer state:

from ft_utils.concurrency import AtomicInt64

counter = AtomicInt64(0)
# Multiple threads can safely increment:
counter.incr()
# Or using arithmetic operators
counter += 1

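In the absence of the GIL, this matters: a naive += 1 on a shared int is a read-modify-write sequence, so concurrent updates can be lost without explicit synchronization. The GIL never guaranteed atomicity here either, but without it such races surface far more readily.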

AtomicReference generalizes this to object references. It enables low-level constructs like lock-free queues, hazard pointers, or generational GC barriers, depending on your architecture.

AtomicFlag is a bool abstraction over AtomicInt64.

These atomic ops are implemented in C using platform-native intrinsics, giving predictable, memory-fenced semantics consistent with modern concurrent programming.
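To make the compare-and-set pattern behind such lock-free structures concrete, here is a minimal sketch that emulates an atomic reference with a plain threading.Lock. It is an illustration of the retry loop only, not the ft_utils implementation: the class EmulatedAtomicReference, its method names, and the helper functions are all hypothetical, and a real atomic reference would not take a lock internally.

```python
import threading
from typing import Any


class EmulatedAtomicReference:
    """Lock-based stand-in for an atomic reference (illustration only)."""

    def __init__(self, value: Any = None) -> None:
        self._value = value
        self._lock = threading.Lock()

    def get(self) -> Any:
        with self._lock:
            return self._value

    def compare_exchange(self, expected: Any, desired: Any) -> bool:
        """Store `desired` only if the current value is `expected` (by identity)."""
        with self._lock:
            if self._value is expected:
                self._value = desired
                return True
            return False


def push(head: EmulatedAtomicReference, item: Any) -> None:
    """Push onto a linked stack using a compare-and-set retry loop."""
    while True:
        old = head.get()
        new = (item, old)  # the new node points at the previous head
        if head.compare_exchange(old, new):
            return  # the head was unchanged since we read it, so the push is done
        # Another thread won the race: re-read the head and try again.


def run_push_demo(num_threads: int = 4, pushes_per_thread: int = 1000) -> int:
    """Push from several threads and return the number of nodes on the stack."""
    head = EmulatedAtomicReference(None)

    def worker(tid: int) -> None:
        for i in range(pushes_per_thread):
            push(head, (tid, i))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    count, node = 0, head.get()
    while node is not None:
        count += 1
        node = node[1]
    return count


if __name__ == "__main__":
    print(run_push_demo())
```

The important part is the loop in push: no thread ever waits while holding the head, and a failed compare-and-set simply means another thread made progress first.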

CPython New Features

Atomics

Access to many native atomic operations (for example _Py_atomic_add_uint64) has been added to the CPython API. ft_utils provides ft_compat.h, which backports these to earlier versions of CPython to make cross-version coding easier.

Critical Sections: Per-Object Locking

Critical sections in FTP are explicit per-object locks that allow serial access to shared state. In Cython:

with cython.critical_section(myobj):
    # Safe access

Under the hood, each Python object now carries an optional lock. This unlocks fine-grained synchronization models—reader-writer patterns, lock striping, and even lock-free algorithms with fallback pessimism.
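Lock striping, one of the patterns mentioned above, can be sketched in pure Python with ordinary threading.Lock objects. This is a hedged illustration of the idea rather than how critical sections or ft_utils are implemented; StripedCounter and run_striped_demo are hypothetical names introduced for the example.

```python
import threading
from collections import defaultdict


class StripedCounter:
    """Counts keys using several locks so that unrelated keys rarely contend."""

    def __init__(self, num_stripes: int = 16) -> None:
        self._locks = [threading.Lock() for _ in range(num_stripes)]
        self._shards = [defaultdict(int) for _ in range(num_stripes)]

    def _stripe(self, key) -> int:
        # Each key always maps to the same stripe, so one lock guards it.
        return hash(key) % len(self._locks)

    def increment(self, key, amount: int = 1) -> None:
        i = self._stripe(key)
        with self._locks[i]:
            self._shards[i][key] += amount

    def get(self, key) -> int:
        i = self._stripe(key)
        with self._locks[i]:
            return self._shards[i].get(key, 0)


def run_striped_demo(num_threads: int = 4, per_key: int = 200) -> dict:
    """Increment a handful of keys from several threads and return the totals."""
    counter = StripedCounter(num_stripes=4)
    keys = ["a", "b", "c", "d", "e", "f", "g", "h"]

    def worker() -> None:
        for _ in range(per_key):
            for key in keys:
                counter.increment(key)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {key: counter.get(key) for key in keys}


if __name__ == "__main__":
    print(run_striped_demo())
```

Threads touching keys in different stripes never wait on each other, which is the same benefit per-object locking gives when threads work on different objects.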

Unlike threading.Lock, critical sections are:

Critical Sections vs Mutexes in Free-Threaded Python

Overview

Critical sections and mutexes both provide mutual exclusion to protect shared resources from concurrent access. However, they differ in scope, granularity, performance, and implementation strategy, particularly in the context of Free-Threaded Python (FTP).

Conceptual Difference

Scope and Granularity

Performance

Integration with Python Semantics

Example Comparison

with cython.critical_section(my_object):
    my_object.value += 1 # safe from race conditions

The excerpt below is from the batch executor source code in ft_utils, where a critical section protects the refilling of the buffer. Note how the critical section protects execution on a per-object basis, compared to a mutex, which would protect only a block of code.

    Py_BEGIN_CRITICAL_SECTION(self);
    index = _Py_atomic_load_ssize(&(self->index));
    if (index < size) {
      err = 0;
    } else {
      err = BatchExecutorObject_fill_buffer(self);
    }
    Py_END_CRITICAL_SECTION();

The equivalent increment guarded by a mutex:

lock = threading.Lock()
with lock:
    my_object.value += 1 # safe, but scope and ownership are not enforced

Design Philosophy

Summary Table

| Feature | Mutex | Critical Section (FTP) |
| --- | --- | --- |
| Scope | Arbitrary | Tied to Python objects |
| Performance | OS/kernel-level (slower) | Fast user-space, object-specific |
| Python Awareness | No | Yes |
| Deadlock Risk | Higher | Lower (if per-object) |
| Use Case | Manual general locking | Fine-grained Python object protection |
| Default in FTP | No | Yes |
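The deadlock risk attributed to mutexes comes largely from lock ordering: when an operation needs two locks, the programmer must make every thread take them in the same order. A minimal sketch of that discipline using per-object threading.Lock instances follows; Account, transfer, and run_opposing_transfers are hypothetical names used only for illustration. For the two-object case in C, CPython provides Py_BEGIN_CRITICAL_SECTION2.

```python
import threading


class Account:
    """An object that carries its own lock, mimicking per-object locking."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    """Move `amount` from src to dst, taking both locks in a consistent order."""
    if src is dst:
        return  # nothing to do, and locking the same Lock twice would deadlock
    # Ordering by id() means every thread acquires the two locks in the same order.
    first, second = (src, dst) if id(src) < id(dst) else (dst, src)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def run_opposing_transfers(a: Account, b: Account, rounds: int = 1000) -> None:
    """Run a->b and b->a transfers concurrently; naive ordering could deadlock here."""

    def a_to_b() -> None:
        for _ in range(rounds):
            transfer(a, b, 1)

    def b_to_a() -> None:
        for _ in range(rounds):
            transfer(b, a, 1)

    threads = [threading.Thread(target=a_to_b), threading.Thread(target=b_to_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    x, y = Account(1000), Account(1000)
    run_opposing_transfers(x, y)
    print(x.balance, y.balance)
```

If each thread instead locked src first and dst second, the two threads above could each hold one lock while waiting for the other.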

Understanding The GIL

The GIL Is Not Thread Safe!

The Global Interpreter Lock (GIL) is a mechanism used in CPython, the standard implementation of the Python programming language, to synchronize access to Python objects, preventing multiple native threads from executing Python bytecodes at once. This lock is necessary primarily because CPython’s memory management is not thread-safe.

Thread safety refers to the ability of a program or a piece of code to behave correctly when accessed by multiple threads. Achieving thread safety is crucial in multithreaded environments where threads share the same memory space and resources. The challenges in ensuring thread safety include preventing race conditions, deadlocks, and other concurrency-related issues.

The GIL impacts the execution of threads in Python by allowing only one thread to execute Python bytecodes at a time. This means that for CPU-bound threads (those that spend most of their time performing computations), the GIL can significantly limit the benefits of multithreading because it effectively serializes the execution of these threads. However, for I/O-bound threads (those that spend most of their time waiting on I/O operations like reading from a file or network), the GIL is released during the I/O operation, allowing other threads to run.

Despite its role in simplifying certain aspects of Python’s threading implementation, the GIL does not make Python code thread-safe. The GIL is released during certain operations like I/O, and even when it is held, operations that appear atomic can still be interrupted. For example, incrementing a counter (x += 1) is not atomic: it involves reading the current value, incrementing it, and writing it back. If x is a property, the read and the write also run arbitrary Python code, and what that code does may change as the program evolves. If multiple threads do this concurrently, the interpreter may switch threads between these steps, leading to a race condition.
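The separate read and write steps are visible in the bytecode. The sketch below uses the standard dis module to list the instructions for an attribute increment; Box, bump, and opnames are illustrative names. The exact opcode names differ between CPython versions, so no output is shown, but a distinct attribute load and attribute store appear in each version.

```python
import dis


class Box:
    def __init__(self) -> None:
        self.value = 0


def bump(box: Box) -> None:
    box.value += 1  # compiles to a load, an add, and a store


def opnames(func) -> list[str]:
    """Return the opcode names of a function, in execution order."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    # Print the full disassembly; look for the attribute load and store.
    dis.dis(bump)
```

A thread switch can occur between the load and the store, which is the window in which another thread's update gets overwritten.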

Here’s an example that demonstrates how the GIL does not prevent race conditions:

import threading

class Counter:
    """Contains a reference to a counted value"""
    def __init__(self) -> None:
        self._counted = 0

    @property
    def counted(self) -> int:
        return self._counted

    @counted.setter
    def counted(self, value: int) -> None:
        self._counted = value

def increment_counter(counter: Counter, num_times: int, barrier: threading.Barrier) -> None:
    """Increments the counter 'num_times' times after waiting on the barrier."""
    barrier.wait()
    for _ in range(num_times):
        counter.counted += 1

def main() -> None:
    """Runs multiple threads to increment a counter and checks for correctness."""

    num_threads = 10
    num_increments = 50000
    iterations = 0
    while True:
        counter = Counter()
        barrier = threading.Barrier(num_threads)

        threads = []
        for _ in range(num_threads):
            thread = threading.Thread(target=increment_counter, args=(counter, num_increments, barrier))
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join()

        expected = num_threads * num_increments
        print(f"{iterations}-> Expected: {expected} Actual: {counter.counted}", flush=True)
        if counter.counted != expected:
            return
        iterations += 1

if __name__ == "__main__":
    main()

Running this code, you’ll likely find that the actual count is less than the expected count due to the race condition in incrementing the counter. Different values for num_increments may or may not trigger this behaviour. Similarly, running the code on different machines may affect the results. So, code might work as though it is thread-safe with the GIL, but in reality there is no guarantee; code which works today might suddenly break tomorrow.

To achieve thread safety in Python, developers must use synchronization primitives like locks (threading.Lock), queues (queue.Queue), or other concurrency control mechanisms. For example, using a lock to protect the counter increment operation:

import threading

class Counter:
    """Contains a reference to a counted value"""
    def __init__(self) -> None:
        self._counted = 0

    @property
    def counted(self) -> int:
        return self._counted

    @counted.setter
    def counted(self, value: int) -> None:
        self._counted = value

def increment_counter(counter: Counter, num_times: int, barrier: threading.Barrier, lock: threading.Lock) -> None:
    """Increments the counter 'num_times' times after waiting on the barrier."""
    barrier.wait()
    for _ in range(num_times):
        # Putting the lock around the entire loop is more efficient.
        # Putting it here is a clearer demonstration of the concept.
        with lock:
            counter.counted += 1

def main() -> None:
    """Runs multiple threads to increment a counter and checks for correctness."""

    num_threads = 10
    num_increments = 50000
    iterations = 0
    lock = threading.Lock()
    # With the lock in place the counts always match, so this loop runs until interrupted (Ctrl+C).
    while True:
        counter = Counter()
        barrier = threading.Barrier(num_threads)

        threads = []
        for _ in range(num_threads):
            thread = threading.Thread(target=increment_counter, args=(counter, num_increments, barrier, lock))
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join()

        expected = num_threads * num_increments
        print(f"{iterations}-> Expected: {expected} Actual: {counter.counted}", flush=True)
        if counter.counted != expected:
            return
        iterations += 1

if __name__ == "__main__":
    main()

This version of the code ensures that the counter is incremented correctly, even with multiple threads.
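Taking the lock on every increment is correct but adds contention. One common alternative, sketched below under the assumption that only the final total matters, is to count in a thread-private variable and merge into the shared total once per thread. SharedTotal, add_in_batch, and run_batched are hypothetical names for this illustration.

```python
import threading


class SharedTotal:
    """A shared integer guarded by its own lock."""

    def __init__(self) -> None:
        self.value = 0
        self.lock = threading.Lock()


def add_in_batch(total: SharedTotal, num_times: int, barrier: threading.Barrier) -> None:
    """Count locally, then publish the result with a single locked update."""
    barrier.wait()
    local = 0
    for _ in range(num_times):
        local += 1  # thread-private, so no synchronization is needed
    with total.lock:
        total.value += local  # one short critical section per thread


def run_batched(num_threads: int = 10, num_increments: int = 50000) -> int:
    total = SharedTotal()
    barrier = threading.Barrier(num_threads)
    threads = [
        threading.Thread(target=add_in_batch, args=(total, num_increments, barrier))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total.value


if __name__ == "__main__":
    print(run_batched())
```

The trade-off is visibility: other threads see the updated total only after each worker finishes, so this suits aggregate counts rather than live progress values.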

So, while the GIL simplifies certain aspects of Python’s threading by preventing multiple threads from executing Python bytecodes simultaneously, it does not make Python code inherently thread-safe. Developers must still use proper synchronisation techniques to protect shared resources and prevent concurrency-related issues.

Impact Of Removing The GIL

Logically, none. Any Python code that is actually thread-safe under GIL-based Python will still be thread-safe under FTP. However, there may be code that appears to work (because the race conditions go unnoticed) but is not actually thread-safe, and it will fail more often with free threading. In this case one could argue the GIL is a risk, as it hides race conditions which can then bite developers when they least expect it.

Legacy Python (with the GIL) and Priority Inversion

What Happens with Thread Priorities and the GIL

In GIL-enabled CPython (every release before 3.13, and the default build since), the Global Interpreter Lock (GIL) is the central mechanism ensuring only one thread runs Python bytecode at a time. Thread scheduling is delegated to the OS, but the GIL acts as an interpreter-level override: whichever thread the OS schedules, it cannot execute Python bytecode until it holds the GIL.

The GIL is released periodically based on:

The next thread to acquire the GIL is essentially chosen by the OS, but not based on Python-level priority because:

Result: Priority Inversion

A high-priority thread (e.g., real-time audio or control loop) can be blocked by lower-priority threads that happen to acquire the GIL. Worse, if those lower-priority threads are preempted or starved by the OS scheduler, they may hold the GIL but not make progress, delaying everyone.

This is classical priority inversion: a low-priority thread prevents a high-priority one from proceeding due to locking mechanics.

No Python-Level Control

Python does not allow user-level control over GIL scheduling, including:

There are no hooks to influence which thread gets the GIL next, beyond blocking in native code or using time.sleep() as a crude yield.
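The one interpreter-wide knob that does exist is the switch interval, which controls how long a thread may run before it is asked to release the GIL. It changes how often handoffs are requested, not which thread runs next, so it does not address priority. A minimal sketch using sys.getswitchinterval() and sys.setswitchinterval() follows; the context manager name switch_interval is illustrative.

```python
import contextlib
import sys


@contextlib.contextmanager
def switch_interval(seconds: float):
    """Temporarily change the interpreter's thread switch interval."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield previous
    finally:
        # Always restore the earlier setting, even if the body raises.
        sys.setswitchinterval(previous)


if __name__ == "__main__":
    print("current interval:", sys.getswitchinterval())
    with switch_interval(0.001) as old:
        print("inside block:", sys.getswitchinterval(), "previous:", old)
    print("restored:", sys.getswitchinterval())
```

A shorter interval makes the running thread offer the GIL more often, at the cost of more switching overhead; the operating system still decides which waiting thread picks it up.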

Implications for Real-Time or Low-Latency Systems

In GIL-locked Python:

Priority Inversion and GIL Pathology

A practical illustration: imagine a ‘golden’ thread feeding tensors to the GPU. It’s on the critical path for inference latency. Meanwhile, a logger thread is periodically flushing buffered output.

With the GIL:

In FTP:

This is not just an academic benefit. It’s how you make Python viable in systems with mixed criticality.

Free-Threaded Python Fixes Priority Inversion

In Free Threaded Python (FTP):

This makes FTP viable for:

Multiprocessing vs Fine-Grained Concurrency

Why multiprocessing Isn’t Fine-Grained

Python’s multiprocessing module sidesteps the GIL via process isolation:

While good for CPU-bound parallelism, it is unsuitable for fine-grained control because:

If you want to build concurrent in-memory data structures, multiprocessing doesn’t help.

What Multiprocessing is Good For

Why Multiprocessing Is Not Fine-Grained Concurrency

  1. Heavyweight Process Model
    • Each Python process is fully independent.
    • Spawning a process is expensive (in time and memory).
    • Inter-process communication (IPC) is slower than shared memory due to serialization (pickle) and OS overhead.
  2. No Shared Python Objects
    • Unlike threads, processes do not share memory.
    • Each process has its own copy of objects unless explicitly shared using multiprocessing.Manager, Queue, Pipe, or SharedMemory.
  3. Synchronization Is Limited and Coarse
    • Locks, Semaphores, Events: These are available in multiprocessing, but they operate via OS primitives, not per-object fine-grained locking.
    • SharedMemory (3.8+) allows faster shared access for numpy arrays or raw bytes, but requires manual memory layout and synchronization.
    • You cannot protect arbitrary Python data structures with a mutex between processes—they’re in different address spaces.
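The effect of serialization described above can be seen without starting any processes. The sketch below round-trips an object through pickle, which is how payloads sent over multiprocessing queues and pipes are serialized, and shows that the receiver gets an independent copy. The function names send_through_ipc and demo_copy_semantics are hypothetical.

```python
import pickle


def send_through_ipc(obj):
    """Mimic what happens to a payload sent between processes: pickle, then unpickle."""
    return pickle.loads(pickle.dumps(obj))


def demo_copy_semantics():
    """Return (original, received) after mutating only the received copy."""
    original = {"weights": [1, 2, 3]}
    received = send_through_ipc(original)
    received["weights"].append(4)  # this change never reaches the sender
    return original, received


if __name__ == "__main__":
    sent, got = demo_copy_semantics()
    print("sender still has:", sent)
    print("receiver has:", got)
```

Because every exchange is a copy, keeping two processes' views of a data structure in agreement requires explicit messages, which is what makes the model coarse-grained.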

Available Tools in Multiprocessing for Coordination

| Tool Type | Granularity | Notes |
| --- | --- | --- |
| Lock, Semaphore | Coarse | OS-based; useful for shared counters or critical sections |
| Queue, Pipe | Coarse | Good for message passing; serialization adds latency |
| Manager.Value/List | Very coarse | Slower, proxied objects using a background server thread |
| shared_memory | Byte-level | Requires manual synchronization, useful for arrays |

multiprocessing offers concurrency tools, but not fine-grained concurrency as you’d find in threading or other utilities. It excels at task-level parallelism across CPUs. It lacks low-latency, lock-free primitives and is not designed for concurrent manipulation of shared Python objects.

Combining Multiprocessing with Fine-Grained Tools

If you must have shared state with fine-grained control, you can either:

Native Code + Threads in Legacy Python (GIL-enabled)

How it Works

Native C/C++ extensions can release the GIL using Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS. This allows true parallel execution but only for code that does not touch Python objects.

When it Fails

The moment a native thread needs to:

it must re-acquire the GIL. This serializes the execution.

Why This Is Not Fine-Grained Parallelism

| Aspect | Limitation |
| --- | --- |
| Granularity | Cannot safely manipulate Python data structures without taking the GIL. So most real-world work is GIL-bound. |
| Interleaving | No interleaving of native + Python logic per-thread without GIL churn. |
| Scalability | Parallel work is only scalable if it’s fully outside Python (e.g., pure math, IO, or C++ workloads). |
| Memory Access | Python’s memory model is not thread-safe without the GIL. You can’t update Python containers from two native threads safely. |
| Design Overhead | You need to segment your application logic into “GIL-free” vs “GIL-held” regions. That’s brittle and complex. |

While native threads in legacy Python allow some parallelism, they are not a general-purpose, fine-grained model for concurrency. It’s more like a bolt-on escape hatch for specific use cases (e.g., I/O libraries, compute-heavy extensions).
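A familiar instance of this escape hatch is hashlib, whose documentation states that the GIL is released while hashing more than 2047 bytes of data at once. The sketch below hashes several large buffers from a thread pool; digest and digest_all are illustrative names, and no timing claims are made since speedups depend on the machine and the build.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(chunk: bytes) -> str:
    """Hash one buffer; the heavy work happens in C with the GIL released."""
    return hashlib.sha256(chunk).hexdigest()


def digest_all(chunks: list[bytes], max_workers: int = 4) -> list[str]:
    """Hash many buffers concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(digest, chunks))


if __name__ == "__main__":
    data = [bytes([i]) * 1_000_000 for i in range(8)]
    for hex_digest in digest_all(data):
        print(hex_digest[:16])
```

The pattern works because each task receives an immutable bytes object and returns a new string; as soon as the threads needed to update a shared Python container, they would be back to contending for the GIL.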

Cython nogil is exactly the same thing

Cython supports nogil blocks, which are often suggested as a parallelism workaround:

cdef void do_work() nogil:
    # C code only: no Python objects may be touched here
    pass

However:

This makes it useful for compute kernels, not general Python concurrency.

Closing Thoughts

Most concurrency libraries try to protect you. FTP with ft_utils does something different: it gives you the control you need to design for correctness, rather than depending on global serialization as a crutch. It also makes key things easy to get right and provides library support for scalability and inter-thread communication. For example, a program in a lower-level language like C may crash if something is not thread-safe; FTP will not crash. It might give an unexpected result, but it keeps on trucking. When you hit issues, ft_utils can provide more sophisticated synchronisation such as readers/writer locks, atomics, and ConcurrentDict to tidy up thread correctness without a big performance hit.

Until now, Python has never been suitable for finely tuned concurrent systems. With FTP and ft_utils, that changes. You get the primitives; now you decide how to build with them.

If you’re working on systems where performance, determinism, or mixed-criticality scheduling matter, this is finally a Python that respects your intent.