dispenso 1.6.0
A library for task parallelism
Loading...
Searching...
No Matches
Getting Started

This guide walks through the core features of dispenso with working examples. Each section includes a complete, compilable example that you can build and run.

Coming from another library? Jump directly to the migration guide: Migrating from TBB | Migrating from OpenMP

Installation

See the README for installation instructions. Dispenso requires C++14 and CMake 3.12+.

To build the examples:

mkdir build && cd build
cmake .. -DDISPENSO_BUILD_EXAMPLES=ON
make

Basic Concepts

Thread Pools

At the heart of dispenso is the ThreadPool. A thread pool manages a set of worker threads that execute tasks. You can use the global thread pool or create your own:

// Use the global thread pool (recommended for most cases)
dispenso::ThreadPool& pool = dispenso::globalThreadPool();
// Or create a custom pool with a specific number of threads
dispenso::ThreadPool myPool(4); // 4 worker threads

Note: globalThreadPool() defaults to std::thread::hardware_concurrency() - 1 worker threads, since the calling thread typically participates in computation. Use dispenso::resizeGlobalThreadPool(n) to change it.

Task Sets

A TaskSet groups related tasks and provides a way to wait for their completion:

dispenso::TaskSet taskSet(dispenso::globalThreadPool());
taskSet.schedule([]() { /* task 1 */ });
taskSet.schedule([]() { /* task 2 */ });
taskSet.wait(); // Block until all tasks complete

Your First Parallel Loop

The simplest way to parallelize work is with parallel_for. It distributes loop iterations across available threads.

Simple per-element parallel loop:

// Process each element independently in parallel
dispenso::parallel_for(0, kArraySize, [&](size_t i) { output[i] = std::sqrt(input[i]); });

Reduction with per-thread state:

std::vector<double> partialSums;
dispenso::parallel_for(
partialSums,
[]() { return 0.0; }, // State initializer
size_t{0},
kArraySize,
[&](double& localSum, size_t start, size_t end) {
for (size_t i = start; i < end; ++i) {
localSum += input[i];
}
});
// Combine partial sums
double totalSum = 0.0;
for (double partial : partialSums) {
totalSum += partial;
}

See full example.

Key points:

  • Use the simple form for independent per-element work
  • Use chunked ranges when you want to control work distribution
  • Per-thread state enables efficient reductions
  • Options let you control parallelism and chunking strategy

Parallel Iteration with for_each

When you have a container rather than an index range, use for_each:

Parallel for_each on a vector:

std::vector<double> values = {1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0};
// Apply square root to each element in parallel
dispenso::for_each(values.begin(), values.end(), [](double& val) { val = std::sqrt(val); });

for_each_n with explicit count:

std::vector<int> partial = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
// Only process first 5 elements
dispenso::for_each_n(partial.begin(), 5, [](int& n) { n += 100; });

See full example.

Key points:

  • Works with any iterator type (including non-random-access iterators)
  • for_each_n takes an explicit count
  • Pass a TaskSet for external synchronization control

Working with Tasks

For more complex task patterns, use TaskSet and ConcurrentTaskSet directly:

Basic TaskSet:

dispenso::TaskSet taskSet(dispenso::globalThreadPool());
std::atomic<int> counter(0);
for (int i = 0; i < 10; ++i) {
taskSet.schedule([&counter, i]() { counter.fetch_add(i, std::memory_order_relaxed); });
}
taskSet.wait();

ConcurrentTaskSet with nested scheduling:

dispenso::ConcurrentTaskSet taskSet(dispenso::globalThreadPool());
std::atomic<int> total(0);
for (int i = 0; i < 5; ++i) {
taskSet.schedule([&taskSet, &total, i]() {
// Each task schedules two sub-tasks
for (int j = 0; j < 2; ++j) {
taskSet.schedule(
[&total, i, j]() { total.fetch_add(i * 10 + j, std::memory_order_relaxed); });
}
});
}
taskSet.wait();

See full example.

Key points:

  • TaskSet is for single-threaded scheduling
  • ConcurrentTaskSet allows scheduling from multiple threads
  • Both support cancellation for cooperative early termination
  • The destructor waits for all tasks to complete

Futures for Async Results

When you need return values from async operations, use Future:

Basic async and get:

dispenso::Future<int> future = dispenso::async([]() {
int result = 0;
for (int i = 1; i <= 100; ++i) {
result += i;
}
return result;
});
int result = future.get(); // blocks until ready
const Result & get() const
Definition future.h:221

Chaining with then():

dispenso::Future<double> chainedFuture = dispenso::async([]() {
return 16.0;
})
.then([](dispenso::Future<double>&& prev) {
return std::sqrt(prev.get());
})
.then([](dispenso::Future<double>&& prev) {
return prev.get() * 2.0;
});

when_all for multiple futures:

dispenso::Future<int> f1 = dispenso::async([]() { return 10; });
dispenso::Future<int> f2 = dispenso::async([]() { return 20; });
dispenso::Future<int> f3 = dispenso::async([]() { return 30; });
auto allFutures = dispenso::when_all(std::move(f1), std::move(f2), std::move(f3));
auto tuple = allFutures.get();
int sum = std::get<0>(tuple).get() + std::get<1>(tuple).get() + std::get<2>(tuple).get();
Future< detail::ResultOf< F, Args... > > async(std::launch policy, F &&f, Args &&... args)
Definition future.h:453
Future< std::vector< typename std::iterator_traits< InputIt >::value_type > > when_all(InputIt first, InputIt last)

See full example.

Key points:

  • async() launches work and returns a Future
  • then() chains dependent computations
  • when_all() waits for multiple futures
  • make_ready_future() creates an already-completed future

Task Graphs

For complex dependency patterns, build a task graph:

Diamond dependency pattern:

#include <dispenso/graph.h>
// A
// / \
// B C
// \ /
// D
dispenso::Node& A = graph.addNode([&]() { r[0] = 1.0f; });
dispenso::Node& B = graph.addNode([&]() { r[1] = r[0] * 2.0f; });
dispenso::Node& C = graph.addNode([&]() { r[2] = r[0] + 5.0f; });
dispenso::Node& D = graph.addNode([&]() { r[3] = r[1] + r[2]; });
B.dependsOn(A);
C.dependsOn(A);
D.dependsOn(B, C);
setAllNodesIncomplete(graph);
dispenso::ConcurrentTaskSet taskSet(dispenso::globalThreadPool());
executor(taskSet, graph);
N & addNode(T &&f)
Definition graph.h:622
void dependsOn(Ns &... nodes)
Definition graph.h:278

See full example.

Key points:

  • Use dependsOn() to specify prerequisites
  • Multiple executors available: single-thread, parallel_for, ConcurrentTaskSet
  • Graphs can be re-executed after calling setAllNodesIncomplete()
  • Subgraphs help organize large graphs

Pipelines

For streaming data through stages, use pipelines:

3-stage pipeline (generator -> transform -> sink):

std::vector<int> results;
int counter = 0;
dispenso::pipeline(
// Stage 1: Generator - produces values
[&counter]() -> dispenso::OpResult<int> {
if (counter >= 10) {
return {}; // Empty result signals end of input
}
return counter++;
},
// Stage 2: Transform - squares the value
[](int value) { return value * value; },
// Stage 3: Sink - collects results
[&results](int value) { results.push_back(value); });

See full example.

Key points:

  • Generator stage produces values (returns OpResult<T> or std::optional<T>)
  • Transform stages process values (can filter by returning empty result)
  • Sink stage consumes final values
  • Use stage() with a limit for parallel stages

Parallel Invoke

For launching a fixed set of heterogeneous tasks in parallel (not a loop), use parallel_invoke. It's the classic fork-join idiom: submit siblings to the pool, run the last one inline on the calling thread.

Fork-join with parallel_invoke:

#include <dispenso/parallel_invoke.h>
dispenso::ConcurrentTaskSet tasks(dispenso::globalThreadPool());
double a = 0.0, b = 0.0, c = 0.0;
dispenso::parallel_invoke(
tasks,
[&]() { a = expensiveComputation1(); },
[&]() { b = expensiveComputation2(); },
[&]() { c = expensiveComputation3(); } // runs inline
);
tasks.wait();
double result = a + b + c;

See full example.

Key points:

  • The last functor runs inline on the calling thread
  • Does not call wait() — the caller drives synchronization
  • Composes naturally with recursive divide-and-conquer algorithms
  • No type erasure or allocation overhead

CPU Affinity and Topology

CpuSet provides portable CPU affinity and NUMA topology detection:

Query topology and pin threads:

// Query available CPUs
int32_t available = dispenso::CpuSet::availableCount();
printf("Available CPUs: %d\n", available);
// Get cache-sharing groups for cache-aware scheduling
const auto& l3Groups = dispenso::CpuSet::l3CacheGroups();
for (const auto& group : l3Groups) {
printf("L3 cache group (id %d): %zu CPUs\n", group.cacheId, group.cpus.size());
}
// Pin the current thread to a specific CPU
pinSet.add(0);
A set of CPU IDs for affinity manipulation and topology queries.
Definition cpu_set.h:139
static DISPENSO_DLL_ACCESS const std::vector< CacheGroup > & l3CacheGroups()
Returns the L3 cache sharing groups.
DISPENSO_DLL_ACCESS void add(int32_t hardwareThread)
Adds a single CPU to the set.
static DISPENSO_DLL_ACCESS int32_t availableCount()
Returns the number of hardware threads available to this process.
DISPENSO_DLL_ACCESS bool bindCurrentThread() const
Binds the calling thread to the CPUs in this set.

See full example.

Key points:

  • Works on Linux (full support), Windows (full support), macOS (topology only)
  • CpuSet::l2CacheGroups() / l3CacheGroups() return CPU groups that share cache levels
  • Use for NUMA-aware thread placement and cache-friendly scheduling

Thread-Safe Containers

ConcurrentVector

A vector that supports concurrent push_back and growth:

Concurrent push_back from multiple threads:

dispenso::parallel_for(0, 1000, [&vec](size_t i) { vec.push_back(static_cast<int>(i)); });
iterator push_back(const T &val)

Iterator stability during concurrent modification:

vec.push_back(1);
vec.push_back(2);
vec.push_back(3);
auto it = vec.begin();
int& firstElement = *it;
// Push more elements concurrently
dispenso::parallel_for(0, 100, [&vec](size_t i) { vec.push_back(static_cast<int>(i + 100)); });
// Original iterator and reference are still valid
assert(*it == 1);
assert(firstElement == 1);

See full example.

Key points:

  • Iterators and references remain stable during growth
  • Use grow_by() for efficient batch insertion
  • Reserve capacity upfront when size is known
  • Not all operations are concurrent-safe (see docs)

SmallVector

A vector-like container with inline storage for small sizes, avoiding heap allocation when element count stays below a compile-time threshold:

// Store up to 8 elements inline (on the stack)
for (int i = 0; i < 6; ++i) {
vec.push_back(i); // No heap allocation — fits inline
}
vec.push_back(100); // Still inline
vec.push_back(200); // Still inline (8 elements)
vec.push_back(300); // Transitions to heap — transparent to the caller
void push_back(const T &value)

See full example.

Key points:

  • Template parameter N controls inline capacity (default 4)
  • Transparent fallback to heap when size exceeds N
  • Once on heap, stays on heap until clear() — preserves reserve() guarantees
  • Same API as std::vector (iterators, push_back, operator[], etc.)

ChaseLevDeque

A lock-free single-producer, multi-consumer work-stealing deque. The owner pushes and pops from the bottom; thieves steal from the top:

// Owner pushes work
for (int i = 0; i < 100; ++i) {
deque.try_push(i);
}
// Owner pops (LIFO — most recent first)
int val = 0;
bool ok = deque.try_pop(val);
// Thieves steal (FIFO — oldest first, from other threads)
int stolen = 0;
bool got = deque.try_steal(stolen);
A lock-free SPMC bounded work-stealing deque.
bool try_pop(T &out)
Pops the most recently pushed element (LIFO, owner-only).
bool try_steal(T &out)
Steals the oldest element from the top of the deque (FIFO).
bool try_push(const T &item)
Pushes an element onto the bottom of the deque (owner-only).

See full example.

Key points:

  • Lock-free and wait-free for the owner (push/pop)
  • steal() is lock-free for thieves
  • Dynamically resizable — grows as needed
  • Classic data structure for work-stealing schedulers

Synchronization Primitives

Latch

A one-shot barrier for thread synchronization:

count_down + wait pattern:

#include <dispenso/latch.h>
constexpr int kNumWorkers = 3;
dispenso::Latch workComplete(kNumWorkers);
std::vector<int> results(kNumWorkers, 0);
dispenso::ConcurrentTaskSet taskSet(dispenso::globalThreadPool());
for (int i = 0; i < kNumWorkers; ++i) {
taskSet.schedule([&workComplete, &results, i]() {
results[static_cast<size_t>(i)] = (i + 1) * 10;
workComplete.count_down(); // Signal work is done (non-blocking)
});
}
workComplete.wait(); // Main thread waits for all workers

See full example.

Key points:

  • arrive_and_wait() decrements and blocks
  • count_down() decrements without blocking
  • wait() blocks without decrementing
  • Cannot be reset (one-shot)

Resource Pooling

Manage expensive-to-create resources with ResourcePool:

Basic buffer pool with RAII:

// Create a pool of 4 buffers
dispenso::ResourcePool<Buffer> bufferPool(4, []() { return Buffer(); });
dispenso::parallel_for(0, 100, [&bufferPool](size_t i) {
// Acquire a resource from the pool (blocks if none available)
auto resource = bufferPool.acquire();
// Use the resource
resource.get().process(static_cast<int>(i));
// Resource automatically returned to pool when 'resource' goes out of scope
});

See full example.

Key points:

  • Resources automatically return to pool when RAII wrapper destructs
  • acquire() blocks if no resources available
  • Good for database connections, buffers, etc.
  • Can be used to limit concurrency

Next Steps

  • Browse the API Reference for complete documentation
  • Check out the tests for more usage examples
  • See the benchmarks for performance testing patterns