The Architecture & Science Behind the PPU
Years of research, rethinking parallel performance.
Flow’s architecture is grounded in more than 30 years of scientific research in parallel processing, memory systems, and processor design. This foundation was established by Flow’s CTO, Chief Architect, and Co-Founder Martti Forsell, PhD, through his academic and applied work at the University of Joensuu and VTT Technical Research Centre of Finland.
With over 160 scientific publications and contributions to key developments in the field, Forsell’s research directly informed the design of Flow’s Parallel Processing Unit (PPU), including its core architectural innovations: Emulated Shared Memory (ESM) and Thick Control Flow (TCF).
Research foundations
The foundational research covers models of parallel and sequential computation (e.g., PRAM, BSP), shared memory emulation, cache coherency alternatives, compiler techniques for automatic parallelization, and architectural feasibility studies across both hardware and systems integration. It includes simulation environments and empirical evaluation of interconnect designs, resource usage, and throughput scalability.
One of the cornerstone achievements, the Thick Control Flow architecture, was shown to reduce synchronization costs and enable near-linear scalability. The architecture is also designed for developer ease of use, eliminating the need for hardware-specific tuning and manual memory coordination.
Over the past few decades...
Moore’s Law has slowed [Moore65], and Dennard scaling has effectively ended [Dennard74]. Transistor miniaturization no longer guarantees faster or more power-efficient processors. As power density and heat dissipation have become limiting factors, the industry has shifted toward multicore and parallel computing.
However, most processor architectures have failed to scale performance efficiently with increasing core counts.

Semiconductor limitations
For decades, the semiconductor industry relied on two trends: Moore’s Law, which doubled transistor density every 18–24 months, and Dennard scaling, which reduced voltage and current to maintain power efficiency.
But by the mid-2000s, these trends began to break down due to physical and thermal constraints. Frequency scaling stalled, and adding more cores became the default strategy for performance gains [Vishkin14].
Architectural implications
Simply adding cores introduces new architectural challenges, especially around memory access, synchronization, and developer complexity. Traditional multicore designs rely on complex cache coherence mechanisms, are prone to latency bottlenecks, and often require manual tuning to achieve decent parallel performance.
As core counts rise, these issues result in diminishing returns. This is the origin of the parallel computing challenge that Flow’s PPU was designed to solve. The PPU tackles the architectural bottlenecks that emerged as semiconductor scaling slowed [Culler99].
Modern processors face three major bottlenecks
- Memory access inefficiencies
- High synchronization overhead
- Poor scalability as core counts increase
These challenges limit the effectiveness of multicore architectures and significantly increase the complexity for software developers [Culler99], [Vishkin14].
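To make the synchronization bottleneck concrete, here is a minimal sketch in standard C++ (std::thread and std::mutex; not Flow code) of a pattern traditional multicore software easily falls into: many threads funneling through a single lock. The thread and iteration counts are illustrative.

```cpp
// Lock contention on conventional multicore hardware: every increment
// serializes all threads through one mutex, so adding cores adds
// coordination cost instead of throughput.
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 16;              // illustrative core/thread count
    constexpr int kItersPerThread = 1'000'000;
    long long counter = 0;
    std::mutex m;

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < kItersPerThread; ++i) {
                std::lock_guard<std::mutex> lock(m);  // global serialization point
                ++counter;                            // trivial work; the lock dominates
            }
        });
    }
    for (auto& w : workers) w.join();
    std::printf("counter = %lld\n", counter);
}
```

Because every increment serializes through the mutex, raising kThreads adds coordination cost rather than throughput; past a few cores, wall-clock time typically gets worse, which is exactly the diminishing-returns behavior described above.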
SMP & NUMA implications
In Symmetric Multiprocessing (SMP) systems, all cores share a single memory space, which leads to contention and latency bottlenecks as the number of cores grows. Non-Uniform Memory Access (NUMA) architectures attempt to reduce these issues by localizing memory access, but introduce non-deterministic latencies and added software complexity.
Both SMP and NUMA rely on cache coherence protocols to maintain consistency across cores, mechanisms that become increasingly expensive and difficult to scale as systems grow.
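Coherence costs appear even without explicit locks. The sketch below (standard C++; it assumes a 64-byte cache line, which is common but hardware-dependent) contrasts per-thread counters packed onto shared cache lines with counters padded to their own lines. On typical SMP hardware the packed version runs markedly slower because the coherence protocol shuttles the shared lines between cores, an effect known as false sharing.

```cpp
// False sharing: logically independent writes still trigger coherence
// traffic when the written words share a cache line.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int  kThreads = 8;
constexpr long kIters   = 20'000'000;

// Counters packed side by side: several land on one 64-byte cache line.
std::atomic<long> packed[kThreads];

// One counter per cache line: no false sharing.
struct alignas(64) PaddedCounter { std::atomic<long> v{0}; };
PaddedCounter padded[kThreads];

template <typename F>
double timed_run(F body) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> ts;
    for (int t = 0; t < kThreads; ++t) ts.emplace_back(body, t);
    for (auto& th : ts) th.join();
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    double shared_s = timed_run([](int t) {
        for (long i = 0; i < kIters; ++i)
            packed[t].fetch_add(1, std::memory_order_relaxed);
    });
    double padded_s = timed_run([](int t) {
        for (long i = 0; i < kIters; ++i)
            padded[t].v.fetch_add(1, std::memory_order_relaxed);
    });
    std::printf("packed: %.3fs   padded: %.3fs\n", shared_s, padded_s);
}
```

The fix here (padding) costs memory and must be done by hand; it is one small example of the manual, hardware-specific tuning the text refers to.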
Flow’s architecture combines a traditional CPU with a novel PPU.
In this hybrid model, the CPU handles sequential execution, system control, and legacy code, while the PPU is optimized for high-performance, general-purpose parallel computing.
This division of labor enables the system to scale linearly with workload complexity, while significantly reducing synchronization overhead and software development effort [Forsell22].
[Figure: CPU-PPU architecture. A traditional CPU handling sequential execution is paired with a Parallel Processing Unit (PPU) for high-performance parallel execution. Both components access a unified memory system to simplify data handling and minimize explicit synchronization; Thick Control Flow (TCF) and Emulated Shared Memory (ESM) facilitate seamless execution and data exchange. Channels and topologies are not shown.]
What is TCF?
TCF is Flow’s instruction-level execution model designed to streamline parallel task execution. It eliminates many of the control-flow inefficiencies seen in traditional pipelines by executing groups of threads (fibers) in a tightly synchronized fashion.
TCF supports dynamic, nested parallelism and deterministic thread control, reducing instruction and synchronization overhead while improving scalability and developer predictability [Forsell22], [Forsell20].
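Flow’s actual TCF programming interface is not shown in this article, so the following is only a conceptual sketch in standard C++ (C++20 std::barrier) of the lockstep discipline described above: a group of fibers advances step by step, and every write from one step is visible to all fibers in the next. All names are illustrative.

```cpp
// Conceptual sketch of step-synchronous "fibers": each fiber runs the same
// step sequence, and a barrier makes every step's writes globally visible
// before the next step begins. This mimics TCF-style lockstep execution
// with ordinary threads; it is not Flow's API.
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int kFibers = 8;
    std::vector<int> a(kFibers), b(kFibers);
    for (int i = 0; i < kFibers; ++i) a[i] = i + 1;

    std::barrier sync(kFibers);   // one synchronization point for the group

    auto fiber = [&](int id) {
        // Step 1: each fiber writes its own slot.
        b[id] = a[id] * a[id];
        sync.arrive_and_wait();   // all step-1 writes now visible everywhere
        // Step 2: each fiber may safely read any slot written in step 1.
        a[id] = b[id] + b[(id + 1) % kFibers];
        sync.arrive_and_wait();
    };

    std::vector<std::thread> ts;
    for (int id = 0; id < kFibers; ++id) ts.emplace_back(fiber, id);
    for (auto& t : ts) t.join();

    for (int v : a) std::printf("%d ", v);
    std::printf("\n");
}
```

The article’s claim is that under TCF this step-synchronous behavior comes from the execution model itself, so the programmer writes neither the barrier object nor the join logic shown here [Forsell22], [Forsell20].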
What is ESM?
Traditional multicore systems rely on hardware-based cache coherence to maintain memory consistency across cores: an approach that limits scalability and adds complexity.
Flow replaces this with Emulated Shared Memory (ESM): a virtualized memory model that enables synchronized parallel access without requiring hardware coherence protocols.
ESM supports latency hiding, high memory bandwidth, and simplified programming, eliminating the need for manual memory management or synchronization primitives [Forsell22], [Forsell20].
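ESM’s lineage is the PRAM model cited in the research foundations above. As a hypothetical illustration of the programming style a synchronous shared-memory abstraction enables, the sketch below writes a logarithmic-step parallel sum with plain loads and stores: no locks, no atomics. The step-synchronous semantics are emulated here with an explicit barrier; under a machine-provided model like ESM, that step boundary would be implicit.

```cpp
// PRAM-style parallel sum: in a step-synchronous shared-memory model,
// each doubling step reads values written in the previous step, so the
// whole reduction needs no locks, atomics, or per-pair handshakes.
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int kN = 16;                       // illustrative fiber/element count
    std::vector<long> a(kN);
    for (int i = 0; i < kN; ++i) a[i] = i + 1;   // sum of 1..16 is 136

    std::barrier step(kN);        // stands in for the model's implicit step boundary

    auto fiber = [&](int i) {
        for (int stride = 1; stride < kN; stride *= 2) {
            if (i % (2 * stride) == 0)   // plain reads/writes, disjoint per step
                a[i] += a[i + stride];
            step.arrive_and_wait();      // all writes of this step become visible
        }
    };

    std::vector<std::thread> ts;
    for (int i = 0; i < kN; ++i) ts.emplace_back(fiber, i);
    for (auto& t : ts) t.join();
    std::printf("sum = %ld\n", a[0]);    // prints 136
}
```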
In real-world benchmarks, Flow’s F256 configuration significantly outperformed Apple’s M1 Max, achieving up to 200× higher performance in select parallel workloads.
These results highlight Flow’s architectural advantages in compute efficiency, memory access performance, and synchronization overhead reduction.
Flow’s near-linear scalability, bandwidth optimization, and simplified programming model make it a strong alternative to traditional multicore systems [Forsell22], [Forsell23].
Benchmark methodology
Benchmarks included representative compute- and memory-intensive workloads such as matrix addition, memory-bound access patterns, and synchronization routines.
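Matrix addition, the first workload listed, is embarrassingly parallel and memory-bound, so it stresses memory bandwidth rather than synchronization. A plain C++ rendering of that kind of kernel (sizes, thread count, and function names are illustrative; this is not the published benchmark code):

```cpp
// Threaded matrix addition: each thread adds a disjoint band of rows.
// The kernel is memory-bound, so scaling depends mostly on how much
// memory bandwidth the architecture actually delivers to the cores.
#include <thread>
#include <vector>

void matrix_add(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int rows, int cols, int nthreads) {
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t) {
        ts.emplace_back([&, t] {
            int r0 = rows * t / nthreads;        // this thread's band of rows
            int r1 = rows * (t + 1) / nthreads;
            for (int r = r0; r < r1; ++r)
                for (int c = 0; c < cols; ++c)
                    C[r * cols + c] = A[r * cols + c] + B[r * cols + c];
        });
    }
    for (auto& th : ts) th.join();
}

int main() {
    const int rows = 2048, cols = 2048, nthreads = 8;   // illustrative sizes
    std::vector<float> A(rows * cols, 1.0f), B(rows * cols, 2.0f), C(rows * cols);
    matrix_add(A, B, C, rows, cols, nthreads);
    return C[0] == 3.0f ? 0 : 1;
}
```

Each output element is independent, so the only synchronization is the final join; observed scaling therefore tracks deliverable memory bandwidth, which is what makes this kernel useful for comparing architectures.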

Linear scaling with PPU cores outpaces traditional CPUs
Flow's architecture scaled nearly linearly with increased PPU core counts, while the M1 Max showed diminishing returns as thread counts increased [Forsell23].
Innovative architecture unlocks high-efficiency performance gains
Performance gains are driven by architectural innovations such as bandwidth scalability, latency hiding, and low-overhead synchronization, enabled by Flow’s TCF and ESM models.
Benchmark integrity ensured through fair and consistent testing
To ensure fair comparisons, all tests were run using comparable compilers and parallelization methods [Forsell23].
Code length & engineering productivity
Flow-based implementations required 50–85% fewer lines of active code than comparable Pthreads programs on Apple’s M1 Max [Forsell23].
This translates to measurable improvements in engineering efficiency, reduced development effort, and lower cognitive burden when delivering high-performance parallel software.

Flow achieves 50–85% code reduction, improving productivity and maintainability.
In benchmarked workloads, Flow’s implicit handling of synchronization and memory coordination enabled high performance with less code and fewer tuning cycles.
Flow’s programming model streamlines parallelization and allows teams to move faster without sacrificing performance [Forsell23].
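The article does not reproduce the benchmarked source on either side, so the sketch below illustrates only where such line-count gaps generally come from: (a) an explicit Pthreads kernel spends most of its lines on argument structs, partitioning, and create/join plumbing, while (b) an implicitly parallel formulation, approximated here with C++17 parallel algorithms, states just the computation. Function names are illustrative; neither version is Flow code.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <pthread.h>
#include <vector>

// (a) Explicit threading: most lines are plumbing, not computation.
struct Range { const float *a, *b; float *c; size_t lo, hi; };

static void* add_range(void* p) {
    auto* r = static_cast<Range*>(p);
    for (size_t i = r->lo; i < r->hi; ++i) r->c[i] = r->a[i] + r->b[i];
    return nullptr;
}

void add_explicit(const float* a, const float* b, float* c, size_t n, int nt) {
    std::vector<pthread_t> tid(nt);
    std::vector<Range> arg(nt);
    for (int t = 0; t < nt; ++t) {
        arg[t] = {a, b, c, n * t / nt, n * (t + 1) / nt};
        pthread_create(&tid[t], nullptr, add_range, &arg[t]);
    }
    for (int t = 0; t < nt; ++t) pthread_join(tid[t], nullptr);
}

// (b) Implicitly parallel: decomposition, scheduling, and joining are the
// runtime's job; only the arithmetic is written by hand.
void add_implicit(const float* a, const float* b, float* c, size_t n) {
    std::transform(std::execution::par_unseq, a, a + n, b, c,
                   [](float x, float y) { return x + y; });
}
```

(On GCC/libstdc++ the parallel policy in (b) typically requires linking TBB.) The 50–85% reduction reported in [Forsell23] is Flow’s measurement against Pthreads specifically; the sketch’s only point is that such savings come from removing the plumbing in (a), not from shortening the arithmetic.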
Real-world use cases
Flow has been applied in compute-heavy domains such as image processing, matrix operations, and real-time signal pipelines.
Teams report improved onboarding, minimal architectural tuning, and intuitive performance scaling compared to legacy multicore systems.
Engineering effort comparisons
Legacy multicore systems often require manual synchronization, explicit memory management, and significant tuning to achieve target performance.
Flow reduces this burden, enabling faster iteration, smaller teams, and more predictable scaling [Forsell22], [Forsell23].
Near-linear scaling in parallel workloads
Flow breaks through the scaling limitations of SMP and NUMA-based architectures [Forsell22], [Culler99].


Efficiency through simplified architecture
Simplified synchronization and memory handling reduce energy consumption and eliminate the need for complex cache coherency mechanisms [Forsell22], [Forsell23].
Developer productivity gains
Fewer lines of code, shorter onboarding, and no low-level tuning requirements allow smaller teams to build and scale faster [Forsell23].
Broad compatibility and scalability
Flow is designed to integrate with existing systems and scale from embedded devices to cloud platforms without architectural rework [Forsell22].
References
Curious to dive deeper into the research behind Flow’s PPU?
About the author
Martti Forsell, PhD, is the CTO and Chief Architect of Flow Computing, which he co-founded with Jussi Roivainen and Timo Valtonen.
Before Flow
Forsell served as a Researcher, Assistant, Senior Assistant, and Interim Professor at the University of Joensuu, Finland, as well as a Principal Scientist at VTT Technical Research Centre of Finland, where he focused on advanced computing architectures and parallel processing.
He is the inventor of the Parallel Processing Unit (PPU), a groundbreaking technology designed to enhance CPU performance by up to 100 times. With 160 scientific publications, more than 150 presentations, and more than 2,700 citations, Dr. Forsell has significantly contributed to the fields of computer architecture and parallel computing. His expertise and innovative work continue to drive advancements in high-performance computing.
Curious about Flow’s performance? Let’s Talk.
Flow’s PPU architecture is redefining performance, scalability, and simplicity in high-performance computing. We've benchmarked our performance against leading processors in both categories:
- Consumer CPUs – including Apple M-series (M1, M1 Max, M4 Max), Intel Core Ultra 7 258V & Qualcomm Snapdragon X Elite X1E-80-100
- Server CPUs – including Google Axion & Intel Xeon Platinum 8581C
Want to see the detailed results? Fill out the form to request a performance comparison.
Let’s start a conversation.