Interested in discovering further details about the Parallel Processing Unit (PPU) and Flow?
Our FAQ provides plenty of information about the Parallel Processing Unit and Flow
If you are curious about the Parallel Processing Unit (PPU) or Flow in general, check out our extensive FAQ. It includes more information, for example, about the benefits of the Parallel Processing Unit (PPU) and its key design principles and characteristics. There is also more background on the company and its business.
Please ping us at info@flow-computing.com if there is something you would like to know that is not covered in our FAQ.
What problem is Flow trying to solve?
We believe that there have been only incremental improvements in CPU performance during recent decades. In our opinion, this has led to a situation where the CPU has become the weakest link in computing, because its sequential architecture is suboptimal for modern workloads. A new era in CPU performance has become a necessity to meet the continuously increasing demand for more computing performance, driven to a large extent by the needs of AI, edge and cloud computing. Flow intends to lead this revolution through its radically new Parallel Processing Unit (PPU) architecture, enabling up to a 100X performance boost in parallel functionalities. The PPU can be applied to any CPU architecture while maintaining full backwards software compatibility.
What core benefits does Flow offer over its competitors?
Flow’s architecture is referred to as a Parallel Processing Unit (PPU). It boosts CPU performance up to 100-fold to enable a totally new era of “SuperCPUs”. These SuperCPUs are fully backwards compatible with existing software and applications - the parallel functionality in legacy software and applications can be accelerated by recompiling them for the PPU even without any code changes. The more parallel functionality there is, the greater the resulting performance boost. Our technology is also complementary in nature - while it boosts the CPU, all other connected units (such as matrix units, vector units, NPUs and GPUs) will indirectly benefit from the performance of the PPU, and get a boost from the more capable CPU.
Can a GPU benefit from PPU architecture?
GPUs will most definitely benefit (indirectly) from PPU usage in the CPU. Because the PPU greatly improves CPU performance, most computations shared between the CPU and GPU benefit from this improvement as well. The overall performance of CPU + GPU configurations will therefore improve significantly with the PPU.
What types of devices benefit most from Flow’s PPU architecture?
All devices that require high-performance CPUs will hugely benefit from Flow’s PPU licensing.
To highlight the hardware advantages of Flow over traditional SMP/NUMA CPU (and GPU) computing, let’s take a closer look at several key differences between them:
A. Nonexistent cache coherence issues. Unlike in current CPU systems, in Flow’s architecture there are no cache coherence issues in the PPU’s memory system, because the memory organization places no caches at the front of the intercommunication network.
B. Cost-efficient synchronization. Flow’s synchronization cost is roughly 1/Tc (where Tc is the number of fibers per PPU core), whereas in SMP/NUMA CPU systems it can range from hundreds to thousands of clock cycles, and in GPUs from thousands to hundreds of thousands of clock cycles; in other words, synchronization inside the PPU is almost cost-free.
C. Support for parallel computing primitives. Flow’s architecture provides unique and specific techniques/solutions for executing concurrent memory access operations (both read and write), multi-operations for executing reductions, multi-prefix operations, compute-update operations, and fiber mapping operations in the most efficient manner possible. These primitives are not available in current CPUs. Implementing them in the PPU involves active memory technologies at the SRAM-based on-chip shared cache level, potentially providing better performance and greater flexibility than current DRAM-based processing-in-memory (PIM) solutions, while still allowing both techniques to be used in the same design.
D. Flexible threading/fibering scheme. Flow Computing’s technology allows for an unbounded number of fibers at the model level, which can also be supported in hardware (within certain bandwidth constraints). In current-generation CPUs, the number of threads is, in theory, unbounded, but if the number of hardware threads is exceeded and there are interdependencies, performance can drop below that of a single-threaded version. In addition, operating systems typically limit the number of threads to a few thousand at most. The mapping of fibers to backend units is a programmable function, allowing further performance improvements with a PPU.
E. Low-level parallelism for dependent operations. In PPU-enabled CPUs, it is possible to execute dependent operations with full utilization within a step (enabled by the chaining of functional units), whereas in current CPUs, the operations executed in parallel must be independent due to the parallel organization of their functional units.
F. Non-existent context switching costs. In PPU-enabled CPUs, fiber switching has zero cost, whereas in current CPUs, context switching typically takes over 100 clock cycles.
G. Intercommunication traffic congestion avoidance. In PPU-enabled CPUs, the probability of intercommunication network traffic congestion is low due to hardware support for hashing, concurrent memory access, multi-operations and multi-prefix operations. In current CPUs, congestion can occur frequently if access patterns are non-trivial.
H. Scalable latency tolerance. Flow’s technology provides a scalable latency-hiding mechanism for constellations of up to thousands of cores, whereas in current CPUs, latency tolerance (with snooping/directory-based cache coherence maintenance mechanisms) appears to scale poorly. There is also evidence that in high-end PPU-enabled systems, even the latency of DRAM-based memory systems can often be hidden with suitable memory organization and sufficient bandwidth.
I. No need for locality-maximizing memory data partitioning. PPU-enabled CPUs work well with all kinds of memory partitioning schemes, whereas the performance of current CPUs is highly sensitive to schemes not maximizing the locality of references.
J. Sufficient intercommunication bandwidth. Flow’s PPU has an intercommunication network designed to provide sufficient bandwidth for random communication, whereas current CPUs perform well only in cases where most references are local. This kind of locality maximization is not always possible, as there is no general algorithm for maximizing locality.
K. Dual unit organization. Current multicore CPUs were built by replicating processors originally designed for sequential computing and are therefore optimized for low latency. As a result, they perform relatively well on executing sequential workloads but have substantial performance issues with non-trivial parallel functionalities.
To support high-speed parallel code execution, Flow introduces the PPU, which leverages parallel slackness to reorganize operations, shifting the requirement from low latency to high throughput, and integrates it with a CPU. The resulting dual-unit CPU-PPU architecture combines the best of both worlds to achieve the best performance for modern workloads that contain a lot of parallelism but also some sequential parts, while maintaining backwards compatibility via the CPU.
L. Minimal disadvantages of superpipelining while retaining all benefits, including full support for long-latency, floating-point and application-specific operations. Flow Computing’s PPU is a fully superpipelined design with a regular structure and patented support for long-latency operations, floating-point operations, and optional application-specific operations. In current CPUs, superpipelining increases pipeline delays, which can cancel out the performance benefits.
M. Parametric design and instruction set independence. Flow is not limited to single instances only; it features parametric blocks with a design-time adjustable number of CPU cores, number of PPU cores, number and types of functional units per PPU core, size and organization of step caches, scratchpads and on-chip shared caches, latency compensation unit length, instruction set, etc. Current CPU designs are typically tied to certain instruction sets and may require partial redesign if these kinds of parameters are altered.
N. Support for key patterns in parallel computing. Flow supports the key patterns of parallel computation, such as parallel execution, reduction, spreading and permutation, whereas current-generation CPUs only support the parallel execution pattern without slowdown. A minimal sketch after this list illustrates what these patterns look like in code.
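To make the parallel execution, reduction and prefix patterns named in items C and N above more concrete, here is a minimal sketch using standard C++17 parallel algorithms. It is only an analogy running on a conventional multicore CPU; it does not use, and does not stand in for, Flow’s hardware primitives or toolchain.

```cpp
// Analogy only: standard C++17 parallel algorithms illustrating the patterns
// named above (parallel execution, reduction, prefix/scan) on a conventional
// CPU. This is not Flow's API or hardware primitive set.
#include <algorithm>
#include <execution>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> data(1'000'000, 1);

    // Parallel execution pattern: apply an operation to every element.
    std::for_each(std::execution::par, data.begin(), data.end(),
                  [](int& x) { x *= 2; });

    // Reduction pattern: combine all elements into a single value.
    long long sum = std::reduce(std::execution::par,
                                data.begin(), data.end(), 0LL);

    // Prefix (scan) pattern: running totals over the whole range.
    std::vector<long long> prefix(data.size());
    std::inclusive_scan(std::execution::par, data.begin(), data.end(),
                        prefix.begin(), std::plus<long long>(), 0LL);

    std::cout << "sum = " << sum << ", last prefix = " << prefix.back() << "\n";
}
```

On a PPU-enabled CPU, the claim above is that reductions and multi-prefix operations of this kind map onto dedicated hardware primitives instead of being emulated with threads and software synchronization.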
How does the PPU compare to the existing chip units inside the die?
The blocks inside the CPU die are optimized and meant for different purposes - vector units for vector calculation, matrix units for matrix calculation, etc. The Parallel Processing Unit is optimized for parallel processing.
On-die GPUs and NPUs are also used for accelerating predefined use cases. NPUs in particular have very limited use and cannot be used for general-purpose computation. GPUs are more versatile than NPUs but they can only effectively process tasks with easy memory access patterns and synchronization requirements.
Programming the PPU is part of the CPU code. Other on-die units need special programming and developers trained in those architectures.
What are the key differences between PPUs and GPUs?
The PPU is optimized for parallel processing, while the GPU is optimized for graphics processing. The PPU is more closely integrated with the CPU, and you could think of it as a kind of co-processor, whereas the GPU is an independent unit, loosely connected to the CPU.
Improving the CPU’s parallel processing capability with a PPU brings significant benefits. In GPU programming, the width of parallelism is fixed within a kernel, while the width of the PPU can vary. This flexibility helps avoid the inefficiencies often seen in GPU kernels. Starting a kernel involves some overhead, i.e., there is a minimum amount of work before offloading to the GPU is profitable. In contrast, the PPU works with a significantly wider range of programs because it can be utilized as an integral part of the code without creating a separate kernel.
Don’t GPUs already provide parallel processing capabilities in both the rasterization and geometry pipelines? Why add them to the CPU?
Improving the CPU’s parallel processing capability with a PPU brings significant benefits. In GPU programming, the width of parallelism is fixed within a kernel, while the width in the PPU can vary. This often causes inefficiencies in GPU kernels, which are avoided in Flow Computing’s PPU architecture. Starting a kernel involves some overhead: there is a minimum amount of work before offloading to the GPU becomes profitable. In contrast, the PPU works for a significantly wider range of programs because it can be utilized as an integral part of the code, without requiring the creation of a separate kernel. Moreover, CPU and GPU memories are normally separate, which leads to memory consistency challenges.
What kind of PPU would need to be added to a CPU to equal the performance of a high-end GPU?
Our goal is not to replace the GPU, but to improve the performance of the weakest link in computation: the CPU. CPUs powered by Flow’s PPU gain a large performance boost, enabling what we call SuperCPUs and improving the performance of the entire system, including GPUs. CPUs, and future SuperCPUs, are primarily designed for different functionalities than GPUs. When the functionality requires non-trivial access patterns or contains inter-thread dependencies, SuperCPUs will be much faster than GPUs.
GPU vs. CPU is a bit beside the point. The more interesting comparison is between the current CPU and the next-generation SuperCPU powered by Flow’s PPU. NVIDIA does have its own Grace CPU Superchip with 144 cores (that is, 2x72 cores). If a comparable system had a CPU with 72 cores coupled with a 64-core PPU, it would likely have much better performance than the current Grace CPU. Integrating a system like the Grace CPU with Flow’s PPU, coupled to configurations with powerful GPUs such as the current Blackwell and the future Rubin series, would raise the performance bar tremendously.
In what way does Flow differ from, and offer advantages over, architectures that combine a CPU and GPU on a single silicon chip?
The PPU provides better utilization of compute resources than GPUs because, in the PPU, the amount of parallelism can be set dynamically to follow the optimum, whereas in GPUs it is more or less fixed. Processing in the PPU starts immediately as part of a CPU program, whereas for a GPU, a kernel must be launched and executed outside the CPU. Starting a kernel involves some overhead, i.e., there is a minimum amount of work before offloading to the GPU becomes profitable.
In GPU programming, the width of parallelism is also fixed within a kernel, while the width in the PPU can vary; this causes inefficiencies in GPU kernels that are avoided in Flow’s PPU.
In contrast, the PPU can start directly within program execution, which enables its use for smaller workloads and for a wider range of programs. Moreover, CPU and GPU memories are normally separate, which necessitates explicit data transfers or implicit synchronization overhead in mapped memory regions.
Unifying CPU and GPU memory into a single physical memory will thus bring its own challenges, both in hardware design and in programming. For example, the question of when memory writes made by the CPU become visible to the GPU, and vice versa, leads to memory consistency models that complicate reasoning about program correctness, or to a safe but inefficient programming style. To summarize, programming for CPU+PPU is more comfortable than programming for CPU+GPU, with significantly greater flexibility.
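As a rough illustration of the structural difference (this is not Flow’s API, which is not shown here), the sketch below uses a standard C++17 parallel algorithm as a stand-in for the “inline” style described above: the parallel step happens mid-function, on data already in the CPU’s memory, with no separate kernel, device buffers or copies. A GPU offload of the same step would typically add device allocation, input copies, a kernel launch and a result copy, which only pays off above a minimum problem size.

```cpp
// Stand-in for inline parallelism within ordinary CPU code (not Flow's API):
// no kernel, no device memory management, no host/device copies.
#include <algorithm>
#include <execution>
#include <vector>

std::vector<float> scale(std::vector<float> v, float factor) {
    // Inline parallel step on data already resident in CPU memory.
    std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                   [factor](float x) { return x * factor; });
    return v;  // control returns straight to sequential code
}

int main() {
    std::vector<float> v(1 << 20, 1.0f);
    v = scale(std::move(v), 3.0f);
    return v.front() == 3.0f ? 0 : 1;
}
```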
Some CPU accelerators introduce extra delays in computation. Does the PPU suffer from this problem?
Flow’s PPU is tightly connected to the CPU. There will be minor latency in passing parameters to the PPU, but it rarely causes delays due to the natural overlapping of PPU and CPU operations. Even if there were a minimal delay, the performance gains of the PPU would outweigh the minor slowdown by a large margin. Sequential legacy code will be executed by the CPU without PPU involvement, so latency remains unchanged in that case.
When the PPU executes the parallel parts of the code, it takes a different approach to latency than the CPU: rather than minimizing the latency of individual instructions, it exploits parallel slackness to maximize throughput. This is used, for example, to hide the latency of memory operations by executing other fibers while memory is being accessed. The need for cache coherence maintenance traffic is eliminated by placing no caches at the front of the intercommunication network. Scalability is provided via a high-bandwidth network-on-chip, ultimately supporting the memory access needs of general parallel computing.
Is there a theoretical limit to what kind of PPU can be added to a typical mobile, PC or server CPU?
The PPU is parametric - it can be configured for any desired use case: the number of PPU cores (e.g., 16, 64, 256...), the number and type of functional units (e.g., ALUs, FPUs, MUs, GUs, NUs), the size of on-chip memory resources (caches, buffers, scratchpads), etc. The performance boost scales with the number of PPU cores: for very small devices (e.g., a smart watch), a PPU with 4 cores would work well; a PPU with 16 cores would be suitable for smartphones and laptops; a PPU with 64 cores would work well for desktop computers; and a PPU with 256 cores would likely be the most suitable configuration for AI and edge computing servers.
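Purely as a hypothetical illustration of what “parametric” means in practice, the sketch below mirrors the design-time knobs listed above. Flow has not published a configuration interface, and the unit mixes and memory sizes are placeholder values; only the core counts (4, 16, 64, 256) come from the answer above.

```cpp
// Hypothetical sketch of design-time PPU parameters; not a Flow interface.
// Core counts follow the answer above; all other values are placeholders.
#include <cstdio>
#include <initializer_list>

struct PpuConfig {              // fixed at design time, not at runtime
    unsigned ppu_cores;         // e.g. 4, 16, 64, 256
    unsigned alus_per_core;     // integer functional units (placeholder)
    unsigned fpus_per_core;     // floating-point units (placeholder)
    unsigned scratchpad_kib;    // per-core scratchpad size (placeholder)
    unsigned shared_cache_mib;  // on-chip shared cache size (placeholder)
};

int main() {
    const PpuConfig wearable{4, 2, 1, 32, 1};    // smart-watch class
    const PpuConfig mobile{16, 4, 2, 64, 4};     // smartphone / laptop class
    const PpuConfig desktop{64, 4, 2, 128, 8};   // desktop class
    const PpuConfig server{256, 8, 4, 256, 32};  // AI / edge server class

    for (const PpuConfig& c : {wearable, mobile, desktop, server})
        std::printf("%3u PPU cores, %u ALUs + %u FPUs per core\n",
                    c.ppu_cores, c.alus_per_core, c.fpus_per_core);
}
```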
Can Flow really be used in anything from a mobile phone to a supercomputer?
Yes, because the PPU is both configurable and parametric, it suits a wide range of use cases. The number of PPU cores (e.g., 4, 16, 64, 256...), the number and type of functional units (e.g., ALUs, FPUs, MUs, GUs, NUs), and the size of on-chip memory resources (caches, buffers, scratchpads) are all parametric.
The performance boost scales up with the number of PPU cores: for very small devices (e.g., a smart watch) a PPU with 4 cores is highly suitable; 16 cores for smartphones and laptops; 64 cores for PCs; and a PPU matrix with 256 cores or more would likely be the most suitable configuration for a supercomputer.
What are the use cases for Flow in AI?
Data pre- and post-processing currently accounts for up to 50% of the total time when an LLM is trained for a new language. This can be significantly reduced by high-performance, PPU-powered CPUs. In addition, locally hosted AI would become far more feasible. Many AI problems are parallel in nature, so improved parallel processing performance could make a significant impact.
What are the use cases for Flow in supercomputers and the defense industry?
Our technology can dramatically improve standard supercomputer performance! In addition, the PPU can be configured to REDUCE power consumption! Due to the parametric nature of the PPU, the uplift in performance can be traded for lower processing power consumption - so a 100x performance boost could be traded for a 10x performance boost with 10x less power consumption.
In the defense industry, missiles, drones, and missile and drone defense systems are the most lucrative use cases, alongside military aviation. Whoever processes and computes data the fastest will win in warfare. Flow therefore has a major geopolitical and defense impact.
What is the estimated performance benefit of using Flow Computing’s technology? In particular, how are the parallel computing performance gains likely to translate into improvements at the full application level?
The question of performance benefits is answered separately for software that is specifically written with Flow Computing’s technology in mind and for legacy software. For the latter, the availability of source code allows programmers to recompile with a compiler that is aware of Flow Computing’s technology, which detects opportunities for performance improvement automatically. A definite advantage is the possibility to run “classic code” on the CPU cores and exploit parallelism on the PPU cores whenever it occurs in an application.
Think of it as similar to the original PowerPC chip in 1990s Macs. Older 68K software ran in compatibility mode, while new software ran much faster using the PowerPC instruction set. A further performance benefit for legacy code can be obtained if operating system or programming system libraries (e.g., sorting via qsort() in the C library) are ported to utilize Flow and then run faster on the PPU’s cores, even if the application code itself is unmodified. There will be significant performance gains for most types of applications, especially those that exhibit degrees of parallelism but cannot be parallelized with current thread-based schemes.
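A small sketch of that library scenario: the application below only calls the standard qsort() interface. If, as described above, the C library implementation behind that call were ported to run its sorting work on PPU cores (the porting is hypothetical here), this code would benefit without being modified.

```cpp
// Unmodified application code: the only contact with the library is the
// standard qsort() call. Any PPU acceleration would live inside the library
// implementation (hypothetical here), not in this code.
#include <cstdio>
#include <cstdlib>

static int cmp_int(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

int main() {
    int values[] = {42, 7, 19, 3, 25};
    std::qsort(values, 5, sizeof values[0], cmp_int);  // ordinary library call
    for (int v : values) std::printf("%d ", v);
    std::printf("\n");
}
```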
For what types of algorithms does Flow’s technology likely work, and how widely can it be applied? Are there some specific areas it is unlikely to work in?
Fields such as numeric and combinatorial simulation and optimization, which are widely used in business computing (from logistics planning to investment forecasting), will greatly profit from Flow. Such applications tend to be heavily parallelized, often for GPU clusters, and will benefit from the flexibility of thick control flows over GPU thread blocks. Flow’s technology also works for classic numeric and non-numeric parallelizable workloads ranging from matrix and vector computation to sorting. In code with small parallelizable parts that are left unparallelized because the runtime overhead exceeds the benefit of parallelization, Flow still brings a performance boost.
A growing field in which Flow’s technology is highly applicable is artificial intelligence. Both machine learning (e.g., training neural networks) and symbolic AI (often searching through large graphs) can benefit from Flow, as these applications currently run largely on GPUs but require CPU involvement in pre- and post-processing and often involve irregular patterns. In addition, the regularity requirement in GPU computing limits parallelization on GPUs, which can be overcome by the flexibility of Flow Computing’s technology.
Is Flow’s architecture dependent on state-of-the-art manufacturing processes?
Flow’s PPU can be integrated into any current or pending design architecture or silicon process.
Is Flow foundry- or architecture-dependent?
Flow is completely foundry and architecture independent - no changes to tools or processes are required. It can also be used with all instruction sets.
What is the applicability of Flow’s technology? How widely could PPUs be adapted?
Flow’s technology is adaptable to any microprocessor-based system that currently uses single or multiple processor cores and/or massively parallel accelerator devices such as GPUs: the CPU cores handle the functionality with limited parallelism, and the PPU cores take on the role of the accelerator. The PPU cores can then take over parallel parts of the work from the CPU cores - parts that are not currently outsourced to the accelerator because the parallel work is either too irregular or too small to justify the overhead.
Thus, the technology is widely adaptable, ranging from classic desktop and laptop computers to embedded systems and digital signal processors, and to smartphones, where part of the workload does not come from the user (such as video decoding) but is generated by the system itself (e.g., software parts of the radio stack, which are numeric digital signal processing tasks, or network stack processing, which is at least high-throughput). Current microprocessor-based systems face the unenviable situation where operating frequencies can no longer be increased, so performance improvements for a single application must come from the use of parallelism rather than from faster operating frequencies. Furthermore, as memory devices have not become correspondingly faster, access to main memory is quite slow when measured in processor core cycles (hundreds of cycles). As a result, main memory access must be avoided as much as possible through the use of fast but small caches that store frequently accessed data items.
Efficient cache use necessitates a programming style that leads to particular memory access patterns, which often hinders parallelization. Flow’s technology brings advantages in both areas: the PPU allows a more flexible, and therefore more widespread, use of parallelization compared to parallel threads, which also supports programmability and programmer efficiency. Emulated shared memory hides long latencies instead of avoiding them, by exposing them to the CPU’s processor cores and allowing the cores to better schedule other tasks for execution while the PPU waits for a memory read.
When is Flow announcing the availability of this IP platform in its entirety?
Flow exited stealth in June 2024 with the announcement of its incorporation, funding, and the basic details of its patented PPU architecture. Flow is still developing its IP platform and product further, so stay tuned for our future progress and full details of our technical innovations. Companies that commit to early access to the technology will naturally receive more technical details early on.
Are there any fabs currently aligned with Flow? If not, which are most suited to deploying this?
Not at the moment. Flow is totally fabless, and foundry- and architecture-independent: no changes to tools or processes are required. It can also be used with all instruction sets. Thus, CPUs with the PPU can be deployed by any fab.
How does Flow provide its IP to a licensee? VHDL source, source code, final design schematics, compiled software?
Our target is to provide our IP to the RISC-V market as soft IP, i.e., synthesizable HDL. For the ARM and x86 markets, we will offer architecture-type licensing, which allows licensees to use our patents and other IP to implement their own PPU.
Does Flow work with LLVM or support custom compilers?
Yes, our software stack is built on top of the LLVM compiler infrastructure, and we are developing custom extensions to support the PPU architecture. This allows us to integrate with existing toolchains and makes parallel programming more accessible to developers familiar with standard languages. As our technology matures, we plan to provide further documentation and tooling to support developers and partners.
Can developers or researchers access Flow’s technology or tools?
Flow is currently collaborating with selected partners during its early access phase. Broader availability of our development tools, documentation, and evaluation environments is planned as we approach commercial release. If you're a developer, researcher, or organization interested in testing or integrating our technology, we encourage you to get in touch.
How much die space does a PPU require to achieve 100X performance over standard architectures?
It depends on the system configuration. In systems with a high number of processor cores, it is expected that several CPU cores could be substituted by the PPU. The PPU uses leftover die space without requiring any extra silicon area.
Our initial, very rough silicon area estimation model is based on legacy silicon technology parameters and public scaling factors. For a 64-core PPU achieving a 38X - 107X speed-up in laboratory tests, the initial estimated silicon area is 21.7 mm^2 in a 3 nm silicon process. For a 256-core PPU achieving a 148X - 421X speed-up, the estimated area is 103.8 mm^2.
How much additional power draw does a pipeline of PPUs typically require?
The PPU can actually be configured to REDUCE power consumption! Due to its parametric nature, performance can be traded for lower energy use. For example, a 100x performance boost could be traded for a 10x performance boost with 10x lower power consumption.
Power draw depends heavily on the desired configuration. Our initial power consumption estimation model (based on legacy silicon technology parameters and public scaling factors) indicates 43.4W consumption for a 64-core PPU delivering a 38X - 107X speedup in laboratory tests if a 3 nm silicon process were used, and 235W consumption for a 256-core PPU delivering a 148X - 421X speedup.
What are the tradeoffs between maintaining full backwards software compatibility with existing architectures (e.g., x86, ARM, Power) vs. maximizing performance?
ALL existing software is compatible with CPUs that have PPU matrices built in. The level of performance boost depends on the amount of parallelism in the software: the more parallelism, the more boost the PPU will generate, without any source code changes and using only recompilation. If the libraries are already optimized for Flow, even more performance gains are achieved without any additional steps.
For maximum performance gains, it is possible to refactor the critical parts of the code or rewrite them entirely as natively parallel code. We will develop AI/smart compiler tools to help companies identify which parts of their software can be parallelized.
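As a generic illustration of what “identifying parallelizable parts” means (ordinary C++, not output from Flow’s tools): the first loop below has fully independent iterations, the kind of code a parallelizing compiler can distribute automatically during recompilation; the second carries a dependency between iterations and would need refactoring before it could run in parallel.

```cpp
// Generic example of parallelizable vs. dependent loops; not Flow tooling.
#include <cstddef>
#include <cstdio>
#include <vector>

// Each iteration writes only its own element: safe to run in parallel.
void squares(std::vector<float>& out, const std::vector<float>& in) {
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = in[i] * in[i];
}

// acc depends on the previous iteration: not parallel as written; it would
// need restructuring (or must stay sequential) to exploit parallel hardware.
float recurrence(const std::vector<float>& in) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i)
        acc = 0.5f * acc + in[i];
    return acc;
}

int main() {
    std::vector<float> in(1000, 2.0f), out(1000);
    squares(out, in);
    std::printf("out[0] = %.1f, recurrence = %.3f\n", out[0], recurrence(in));
}
```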
What are some major clients Flow is currently working with?
We are in positive discussions with leading-edge CPU companies such as AMD, ARM and Intel to co-develop the future era of advanced CPU computing, starting with server CPUs. We are also interacting with companies looking into the server CPU market, like Qualcomm and SiFive, and others using the open-source RISC-V instruction set.
Are there any target markets / products that Flow plans to especially address?
Flow’s unique PPU architecture excels in general-purpose parallel computing and in the most demanding applications such as locally-hosted AI. We can also turbocharge server/cloud CPUs in data centers for uses such as edge and cloud computing, AI clouds, and more.
What is Flow’s business model, and what makes it almost unique among semiconductor companies?
Our business model is based on licensing our technology, just as ARM does, to various licensees around the world. The PPU is totally independent of instruction set design, so it can be used in any modern CPU and integrated into any current or pending design architecture using any silicon process.
Who are the company’s primary competitors?
Since the PPU is a unique, one-of-a-kind product that we have invented, patented and trademarked, we do not have any direct competitors. In a sense, our biggest competitor is the fact that the industry and CPU manufacturers keep using the current ways of improving performance - e.g., adding more processor cores, using smaller feature sizes and increasing the clock frequency - instead of utilizing new alternatives that offer FAR higher performance.
What CPU vendors are most likely to consider licensing the Flow PPU architecture and why?
Our vision is for the PPU to be used in all future high-performance CPUs. The most likely CPU vendors to license Flow’s PPU are leading-edge CPU companies such as ARM, AMD, Intel, Apple, NVIDIA, Samsung and Qualcomm. We also see strong potential for hyperscaler companies that use custom CPUs for specific use cases requiring the utmost from CPU performance. Most RISC-V CPU companies are successfully developing these types of custom CPUs.
Why would compute powerhouses like AMD, Apple, ARM, Intel, NVIDIA, and Qualcomm ever consider licensing Flow’s PPU when they have already invested billions in their own designs?
These companies are constantly looking for breakthrough technologies to improve their products. The technologies used in current multicore CPUs cannot solve the inefficiencies that arise when executing parallel functionalities. Flow is developing a unique product with the PPU, one that will enable these computing powerhouses to step into a new era of SuperCPUs. The PPU is complementary in nature, benefitting all CPU computations and instruction sets. It is independent of, and fits with, all existing fabs, foundries, tools and processes, so the PPU integrates easily into their own designs. We are now looking to engage with these leading-edge CPU companies to co-develop the future era of CPU computing. Parallel portions of code can be expressed as natural parallel statements without concerns about race conditions, deadlocks and synchronization.
Licensing a core architecture from a new, unknown startup is a BIG ask for a company like Intel (or any other one, for that matter). How can Flow assure a potential licensee that the architecture will be around for the foreseeable future?
This is a legitimate concern. We plan to increase resourcing and scale up capabilities such as technical customer support to provide rock-solid assurances to our customers. Our investors are fully committed to backing us on our growth journey, and we are obviously looking to bring new investors onboard in future financing rounds. With positive market traction, we are certain of our ability to continue growing to meet the rigorous demands of our licensees. Customers who license our architecture will have full technical control over their design.
It all sounds great - but if it’s so amazing, why haven’t the multi-billion dollar chip companies already done it? What’s the catch?
Parallel processing is not a “new kid on the block” in computing; it has actually been around conceptually since the 1970s. It never became mainstream in the early days of the PC due to its programming complexity and inefficient architectures. As a result, past architectural choices led to current multicore CPUs replicating processors originally optimized for sequential computing.
Over time, the industry has learned to settle for incremental performance gains achieved by adding cores and increasing clock frequency. Moreover, the industry has become complacent with cumbersome and unproductive programming techniques. This has driven us to meticulously research and develop parallel processing technology. With the PPU and our technology stack, we can finally combine all the benefits of current CPUs and parallel processing.
How long has Flow been in business as an independent company?
The company was established in January 2024 as a spin-off from VTT Technical Research Center of Finland.
How much money has the company raised to date?
In total, €4M.
What round of funding is the company currently in?
Flow closed the first funding round (pre-seed) on January 31st, 2024.
How many people does Flow employ?
The company has three co-founders and has hired the best pan-European industry talent to work on the first commercial version of the PPU and its compiler. The current headcount is ten people.
Which angels and venture firms have invested in Flow to date?
Butterfly Ventures (Finland), Sarsia (Norway), Stephen Industries (Finland), Superhero Capital (Finland), and FOV Ventures (Finland/UK).
What were some of the key reasons VC’s decided to invest in Flow?
Our investors were especially excited about the innovativeness and uniqueness of Flow’s technology, its strong IP portfolio, and the potential to enable a new era of SuperCPUs for the AI revolution.
Who are Flow's founders and what makes them qualified to create all of this?
Flow was founded by Dr. Martti Forsell, Jussi Roivainen, and Timo Valtonen. Dr. Forsell has been researching parallel processor architectures and programming for several decades, first at the University of Joensuu (now the University of Eastern Finland) and later at the VTT Technical Research Center of Finland, where Jussi Roivainen joined his research team. Timo Valtonen joined the team to drive and plan the commercialization of the research and technology.
The initial idea was to develop the fastest CPU in the world! Alongside this original idea, the team started in parallel (bad pun, sorry) to explore the possibility of creating a product that could be used by all CPU manufacturers. The PPU and Flow were born from this, alongside the vision that Flow’s PPU would be used in all high-performance CPUs, ushering in a new era in CPU computing. Years of joint work have led to this point, and the founders now have a funded company to fulfill that vision.
Who are Flow's advisory team members?
Flow doesn't have an advisory team yet. We are currently evaluating candidates to form an advisory team later this year, balanced between marketing and technical expertise.
Does Flow plan to open regional offices in other geographies?
Flow Computing is headquartered in Finland, and the current team is spread across several European countries. In the future, we plan to have an office in the USA.
Where was Flow’s original design architecture conceptualized?
Dr. Martti Forsell began researching parallel processor architectures and parallel computing at the University of Joensuu (now the University of Eastern Finland). He continued his research at VTT Technical Research Center of Finland, together with Jussi Roivainen. Flow Computing’s technologies and patents are the result!
Does VTT Research still own a stake in the company or rights to IP?
VTT is a minority owner of Flow Computing after a significant IP contribution to the company. Flow has full rights to and ownership of the IP, and is continuously creating new IP around parallel processing.
Is Flow Computing publicly traded? / Can I invest in Flow Computing or is it listed on the stock market?
Flow Computing is a privately held company and is not publicly traded. We are currently backed by venture capital and private investors. While individual investment opportunities are not available at this time, we welcome interest from qualified institutional investors and strategic partners.
Is Flow a chip company? Does it make physical PPUs?
Flow is a fabless semiconductor IP company. We do not manufacture chips or physical processors. Instead, we provide our PPU as soft IP. Our partners, such as chipmakers and system integrators, can integrate the PPU into their own silicon or SoC designs.
Where can I see performance benchmarks or technical documentation?
Performance benchmarks and technical documentation are available to qualified persons upon request. We share some high-level performance results publicly on our website, LinkedIn, and selected publications. For detailed technical access or evaluation opportunities, please contact our team directly here.
Where can I follow Flow’s latest updates?
Our News and Events pages are the best places to stay up to date with major announcements and where to find us next.
We also share updates on LinkedIn, Bluesky, Facebook, Instagram, and YouTube. But if you’d rather not miss a thing, our newsletter brings it all together: milestones, performance insights, and behind-the-scenes content, sent only when we have something worth sharing.
Subscribe here.
Didn't find what you're looking for?
Contact us with your question.