Interested in discovering further details about Flow and the Flow Parallel Processing Unit (PPU)?

FAQ

Our FAQ provides plenty of information about Flow and the Flow Parallel Processing Unit (PPU).

If you are curious about the Flow Parallel Processing Unit (PPU) or Flow in general, check out our extensive FAQ. It includes more information, for example, about the benefits of Flow PPU and its key design principles and characteristics. There is also more background on the company and its business.

Please ping us at info@flow-computing.com if there is a piece of information you would like to know that is not covered in our FAQ.

Benefits of Flow Parallel Processing Unit (PPU)

What problem is Flow trying to solve?
We believe that recent decades have brought only incremental improvements in CPU performance. In our opinion, this has led to a situation where the CPU has become the weakest link in computing, due to a sequential architecture that is suboptimal for modern workloads. A new era in CPU performance has become a necessity to meet the continuously increasing demand for computing performance (driven largely by the needs of AI, edge, and cloud computing). We intend to lead this revolution through our radically new Flow Parallel Processing Unit (PPU) architecture, enabling significant performance boosts for parallel functionality. Flow PPU can be applied to any CPU architecture while maintaining full backward software compatibility.

What core benefits does Flow offer over its competitors?
Flow’s architecture is referred to as the Flow Parallel Processing Unit (PPU). It boosts CPU performance significantly, enabling a totally new era of CPU performance. Flow PPU is fully backward compatible with existing software and applications - the parallel functionality in legacy software and applications can be accelerated by recompiling them for the Flow PPU, even without any code changes. The more parallel functionality there is, the greater the resulting performance boost. Our technology is also complementary in nature - while it boosts the CPU, all other connected units (such as matrix units, vector units, NPUs, and GPUs) benefit indirectly from the more capable CPU.

Can a GPU benefit from PPU architecture?
GPUs will most definitely benefit (indirectly) from PPU usage in the CPU. Because CPU performance is greatly improved by Flow PPU, most computations shared between the CPU and GPU will also benefit from this improvement. The overall performance of CPU + GPU configurations will improve significantly with the PPU.

What types of devices benefit most from Flow PPU architecture?
All devices that require high-performance CPUs will benefit greatly from licensing Flow PPU.

What are the core characteristics and design principles of a Flow Parallel Processing Unit (PPU)?

To highlight the hardware advantages of Flow over traditional SMP/NUMA CPU (and GPU) computing, let’s take a closer look at several key differences between them:

A. Non-existent cache coherence issues. Unlike current CPU systems, our architecture has no cache coherence issues in the PPU’s memory system, because the memory organization excludes caches at the front of the intercommunication network.

B. Cost-efficient synchronization. Our synchronization cost is roughly 1/Tc clock cycles (where Tc is the number of fibers per Flow PPU core), whereas in SMP/NUMA CPU systems it can range from hundreds to thousands of clock cycles, and in GPUs from thousands to hundreds of thousands of clock cycles; synchronization inside Flow PPU is thus almost cost-free.
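As a rough illustration, the cost figures above can be put side by side in a toy model. The Tc value and the SMP/NUMA and GPU cycle counts below are illustrative picks from the ranges quoted above, not measurements:

```python
# Toy comparison of amortized synchronization cost, using the representative
# cycle counts quoted above. Tc and the midpoints are illustrative picks.

def amortized_sync_cost(total_cycles, participants):
    """Synchronization overhead, in cycles, per participating fiber/thread."""
    return total_cycles / participants

Tc = 512                               # hypothetical fibers per PPU core
ppu_cost = amortized_sync_cost(1, Tc)  # ~1/Tc cycles per fiber
cpu_cost = 500                         # SMP/NUMA: hundreds to thousands
gpu_cost = 50_000                      # GPU: thousands to hundreds of thousands

print(f"PPU: {ppu_cost:.4f} cycles/fiber, CPU: {cpu_cost}, GPU: {gpu_cost}")
```

The point is only the orders of magnitude: amortized over Tc fibers, a barrier costs a small fraction of a cycle per fiber.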

C. Support for parallel computing primitives. Our architecture provides unique and specific techniques/solutions for executing concurrent memory access operations (both read and write), multi-operations for executing reductions, multi-prefix operations, compute-update operations, and fiber mapping operations in the most efficient manner possible. These primitives are not available in current CPUs. Implementing them in Flow PPU involves active memory technologies at the SRAM-based on-chip shared cache level, potentially providing better performance and greater flexibility than current DRAM-based processing-in-memory (PIM) solutions, while still allowing both techniques to be used in the same design.
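To make these primitives concrete, here is a plain-Python sketch of what a multi-operation (a reduction) and a multi-prefix operation compute when many fibers target the same memory location. In Flow PPU these would be executed by the active memory units; the sequential loop below only illustrates the semantics, not the implementation:

```python
def multi_add(memory, addr, contributions):
    """Multi-operation: every fiber adds its value to memory[addr] in one
    step; returns the reduced result."""
    memory[addr] += sum(contributions)
    return memory[addr]

def multi_prefix_add(memory, addr, contributions):
    """Multi-prefix: each fiber receives the running total before its own
    contribution (like an atomic fetch-and-add for all fibers at once)."""
    results = []
    for value in contributions:        # sequential here; concurrent in a PPU
        results.append(memory[addr])
        memory[addr] += value
    return results

mem = {0: 10}
print(multi_prefix_add(mem, 0, [1, 2, 3]))  # [10, 11, 13]
print(mem[0])                               # 16
```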

D. Flexible threading/fibering scheme. Our technology allows for an unbounded number of fibers at the model level, which can also be supported in hardware (within certain bandwidth constraints). In current-generation CPUs, the number of threads is in theory unbounded, but if the number of hardware threads is exceeded and the threads are interdependent, performance can fall below that of a single-threaded version. In addition, operating systems typically limit the number of threads to a few thousand at most. The mapping of fibers to backend units is a programmable function, allowing further performance improvements with a Flow PPU.

E. Low-level parallelism for dependent operations. In Flow PPU-enabled CPUs, it is possible to execute dependent operations with full utilization within a step (enabled by the chaining of functional units), whereas in current CPUs, the operations executed in parallel must be independent due to the parallel organization of their functional units.

F. Non-existent context switching costs. In Flow PPU-enabled CPUs, fiber switching has zero cost, whereas in current CPUs, context switching typically takes over 100 clock cycles.

G. Intercommunication traffic congestion avoidance. In Flow PPU-enabled CPUs, the probability of intercommunication network traffic congestion is low due to hardware support for hashing, concurrent memory access, multi-operations, and multi-prefix operations. In current CPUs, congestion can occur frequently if access patterns are non-trivial.

H. Scalable latency tolerance. Our technology provides a scalable latency-hiding mechanism for constellations of up to thousands of cores, whereas in current CPUs, latency tolerance (with snooping/directory-based cache coherence maintenance mechanisms) appears to scale poorly. There is also evidence that in high-end Flow PPU-enabled systems, even the latency of DRAM-based memory systems can often be hidden with suitable memory organization and sufficient bandwidth.

I. No need for locality-maximizing memory data partitioning. Flow PPU-enabled CPUs work well with all kinds of memory partitioning schemes, whereas the performance of current CPUs is highly sensitive to schemes not maximizing the locality of references.

J. Sufficient intercommunication bandwidth. Flow PPU has an intercommunication network designed to provide sufficient bandwidth for random communication, whereas current CPUs are limited to cases where most references are local. Such locality maximization is not always possible, as there is no general algorithm for maximizing locality.

K. Dual unit organization. Current multicore CPUs were built by replicating processors originally designed for sequential computing and are therefore optimized for low latency. As a result, they perform relatively well on executing sequential workloads but have substantial performance issues with non-trivial parallel functionalities.

To support high-speed parallel code execution, we introduce the Flow PPU, which leverages parallel slackness to reorganize operations, shifting the need for low-latency to a need for high throughput, and integrating it with a CPU. The resulting dual-unit CPU-Flow PPU architecture combines the best of both worlds to achieve the best performance for modern workloads containing a lot of parallelism, but also some sequential parts, while maintaining backward compatibility via the CPU.

L. Minimal disadvantages of superpipelining while retaining all benefits, including full support for long-latency, floating-point, and application-specific operations. Flow PPU is a fully superpipelined design with a regular structure and patented support for long-latency operations, floating-point operations, and optional application-specific operations. In current CPUs, superpipelining increases pipeline delays, which can erode the performance benefits.

M. Parametric design and instruction set independence. Our design is not limited to single instances; it features parametric blocks with a design-time-adjustable number of CPU cores, number of Flow PPU cores, number and types of functional units per Flow PPU core, size and organization of step caches, scratchpads, and on-chip shared caches, latency compensation unit length, instruction set, etc. Current CPU designs are typically tied to certain instruction sets and may require partial redesign if such parameters are altered.

N. Support for key patterns of parallel computing. We support the key patterns of parallel computation, such as parallel execution, reduction, spreading, and permutation. Current-generation CPUs support only the parallel execution pattern without slowdown.

Flow Parallel Processing Unit (PPU) compared to other chip units and solutions

How does Flow PPU compare to the existing chip units inside the die?
The blocks inside the CPU die are optimized and meant for different purposes - vector units for vector calculation, matrix units for matrix calculation, etc. Flow Parallel Processing Unit is optimized for parallel processing.

On-die GPUs and NPUs are also used to accelerate predefined use cases. NPUs in particular have very limited uses and cannot be used for general-purpose computation. GPUs are more versatile than NPUs, but they can only process tasks efficiently when the memory access patterns and synchronization requirements are simple.

Programming the Flow PPU is part of ordinary CPU code, whereas other on-die units require special programming and developers trained for those architectures.

What are the key differences between Flow PPUs and GPUs?
Flow PPU is optimized for parallel processing, while the GPU is optimized for graphics processing. Flow PPU is more closely integrated with the CPU, and you could think of it as a kind of coprocessor, whereas the GPU is an independent unit, loosely connected to the CPU.

Improving the CPU’s parallel processing capability with Flow PPU brings significant benefits. In GPU programming, the width of parallelism is fixed within a kernel, while the width of Flow PPU can vary. This flexibility helps avoid the inefficiencies often seen in GPU kernels. Starting a kernel involves some overhead, i.e. there is a minimum amount of work before outsourcing to the GPU is profitable. In contrast, Flow PPU works with a significantly wider range of programs because it can be utilized as an integral part of the code without creating a separate kernel.
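The break-even point mentioned above can be sketched with a toy cost model. The launch overhead and throughput numbers below are hypothetical, chosen only to illustrate the shape of the trade-off:

```python
# Toy break-even model for GPU offload: offloading pays off only when the
# work saved exceeds the fixed kernel-launch overhead. All numbers are
# illustrative, not measurements.

def cpu_time(n, cpu_rate=1.0):
    """Time to process n work items on the CPU (items per microsecond)."""
    return n / cpu_rate

def gpu_time(n, launch_overhead=10.0, gpu_rate=50.0):
    """Time on the GPU: fixed launch overhead plus faster processing."""
    return launch_overhead + n / gpu_rate

def break_even(launch_overhead=10.0, cpu_rate=1.0, gpu_rate=50.0):
    # Solve n / cpu_rate == launch_overhead + n / gpu_rate for n.
    return launch_overhead / (1 / cpu_rate - 1 / gpu_rate)

n_star = break_even()
print(f"offload profitable above ~{n_star:.1f} work items")
# Below n_star the CPU (or an in-line PPU region with no launch step) wins.
```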

Don’t GPUs already provide parallel processing capabilities in both the rasterization and geometry pipelines? Why add them to the CPU?
Improving the CPU’s parallel processing capability with Flow PPU brings significant benefits. In GPU programming, the width of parallelism is fixed within a kernel, while the width of Flow PPU can vary. This often causes inefficiencies in GPU kernels, which are avoided in the Flow PPU architecture. Starting a kernel involves some overhead: there is a minimum amount of work before outsourcing to the GPU becomes profitable. In contrast, Flow PPU works for a significantly wider range of programs because it can be utilized as an integral part of the code, without requiring the creation of a separate kernel. Moreover, CPU and GPU memories are normally separate, which leads to memory consistency challenges.

What kind of Flow PPU would need to be added to a CPU to equal the performance of a high-end GPU?
Our goal is not to replace the GPU, but to improve the performance of the weakest link of computation: the CPU. CPUs powered by Flow PPU boost performance, enabling what we call SuperCPUs and improving the performance of the entire system, including GPUs. Current and future CPUs are primarily designed for different functionalities than GPUs. When the functionality requires non-trivial access patterns or contains inter-thread dependencies, Flow PPU-enabled CPUs will be much faster than GPUs.

GPU vs. CPU is a bit beside the point. The more interesting comparison is between the current CPU and the next-generation SuperCPU powered by Flow PPU. NVIDIA does have its own Grace CPU Superchip with 144 cores (2 x 72 cores). If a comparable system had a CPU with 72 cores coupled with a 64-core PPU, it would likely deliver much better performance than the current Grace CPU. Integrating a system like the Grace CPU with Flow PPU, coupled with powerful GPUs such as the current Blackwell and the future Rubin series, would raise the performance bar tremendously.

In what way does Flow differ from, and offer advantages over, architectures that combine a CPU and GPU on a single silicon chip?
Flow PPU provides better utilization of compute resources than GPUs because, with Flow PPUs, the amount of parallelism can be dynamically set to follow the optimum level, whereas in GPUs it is more or less fixed. Processing in the Flow PPU starts immediately as part of a CPU program, whereas in GPUs, a kernel must be launched and executed outside the CPU. Starting a kernel involves some overhead, i.e., there is a minimum amount of work before offloading to the GPU becomes profitable. Moreover, CPU and GPU memories are normally separate, which necessitates explicit data transfers or implicit synchronization overhead in the mapped memory regions.

In GPU programming, the width of parallelism is fixed within a kernel, while the width of Flow PPU can vary; this causes inefficiencies in GPU kernels that Flow PPU avoids. Because Flow PPU can start directly within program execution, it can be used for smaller workloads and for a wider range of programs.

Unifying CPU and GPU memory into a single physical memory will thus bring its own challenges, both in hardware design and in programming. As an example, the question of when memory writes made by the CPU become visible to the GPU, and vice versa, leads to memory consistency models that normally complicate reasoning about program correctness, or leads to a safe but inefficient programming style. To summarize, programming for CPU+Flow PPU is more comfortable than programming for CPU+GPU, with significantly greater flexibility.

Some CPU accelerators introduce extra delays in computation. Does the PPU suffer from this problem?
Flow PPU is tightly connected to the CPU. There will be minor latency in passing parameters to the PPU, but it rarely causes delays due to the natural overlapping of PPU and CPU operations. Even if there were a minimal delay, the performance gains of the PPU would eliminate the potential minor slowdown by a large margin. Sequential legacy code will be executed by the CPU without PPU involvement, thus latency remains unchanged in this case.

When Flow PPU executes the parallel parts of the code, it takes a different approach to latency than the CPU: rather than minimizing the latency of individual instructions, it exploits the slack of parallelism to maximize throughput. This is used, for example, to hide the latency of memory operations by executing other threads while accessing memory. The need for cache coherence maintenance traffic is eliminated by placing no caches at the front of the intercommunication network. Scalability is provided via a high-bandwidth network-on-chip, ultimately supporting the memory access needs of general parallel computing.
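A toy steady-state model shows how parallel slackness hides memory latency: if a core round-robins its fibers, issuing one memory reference per cycle, latency is fully hidden once the number of fibers reaches the latency in cycles. The numbers below are illustrative:

```python
# Toy model of latency hiding through fiber interleaving: while one fiber
# waits on memory, the core issues references from the other fibers.

def cycles_per_reference(num_fibers, mem_latency):
    """Steady-state issue cost per memory reference for one core that
    round-robins num_fibers fibers, one reference per cycle, with each
    reference taking mem_latency cycles to complete."""
    # Throughput is min(1, num_fibers / mem_latency) references per cycle.
    return max(1.0, mem_latency / num_fibers)

print(cycles_per_reference(1, 300))    # 300.0 - one thread stalls fully
print(cycles_per_reference(100, 300))  # 3.0   - latency partially hidden
print(cycles_per_reference(400, 300))  # 1.0   - latency fully hidden
```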

Use cases for Flow Parallel Processing Unit (PPU)

Is there a theoretical limit to what kind of Flow PPU can be added to a typical mobile, PC, or server CPU?
Flow PPU is parametric. It can be configured for any desired use case: number of Flow PPU cores (e.g., 16, 64, 256...), number and type of functional units (e.g., ALUs, FPUs, MUs, GUs, NUs), size of on-chip memory resources (caches, buffers, scratchpads), etc. The performance boost scales up with the number of PPU cores: for very small devices (e.g., a smartwatch), a Flow PPU with 4 cores would work well; a Flow PPU with 16 cores would be suitable for smartphones and laptops; a Flow PPU with 64 cores would work well for desktop computers; and a Flow PPU with 256 cores would likely be the most suitable configuration for AI and edge computing servers.

Can Flow PPU really be used in anything from a mobile phone to a supercomputer?
Yes, because Flow PPU is both configurable and parametric, it suits a wide range of use cases. The number of Flow PPU cores (e.g., 4, 16, 64, 256...), the number and type of functional units (e.g., ALUs, FPUs, MUs, GUs, NUs), and the size of on-chip memory resources (caches, buffers, scratchpads) are all parametric.

The performance boost scales up with the number of Flow PPU cores: for very small devices (e.g., a smartwatch) a Flow PPU with 4 cores is highly suitable; 16 cores for smartphones and laptops; 64 cores for PCs; and a Flow PPU matrix with 256 cores or more would likely be the most suitable configuration for a supercomputer.

What are the use cases for Flow PPU in AI?
Data pre- and post-processing currently accounts for up to 50% of the total time when an LLM is trained for a new language. This can be significantly reduced by high-performance, Flow PPU-powered CPUs. In addition, locally-hosted AI would become far more feasible. Many AI problems are parallel in nature, thus improved parallel processing performance could make a significant impact.

What are the use cases for Flow PPU in supercomputers and the defense industry?
Our technology can dramatically improve standard supercomputer performance. In addition, Flow PPU can be configured to reduce power consumption: due to its parametric nature, the increase in performance can be traded for power savings, so a 100x performance boost could be traded for a 10x performance boost with 10x lower power consumption.

In the defense industry, missiles, drones, and missile- and drone-defense systems are the most lucrative use cases, alongside military aviation. Whoever processes data and computes results the fastest will win in warfare. Flow therefore has a major geopolitical and defense impact.

What is the estimated performance benefit of using Flow PPU technology? In particular, how are the parallel computing performance gains likely to translate into improvements at the full application level?
The question of performance benefits is answered separately for software written specifically with our technology in mind and for legacy software. For the latter, the availability of source code allows programmers to recompile with a compiler that is aware of our technology; opportunities for performance improvement are then detected automatically by Flow’s compiler. A definite advantage is the possibility of running “classic code” on the CPU cores while exploiting parallelism on the PPU cores wherever it occurs in an application.

Think of it as similar to the original PowerPC chip in 1990s Macs. Older 68K software ran in compatibility mode, while the new software ran much faster using the PowerPC instruction set. A further performance benefit for legacy code can be obtained if an operating system or programming system libraries (e.g., sorting via qsort() in the C library) can be ported to utilize Flow and then run faster on Flow PPU cores, even if the application code itself is unmodified. There will be significant performance gains for most types of applications, especially those that exhibit degrees of parallelism but cannot be parallelized with current thread-based schemes.

What types of algorithms does Flow PPU technology likely work for, and how widely can it be applied? Are there some specific areas it is unlikely to work in?
Fields such as numeric and combinatorial simulation and optimization, which are widely used in business computing (from logistics planning to forecasting investments), will greatly profit from Flow. Such applications tend to be heavily parallelized, often for GPU clusters, and will benefit from the flexibility of thick control flows over GPU thread blocks. Our technology also works for classic numeric and non-numeric parallelizable workloads ranging from matrix or vector computation to sorting. In code with small parallelizable parts that are not parallelized because the runtime overhead is larger than the runtime benefit from parallelization, we still bring a performance boost. 

A growing field in which our technology is highly applicable is artificial intelligence. Both machine learning (e.g., training neural networks) and symbolic AI (often searching through large graphs) can benefit from Flow, as these applications are currently often run on GPUs but require CPU involvement in pre- and post-processing and often contain irregular patterns. In addition, the regularity requirement of GPU computing limits parallelization on GPUs, something that can be overcome by the flexibility of Flow PPU.

Adaptation of Flow Parallel Processing Unit (PPU)

Is Flow PPU's architecture dependent on state-of-the-art manufacturing processes?
Flow PPU can be integrated into any current or pending design architecture or silicon process.

Is Flow foundry- or architecture-dependent?
Flow is completely foundry- and architecture-independent. No changes to tools or processes are required. It can also be used with all instruction sets.

What is the applicability of Flow PPU’s technology? How widely could Flow PPU be adapted?
Flow’s technology is adaptable to any microprocessor-based system that currently uses single or multiple processor cores and/or massively parallel accelerator devices such as GPUs: the CPU cores handle the functionality with limited parallelism, and the Flow PPU cores take on the role of the accelerator. In that case, Flow PPU cores can take over from the CPU cores the parallel parts of the work that are currently not outsourced to the accelerator because they are either too irregular or too small to justify the overhead.

Thus, the technology is widely adaptable, ranging from traditional desktop and laptop computers to embedded systems and digital signal processors, and to smartphones, where part of the workload comes not from the user (such as video decoding) but from the system itself (e.g., software parts of the radio stack, which are numeric, such as digital signal processing applications, or at least high-throughput, such as network stack processing). Current microprocessor-based systems face the unenviable situation where operating frequencies can no longer be increased, so performance improvements for a single application must come from the use of parallelism rather than from faster operating frequencies. Furthermore, as memory devices have not become correspondingly faster, access to main memory is quite slow when measured in processor core cycles (hundreds of cycles). As a result, main memory access must be avoided as much as possible through the use of fast but small caches that store frequently accessed data items.

Efficient cache use necessitates a programming style that leads to particular memory access patterns, which often hinders parallelization. Our technology brings advantages in both areas: the Flow PPU allows a more flexible, and therefore more widespread, use of parallelization compared to parallel threads, thereby supporting programmability and programmer efficiency. Emulated shared memory hides long latencies instead of avoiding them: other tasks are scheduled for execution while the Flow PPU waits for a memory read.

When is Flow announcing the availability of this IP platform in its entirety?
We exited stealth in June 2024 with the announcement of our incorporation, funding, and the basic details of our patented Flow PPU architecture. We are still developing our IP platform and product, so stay tuned for future progress and the full details of our technical innovations. Companies that commit to early access to the technology will naturally receive more technical details early on.

Are there any fabs currently aligned with Flow? If not, which are most suited to deploying this? 
Not at the moment. We are totally fabless as well as foundry- and architecture-independent: no changes to tools or processes are required, and our technology can be used with all instruction sets. Thus, CPUs with Flow PPU can be deployed by any fab.

How does Flow provide its IP to a licensee? VHDL source, source code, final design schematics, compiled software?
Our target is to provide our IP to the RISC-V market as soft IP, i.e., synthesizable HDL. For the ARM and x86 markets, we will offer architecture-type licensing, which allows licensees to use our patents and other IP to implement their own Flow PPU.

Does Flow work with LLVM or support custom compilers?
Yes, our software stack is built on top of the LLVM compiler infrastructure, and we are developing custom extensions to support Flow PPU architecture. This allows us to integrate with existing toolchains and makes parallel programming more accessible to developers familiar with standard languages. As our technology matures, we plan to provide further documentation and tooling to support developers and partners.

Can developers or researchers access Flow’s technology or tools?
We are currently collaborating with selected partners during our early access phase. Broader availability of our development tools, documentation, and evaluation environments is planned as we approach commercial release. If you're a developer, researcher, or organization interested in testing or integrating our technology, we encourage you to get in touch.

Size, cost, and power consumption estimates of the Parallel Processing Unit (PPU)

How much die space does Flow PPU require to achieve significant performance over standard architectures?
It depends on the system configuration. In systems with a high number of processor cores, it is expected that several CPU cores could be substituted with Flow PPU. Flow PPU uses leftover die space without requiring any extra silicon area.

Our initial, very rough silicon area estimation model is based on legacy silicon technology parameters and public scaling factors. For a 64-core Flow PPU achieving a 38x - 107x speed-up in laboratory tests, the initial estimated silicon area is 21.7 mm² in a 3nm silicon process. For a 256-core Flow PPU achieving a 148x - 421x speed-up, the estimated area is 103.8 mm².

How much additional power draw does a Flow PPU pipeline typically require?
Flow PPU can actually be configured to reduce power consumption. Due to its parametric nature, performance can be traded for lower energy use: for example, a 100x performance boost could be traded for a 10x performance boost with 10x lower power consumption.
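As a back-of-envelope sketch of this trade, assume (purely for illustration) that both speedup and power scale linearly with the number of active PPU cores; the baseline figures below are hypothetical:

```python
# Illustrative sketch of trading performance for power, assuming speedup
# and power both scale linearly with the number of active PPU cores.

def scale_config(base_speedup, base_power_w, core_fraction):
    """Speedup and power when only core_fraction of the PPU cores are
    active (a deliberate simplification)."""
    return base_speedup * core_fraction, base_power_w * core_fraction

full = scale_config(100, 50.0, 1.0)      # hypothetical 100x boost at 50 W
reduced = scale_config(100, 50.0, 0.1)   # run one tenth of the cores
print(full, reduced)   # the reduced configuration: 10x boost at one tenth the power
```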

Power draw depends heavily on the desired configuration. Our initial power consumption estimation model (based on legacy silicon technology parameters and public scaling factors) indicates 43.4 W for a 64-core Flow PPU delivering a 38x - 107x speedup in laboratory tests, assuming a 3nm silicon process, and 235 W for a 256-core Flow PPU delivering a 148x - 421x speedup.
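Dividing the quoted estimates by the core counts gives rough per-core figures. This assumes area and power scale with core count, which is only a back-of-envelope reading of the numbers above:

```python
# Per-core figures implied by the area and power estimates quoted above
# (64-core and 256-core Flow PPU configurations, 3nm process).
configs = {
    64:  {"area_mm2": 21.7,  "power_w": 43.4},
    256: {"area_mm2": 103.8, "power_w": 235.0},
}
for cores, c in configs.items():
    print(f"{cores:>3} cores: {c['area_mm2'] / cores:.2f} mm2/core, "
          f"{c['power_w'] / cores:.2f} W/core")
```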

What are the tradeoffs between maintaining full backward software compatibility with existing architectures (e.g., x86, Arm, Power) vs. maximizing performance?
All existing software is compatible with CPUs that have Flow PPU matrices built in. The level of performance boost depends on the amount of parallelism in the software - the more parallelism, the greater the boost Flow PPU will generate, with no code changes and only recompilation required. If the libraries are already optimized for Flow PPU, even greater performance gains are achieved without any additional steps.

For maximum performance gains, it is possible to refactor the critical parts of the code or rewrite it entirely as natively parallel code. We will develop AI-assisted smart compiler tools to help companies identify which parts of their software can be parallelized.

Business and markets for Flow

What are some major clients Flow is currently working with?
We are in positive discussions with leading-edge CPU companies such as AMD, Arm, and Intel to co-develop the future era of advanced CPU computing, starting with server CPUs. We are also interacting with companies entering the server CPU market, such as Qualcomm, SiFive, and others using the open-source RISC-V instruction set.

Are there any target markets / products that Flow plans to especially address? 
Our unique Flow PPU architecture excels in general-purpose parallel computing and in the most demanding applications such as locally-hosted AI. We can also turbocharge server/cloud CPUs in data centers for uses such as edge and cloud computing, AI clouds, and more.

What is Flow’s business model, and what makes it almost unique among semiconductor companies?
Our business model is to license our technology, just as Arm does, to various licensees around the world. Flow PPU is totally independent of instruction set design, so it can be used in any modern CPU and integrated into any current or pending design architecture using any silicon process.

Who are the company’s primary competitors? 
Since Flow PPU is a unique, one-of-a-kind product that we have invented, patented, and trademarked, we do not have any direct competitors. In a sense, our biggest competitor is the industry’s inertia: CPU manufacturers keep improving performance in the current ways - adding more processor cores, using smaller feature sizes, and increasing clock frequencies - instead of adopting new alternatives that offer far higher performance.

What CPU vendors are most likely to consider licensing the Flow PPU architecture and why?
Our vision is for Flow PPU to be used in all future high-performance CPUs. The most likely CPU vendors to license Flow PPU are leading-edge CPU companies such as Arm, AMD, Intel, Apple, NVIDIA, Samsung, and Qualcomm. We also see strong potential among hyperscalers that use custom CPUs for specific use cases requiring the utmost CPU performance. Most RISC-V CPU companies are successfully developing these types of custom CPUs.

Why would compute powerhouses like AMD, Apple, ARM, Intel, NVIDIA, and Qualcomm ever consider licensing Flow PPU when they have already invested billions in their own designs?
These companies are constantly looking for breakthrough technologies to improve their products. The technologies used in current multicore CPUs cannot solve the inefficiencies that arise when executing parallel functionalities. We are developing a unique product with Flow PPU, one that will enable these computing powerhouses to step into a new era of CPU performance. Flow PPU is complementary in nature, benefitting all CPU computations and instruction sets. It is independent of, and compatible with, all existing fabs, foundries, tools, and processes, so it fits easily into their own designs. Parallel portions of code can be expressed as natural parallel statements without concerns about race conditions, deadlocks, or synchronization. We are now looking to engage with these leading-edge CPU companies to co-develop the future era of CPU computing.

Licensing a core architecture from a new, unknown startup is a BIG ask for a company like Intel (or any other one, for that matter). How can Flow assure a potential licensee that the architecture will be around for the foreseeable future?
This is a legitimate concern. We plan to increase resources and scale up capabilities such as technical customer support to provide rock-solid assurances to our customers. Our investors are fully committed to backing us on our growth journey, and we are obviously looking to bring new investors on board in future financing rounds. With positive market traction, we are certain of our ability to continue growing to meet the rigorous demands of our licensees. Customers who license our architecture will have full technical control over their design.

It all sounds great - but if it’s so amazing, why haven’t the multi-billion-dollar chip companies already done it? What’s the catch?
Parallel processing is not a “new kid on the block” in computing. It has actually been around conceptually since the 1970s, but it never became mainstream in the early days of the PC due to its programming complexity and inefficient architectures. As a result, past architectural choices led to current multicore CPUs replicating processor cores originally optimized for sequential computing.

Over time, the industry has learned to settle for incremental performance gains achieved by adding cores and raising clock frequency, and it has become complacent with cumbersome and unproductive programming techniques. Driven by this, we have meticulously researched and developed parallel processing technology. With the PPU and our technology stack, we can finally combine all the benefits of current CPUs and parallel processing.

Flow as a company

How long has Flow been in business as an independent company?
The company was established in January 2024 as a spin-off from VTT Technical Research Center of Finland. 

How much money has the company raised to date? 
In total, €4M.

What round of funding is the company currently in? 
Flow closed the first funding round (pre-seed) on January 31st, 2024.

How many people does Flow employ?
The company has three co-founders and has hired the best pan-European industry talent to work on the first commercial version of PPU and its compiler. The current headcount is twelve people.

Which angels and venture firms have invested in Flow to date? 
Butterfly Ventures (Finland), Sarsia (Norway), Stephen Industries (Finland), Superhero Capital (Finland), and FOV Ventures (Finland/UK).

What were some of the key reasons VCs decided to invest in Flow? 
Our investors were especially excited about the innovativeness and uniqueness of Flow’s technology, its strong IP portfolio, and the potential to enable a new era of superCPUs for the AI revolution.

Who are Flow's founders and what makes them qualified to create all of this?
Flow was founded by Dr. Martti Forsell, Jussi Roivainen, and Timo Valtonen. Dr. Forsell has been researching parallel processor architectures and programming for several decades, first at the University of Joensuu (now the University of Eastern Finland) and later at the VTT Technical Research Center of Finland, where Jussi Roivainen joined his research team. Timo Valtonen joined the team to drive and plan the commercialization of the research and technology.

The initial idea was to develop the fastest CPU in the world! Alongside this original idea, the team started in parallel (bad pun, sorry) to explore the possibility of creating a product that could be used by all CPU manufacturers. Flow PPU and Flow were born from this, alongside the vision that Flow PPU would be used in all high-performance CPUs, ushering in a new era in CPU computing. Years of joint work have led to this point, and the founders now have a funded company to fulfill that vision.

Who are Flow's advisory team members? 
Flow doesn't have an advisory team yet. We are currently evaluating candidates to form an advisory team later this year, balanced between marketing and technical expertise.

Does Flow plan to offer regional offices in other geographies?
Flow Computing is headquartered in Finland, and the team is distributed across several European countries. In the future, we plan to open an office in the USA.

Where was Flow’s original design architecture conceptualized?
Dr. Martti Forsell began researching parallel processor architectures and parallel computing at the University of Joensuu (now the University of Eastern Finland). He continued his research at VTT Technical Research Center of Finland, together with Jussi Roivainen. Flow Computing’s technologies and patents are the result!

Does VTT Research still own a stake in the company or rights to IP? 
VTT is a minority owner of Flow Computing following a significant IP contribution to the company. Flow has full rights to and ownership of that IP, and is continuously creating new IP around parallel processing.

Is Flow Computing publicly traded? / Can I invest in Flow Computing or is it listed on the stock market?
Flow Computing is a privately held company and is not publicly traded. We are currently backed by venture capital and private investors. While individual investment opportunities are not available at this time, we welcome interest from qualified institutional investors and strategic partners.

Is Flow a chip company? Does it make physical PPUs?
Flow is a fabless semiconductor IP company. We do not manufacture chips or physical processors. Instead, we provide Flow PPU as a soft IP. Our partners, such as chipmakers and system integrators, can integrate Flow PPU into their own silicon or SoC designs.

Where can I see performance benchmarks or technical documentation?
Performance benchmarks and technical documentation are available to qualified parties upon request. We share some high-level performance results publicly on our website and LinkedIn, and in selected publications. For detailed technical access or evaluation opportunities, please contact our team directly here.

Where can I follow Flow’s latest updates?
Our News and Events pages are the best places to stay up to date with major announcements and where to find us next.

We also share updates on LinkedIn, Bluesky, Facebook, Instagram, and YouTube. But if you’d rather not miss a thing, our newsletter brings it all together: milestones, performance insights, and behind-the-scenes content, sent only when we have something worth sharing.
Subscribe here.

Didn't find what you're looking for?

Contact us with your question. 
