Flow's Parallel Processing Unit solves one of computing's most fundamental challenges: parallel processing.

Core benefits of Flow's technology

100x performance boost

Flow's innovative Parallel Processing Unit (PPU) enhances CPU performance by up to 100 times, ushering in the era of SuperCPUs. 

Designed for full backward compatibility, the PPU boosts the performance of existing software and applications after recompilation. The more parallel the functionality, the greater the boost in performance. 
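
This relationship can be quantified with Amdahl's law, which Google also cites further down this page. As a rough illustration (the numbers below are assumed, not measured): if a fraction p of a program's runtime is parallelizable and the PPU accelerates that fraction by a factor s, the overall speedup is

    % Amdahl's law: overall speedup when a fraction p of the runtime
    % is parallelizable and that fraction is accelerated by factor s
    \[
      S(p, s) = \frac{1}{(1 - p) + \dfrac{p}{s}}
    \]

With s = 256 cores, a program that is 50% parallel gains about 2x, while one that is 99% parallel gains about 72x, which is why the boost grows with the degree of parallelism.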

Flow's technology even enhances the entire computing ecosystem. While the CPU gains direct benefits, ancillary components—matrix units, vector units, NPUs, and GPUs—also see improved performance through the boosted CPU capabilities, all thanks to the PPU.

2x faster legacy software and applications

Flow’s PPU enhances legacy code without altering the original application, and delivers greater performance gains when paired with recompiled operating system or programming system libraries. 

The result? Significant speed improvements across a wide range of applications, especially those that exhibit parallelism but are constrained by traditional thread-based processing. Our PPU unlocks the full potential of these applications, delivering performance gains where previous architectures fell short.

Parametric design

The configurable, parametric design of the PPU allows it to adapt to diverse uses. Everything can be tailored to the specific requirements of each use case. And we do mean everything: the number of PPU cores (4, 16, 64, 256, or more), the type and number of functional units (such as ALUs, FPUs, MUs, GUs, and NUs), and even the size of on-chip memory resources (caches, buffers, and scratchpads).

The performance scalability is directly linked to the number of PPU cores. A PPU with 4 cores is ideal for small devices like smart watches, a 16-core PPU is perfect for smartphones, and a 64-core PPU delivers excellent performance for PCs. For servers, a PPU with 256 cores is recommended, enabling them to handle the most demanding computing tasks with ease.

What is a Parallel Processing Unit?

The Parallel Processing Unit (PPU) is an IP block that integrates tightly with the CPU on the same silicon. It is designed to be highly configurable to meet the specific requirements of numerous use cases. 

Customization options include:

  • Number of cores in PPU (4, 16, 64, 256, etc.)
  • Type and number of functional units (such as ALUs, FPUs, MUs, GUs, NUs)
  • Size of on-chip memory resources (caches, buffers, scratchpads)
  • Instruction set modifications tailored to complement the CPU’s instruction set extensions

CPU modifications are minimal, involving the integration of the PPU interface into the instruction set and an update to the number of CPU cores to unlock new levels of performance.
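
To make the parameter space above concrete, here is a minimal sketch in C of what a PPU configuration record could look like. All names, fields, and values are hypothetical illustrations, not Flow's actual configuration interface:

    /* Hypothetical sketch of the PPU's parametric design space.
     * Field names and values are invented for illustration; they are
     * not Flow's actual configuration interface. */
    #include <stdio.h>

    typedef struct {
        int cores;          /* 4, 16, 64, 256, ... */
        int alus_per_core;  /* arithmetic-logic units */
        int fpus_per_core;  /* floating-point units */
        int scratchpad_kib; /* on-chip scratchpad per core */
        int cache_kib;      /* on-chip cache */
    } ppu_config;

    int main(void) {
        /* Example instantiations mirroring the scaling described in the text */
        ppu_config watch  = { .cores = 4,   .alus_per_core = 2, .fpus_per_core = 1,
                              .scratchpad_kib = 32,  .cache_kib = 64 };
        ppu_config server = { .cores = 256, .alus_per_core = 8, .fpus_per_core = 4,
                              .scratchpad_kib = 256, .cache_kib = 1024 };
        printf("watch: %d cores, server: %d cores\n", watch.cores, server.cores);
        return 0;
    }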

Our parametric design enables extensive customization across all of these dimensions. Your performance enhancement scales with the number of PPU cores: a 4-core PPU is ideal for small devices like smart watches, a 16-core PPU for smartphones, a 64-core PPU for PCs, and a 256-core PPU excels in high-demand environments such as AI, cloud, and edge computing servers.

PPU enables a new era of SuperCPUs

100x CPU performance

How is the 100x boost possible?

Here's how the PPU solves the challenges around CPU latency, synchronization, and virtual ILP/LLP. Our innovative, patented key technologies are implemented in the PPU, and together they make a performance boost of up to 100x a reality.

1. Latency hiding

CURRENT MULTICORE CPU: Memory accesses, and especially shared accesses, represent a major challenge for multicore CPUs. Memory references slow down execution, and the inter-core communication network adds latency. Traditional cache hierarchies cause coherency and scalability problems. 

FLOW PPU: The latency of memory references is hidden by executing other threads while the memory is being accessed. There are no coherency problems, since no caches are placed in front of the network. Scalability is provided via a high-bandwidth network-on-chip. 
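
As a concrete illustration, the following C program simulates the principle in software: while one thread's (simulated) memory access is in flight, the core issues work from other ready threads, so no cycle is wasted waiting. The thread count, latency, and round-robin policy are assumptions for the sketch; the actual PPU implements this scheduling in hardware:

    /* Minimal software model of latency hiding by thread interleaving:
     * while one thread waits on a (simulated) memory access, the core
     * issues instructions from other ready threads.  Purely illustrative. */
    #include <stdio.h>

    #define THREADS     4
    #define MEM_LATENCY 3   /* cycles a load stays outstanding (assumed) */

    typedef struct { int ready_at; int work_left; } thread_t;

    int main(void) {
        thread_t t[THREADS];
        for (int i = 0; i < THREADS; i++) t[i] = (thread_t){ 0, 5 };

        for (int cycle = 0, busy = THREADS; busy > 0; cycle++) {
            for (int i = 0; i < THREADS; i++) {           /* round-robin issue */
                if (t[i].work_left > 0 && t[i].ready_at <= cycle) {
                    t[i].work_left--;                     /* issue one op ... */
                    t[i].ready_at = cycle + MEM_LATENCY;  /* ... e.g. a load  */
                    if (t[i].work_left == 0) busy--;
                    printf("cycle %2d: thread %d issues (load in flight)\n",
                           cycle, i);
                    break;                                /* one issue per cycle */
                }
            }
        }
        return 0;
    }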

2. Synchronization

CURRENT MULTICORE CPU: Exploiting parallelism brings additional challenges. Because a CPU's processor cores are inherently asynchronous, threads must be synchronized whenever there are inter-thread dependencies. These synchronizations are very expensive, taking 100 to 1,000 clock cycles.

FLOW PPU: Synchronizations are needed only once per step, since the threads are independent of each other within a step (dropping the cost to 1). Synchronizations are also overlapped with execution (dropping the cost to 1/100).
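
A rough software analogy of this step-wise model is sketched below in C, using a POSIX-threads barrier as a stand-in for the PPU's hardware mechanism. Within each step, every thread writes only its own slot (no inter-thread dependencies), so a single barrier per step is the only synchronization needed. The data layout and step count are invented for the example:

    /* Sketch of step-wise execution: threads touch disjoint data within
     * a step, and one barrier per step replaces fine-grained locking.
     * POSIX threads stand in for the PPU's hardware synchronization. */
    #define _POSIX_C_SOURCE 200809L
    #include <pthread.h>
    #include <stdio.h>

    #define N     4
    #define STEPS 3

    static pthread_barrier_t barrier;
    static int data[N];

    static void *worker(void *arg) {
        long id = (long)arg;
        for (int step = 0; step < STEPS; step++) {
            data[id] += (int)id + step;        /* independent within a step */
            pthread_barrier_wait(&barrier);    /* one synchronization per step */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[N];
        pthread_barrier_init(&barrier, NULL, N);
        for (long i = 0; i < N; i++) pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < N; i++)  pthread_join(t[i], NULL);
        for (int i = 0; i < N; i++)  printf("data[%d] = %d\n", i, data[i]);
        pthread_barrier_destroy(&barrier);
        return 0;
    }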

3. Virtual ILP/LLP

CURRENT MULTICORE CPU: Low-level parallelism is handled suboptimally. Multiple instructions can be executed in multiple functional units only if the instructions are independent. Pipeline hazards slow down instruction execution. 

FLOW PPU: Functional units are organized as a chain in which a unit can use the results of its predecessors as operands. Dependent code can execute within a single step of execution, and pipeline hazards are eliminated. 
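
The effect can be illustrated with a short dependent chain in C. On a conventional superscalar these three operations cannot issue in parallel, because each consumes the previous result; on a chained organization, each functional unit forwards its result directly to the next, so the whole chain fits in a single step. The FU mapping in the comments is purely illustrative, not a real instruction encoding:

    #include <stdio.h>

    /* Three dependent operations: a conventional superscalar must
     * serialize them, while chained functional units can forward
     * results down the chain within one step (illustrative mapping). */
    int chained(int x) {
        int a = x + 1;    /* FU0: ALU                   */
        int b = a * 3;    /* FU1: consumes FU0's result */
        int c = b - x;    /* FU2: consumes FU1's result */
        return c;
    }

    int main(void) {
        printf("%d\n", chained(5));  /* prints 13 */
        return 0;
    }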

Performance boost for existing software & applications

Flow technology is fully backward compatible with all existing legacy software and applications. The PPU's compiler automatically recognizes the parallel parts of the code and executes them on PPU cores. 
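
For example, a loop like the standard SAXPY kernel below has fully independent iterations, which is exactly the property a parallelizing compiler looks for. The C code is a generic illustration, not output of the PPU toolchain:

    #include <stdio.h>

    /* SAXPY: y = a*x + y.  Every iteration is independent, so the
     * iteration space can be split across cores without changing
     * the program's meaning. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* no cross-iteration dependencies */
    }

    int main(void) {
        float x[4] = {1, 2, 3, 4}, y[4] = {1, 1, 1, 1};
        saxpy(4, 2.0f, x, y);                 /* y becomes {3, 5, 7, 9} */
        for (int i = 0; i < 4; i++) printf("%g ", y[i]);
        printf("\n");
        return 0;
    }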

What’s more, Flow is developing an AI tool to help application and software developers identify parallel parts of the code and to propose methods of streamlining those for maximum performance.

Why is better CPU performance vital for future industries?

“While our investments in compute accelerators have transformed our customers’ capabilities, general-purpose compute is and will remain a critical portion of our customers’ workloads. Analytics, information retrieval, and ML training and serving all require a huge amount of compute power. Customers and users who wish to maximize performance, reduce infrastructure costs, and meet sustainability goals have found that the rate of CPU improvements has slowed recently. Amdahl’s Law suggests that as accelerators continue to improve, general purpose compute will dominate the cost and limit the capability of our infrastructure unless we make commensurate investments to keep up.” 

Extract from Google's announcement of its first Arm-based CPU, April 9th, 2024

Artificial intelligence

General-purpose computing is and will remain a critical portion of numerous AI workloads. Analytics, information retrieval, and ML training and serving all require massive amounts of computing power. 

All parties wishing to maximize performance, reduce infrastructure costs, and meet sustainability goals have run up against a slowing rate of CPU improvement. Unless CPUs can keep up, general-purpose computing will limit the capability and dominate the cost of AI.

Autonomous vehicle systems

While we may not be manufacturing autonomous flying cars, the intricate technology behind such innovations requires immense parallel processing power, which is exactly what next-generation CPUs enhanced by Flow's PPU are designed to provide. They deliver the robust performance necessary for the high-speed, real-time data processing that autonomous vehicle systems demand.
 
Moreover, Flow's technology excels in edge computing environments where low latency is critical, ensuring that decision-making processes are as swift as they are reliable. By integrating Flow's PPU, autonomous systems gain the capability to react instantaneously to dynamic conditions – enhancing safety and efficiency.

New opportunities

Emerging fields such as simulation and optimization – widely used in business computing, from logistics planning to investment forecasting – stand to profit greatly from Flow. Such applications tend to be heavily parallelized and will benefit from the flexibility of Flow technology over GPU thread blocks. 

Flow technology also works for classic numeric and non-numeric parallelizable workloads, from matrix and vector computations to sorting. Even in code with only small parallelizable parts – currently left unparallelized because the runtime overhead exceeds the runtime benefit – Flow's PPU will nonetheless boost overall performance.
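
One way to see why such regions are left sequential today is a simple cost model (the figures here are assumptions for illustration). Parallelizing a region of sequential runtime T across c cores pays off only when the overhead O of creating and synchronizing threads satisfies

    \[
      \frac{T}{c} + O \;<\; T
      \quad\Longleftrightarrow\quad
      O \;<\; T\left(1 - \frac{1}{c}\right)
    \]

With software threads, O is typically on the order of microseconds, so regions shorter than that stay sequential. If synchronization cost drops to around a clock cycle, as described in the synchronization section above, the break-even region size shrinks to almost nothing.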
