Flow's Parallel Processing Unit solves one of computing's most fundamental dilemmas: efficient parallel processing.
Core benefits of Flow's technology
100x performance boost
Flow's innovative Parallel Processing Unit (PPU) amplifies CPU performance by up to 100 times, ushering in an era of SuperCPUs.
Designed for full backward compatibility, the PPU enhances existing software and applications after recompilation. The more parallel the functionality, the greater the boost in performance.
Flow's technology even enhances the entire computing ecosystem. The CPU gains direct benefits, but ancillary components - matrix units, vector units, NPUs, and GPUs - also experience enhanced performance through the boosted CPU capabilities. All thanks to the PPU.
2x faster legacy software and applications
Flow’s PPU not only enhances legacy code without altering the original application, but also boosts performance when paired with recompiled operating system or programming system libraries.
The result? Substantial speed improvements across a wide array of applications, particularly those that display parallelism but are constrained by traditional thread-based processing. Our PPU unlocks the full potential of these applications, bringing significant gains in performance where previous architectures fell short.
Parametric design
The configurable, parametric design allows the PPU to adapt to multiple uses. Everything can be tailored to the specific requirements of a given use case. And we do mean everything: the number of PPU cores (4, 16, 64, 256, or more); the type and number of functional units, such as ALUs, FPUs, MUs, GUs, and NUs; and even the size of on-chip memory resources, including caches, buffers, and scratchpads.
Performance scales directly with the number of PPU cores. A PPU with 4 cores is ideal for small devices like smart watches, a 16-core PPU is perfect for smartphones, and a 64-core PPU offers excellent performance for PCs. For servers, a 256-core PPU is recommended, allowing them to handle the most demanding computing tasks with ease.
What is a Parallel Processing Unit?
The Parallel Processing Unit (PPU) is an IP block that integrates tightly with the CPU on the same silicon. It is designed to be highly configurable to specific requirements of numerous use cases.
Customization options include:
- Number of cores in PPU (4, 16, 64, 256, etc.)
- Number and type of functional units (such as ALUs, FPUs, MUs, GUs, NUs)
- Size of on-chip memory resources (caches, buffers, scratchpads)
- Instruction set modifications to complement the CPU’s instruction set extension
CPU modifications are minimal: the PPU interface is integrated into the CPU's instruction set, and the number of CPU cores can be adjusted to leverage the new level of performance.
Our parametric design allows extensive customization, including the number of PPU cores, the variety and number of functional units, and the size of on-chip memory resources. Performance enhancement scales up with the number of PPU cores: a 4-core PPU is ideal for small devices like smart watches; a 16-core PPU suits smartphones; a 64-core PPU fits well in PCs; and a 256-core PPU is best suited for high-demand environments like AI, cloud, and edge computing servers.
How is the 100x boost possible?
Here's how the PPU solves the challenges of CPU latency, synchronization, and virtual-level parallelism. Our innovative, patented solutions to these challenges are implemented in the PPU, and together they make an up to 100x performance boost a reality.
1. Latency hiding
CURRENT MULTICORE CPU: Memory accesses, and especially shared accesses, represent a big challenge for multicore CPUs. Memory references slow down execution, and the inter-core communication network adds latency. Traditional cache hierarchies cause coherency and scalability problems.
FLOW PPU: The latency of memory references is hidden by executing other threads while a memory access is in flight. There are no coherency problems, since no caches are placed in front of the network. Scalability is provided via a high-bandwidth network-on-chip.
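The effect of latency hiding can be sketched with a toy cost model. The latency and thread counts below are illustrative assumptions for the sketch, not Flow specifications:

```python
# Toy cost model of latency hiding (illustrative only, not Flow's actual
# microarchitecture): a core round-robins over many hardware threads, so a
# thread's memory reference has completed before its next turn to issue.
MEM_LATENCY = 20   # assumed cycles per memory reference
THREADS = 20       # assumed hardware threads multiplexed on one core

def cycles_per_reference(threads, latency):
    """Effective cycles per memory reference with round-robin issue.
    If there are fewer threads than latency cycles, the core stalls
    for the remaining cycles on each round."""
    return 1 + max(0, latency - threads)

print(cycles_per_reference(1, MEM_LATENCY))        # 20: a lone thread stalls fully
print(cycles_per_reference(THREADS, MEM_LATENCY))  # 1: the latency is hidden
```

With at least as many interleaved threads as latency cycles, the core issues useful work every cycle and the memory latency disappears from the critical path.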
2. Synchronization
CURRENT MULTICORE CPU: Exploiting parallelism brings additional challenges. Due to the inherent asynchronicity of the CPU's processor cores, threads must be synchronized whenever there are inter-thread dependencies. These synchronizations are very expensive, taking 100 to 1,000 clock cycles.
FLOW PPU: Synchronization is needed only once per step, since threads are independent of each other within a step (dropping the cost to 1). Synchronizations are also overlapped with execution (dropping the cost further, down to 1/100).
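The step-wise execution model described above can be illustrated with ordinary software threads. This is a conceptual sketch only; the thread and step counts are arbitrary, and a real PPU does this in hardware:

```python
import threading

# Sketch of bulk-synchronous, step-wise execution: within a step, threads
# touch only independent data, so one barrier per step replaces the
# per-dependency synchronization of traditional thread-based code.
N_THREADS = 4
STEPS = 3
barrier = threading.Barrier(N_THREADS)
data = [0] * N_THREADS   # one independent slot per thread

def worker(tid):
    for _ in range(STEPS):
        data[tid] += 1    # independent work within the step
        barrier.wait()    # the only synchronization: once per step

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(data)  # [3, 3, 3, 3]
```

Instead of one expensive synchronization per inter-thread dependency, the whole step pays for a single barrier, which the PPU then overlaps with execution.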
3. Virtual ILP/LLP
CURRENT MULTICORE CPU: Suboptimal handling of low-level parallelism. Multiple instructions can execute in multiple functional units only if the instructions are independent. Pipeline hazards slow down instruction execution.
FLOW PPU: Functional units are organized as a chain in which a unit can use the results of its predecessors as operands. Dependent code can execute within a single step, and pipeline hazards are eliminated.
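The idea of chained functional units can be modeled in a few lines. The chain layout and operations below are illustrative, not Flow's actual unit configuration:

```python
# Model of chained functional units: each unit consumes its predecessor's
# result, so a dependent expression like ((a + b) * c) - d can complete
# within one step rather than serializing across pipeline stages.
from operator import add, mul, sub

def run_chain(units, operands, start):
    """Feed `start` through a chain of functional units, each taking the
    previous unit's result plus one fresh operand."""
    acc = start
    for op, x in zip(units, operands):
        acc = op(acc, x)   # unit uses predecessor's result as an operand
    return acc

# ((2 + 3) * 4) - 5
print(run_chain([add, mul, sub], [3, 4, 5], 2))  # 15
```

On a conventional superscalar core, these three dependent operations cannot issue in parallel; a chained organization executes them as one flow through successive units.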
Performance boost for existing software & applications
Flow technology is fully backward compatible with all existing legacy software and applications. The PPU's compiler automatically recognizes the parallel parts of the code and executes them on the PPU cores.
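As a hypothetical illustration of the kind of code such a compiler can target, consider a loop whose iterations carry no cross-iteration dependencies. The example is plain Python for readability; in Flow's toolchain the recognition would happen at compile time:

```python
# A textbook parallelizable loop: every output element depends only on
# its own inputs, so each iteration could run on a separate PPU core.
# (Illustrative example, not taken from Flow's compiler.)
def saxpy(a, x, y):
    """Scaled vector addition, a*x + y, element by element."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

Loops like this are exactly the "parallel parts" a compiler can detect and distribute, with no change to the application's source.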
What’s more, Flow is developing an AI tool to help application and software developers identify parallel parts of the code and to propose methods of streamlining those for maximum performance.
Why is better CPU performance vital for future industries?
“While our investments in compute accelerators have transformed our customers’ capabilities, general-purpose compute is and will remain a critical portion of our customers’ workloads. Analytics, information retrieval, and ML training and serving all require a huge amount of compute power. Customers and users who wish to maximize performance, reduce infrastructure costs, and meet sustainability goals have found that the rate of CPU improvements has slowed recently. Amdahl’s Law suggests that as accelerators continue to improve, general purpose compute will dominate the cost and limit the capability of our infrastructure unless we make commensurate investments to keep up.”
Extract from Google's announcement of its first Arm-based CPU, April 9th, 2024
Artificial intelligence
General purpose computing is and will remain a critical portion of numerous AI workloads. Analytics, information retrieval, and ML training and serving all require massive amounts of computing power.
All parties wishing to maximize performance, reduce infrastructure costs, and meet sustainability goals have run against a slowing rate of CPU improvement. Unless CPUs can keep up, general purpose computing will limit the capability and dominate the cost of AI.
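The reference to Amdahl's law can be made concrete with a worked example. The workload split below is an illustrative assumption, not a measured figure:

```python
# Amdahl's law: even a 100x accelerator is capped by the unaccelerated
# fraction of the workload. Assume 90% of a job is accelerated 100x and
# 10% stays on an unimproved CPU (illustrative numbers).
def amdahl_speedup(parallel_fraction, accel_speedup):
    """Overall speedup when only `parallel_fraction` of the work
    benefits from an `accel_speedup`-times faster unit."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / accel_speedup)

print(round(amdahl_speedup(0.9, 100), 2))  # 9.17: the CPU-bound 10% dominates
```

This is the point of the quote above: as accelerators improve, the general-purpose CPU portion dominates total runtime and cost unless CPU performance improves commensurately.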
Autonomous vehicle systems
While we may not be manufacturing autonomous flying cars, the intricate technology behind such innovations requires immense parallel processing power, which is exactly what next-generation CPUs enhanced by Flow's PPU are designed to provide. They deliver the robust performance necessary for the high-speed, real-time data processing that autonomous vehicle systems demand.
Moreover, Flow's technology excels in edge computing environments where low latency is critical, ensuring that decision-making processes are as swift as they are reliable. By integrating Flow's PPU, autonomous systems gain the capability to react instantaneously to dynamic conditions – enhancing safety and efficiency.
New opportunities
Emerging fields such as simulation and optimization – widely used in business computing from logistics planning to investment forecasting – stand to profit greatly from Flow. Such applications tend to be heavily parallel and benefit from the flexibility of Flow technology over GPU thread blocks.
Flow technology also works for the classic numeric and non-numeric parallelizable workloads, from matrix or vector computations to sorting. Even in code with only small parallelizable parts – which are currently not parallelized because runtime overhead is larger than runtime benefit – Flow's PPU will nonetheless boost the overall performance.
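The break-even argument in the last sentence can be sketched numerically. The cycle counts below are illustrative assumptions about thread-based runtimes, not measurements of any specific system:

```python
# Break-even sketch: with traditional thread-based runtimes, spawning and
# synchronizing threads costs so many cycles that short parallel regions
# are left serial. Low-overhead parallelism changes that calculus.
def parallel_pays_off(work_cycles, n_cores, overhead_cycles):
    """True if splitting the work across n_cores (plus fixed parallel
    overhead) beats just running it serially."""
    return work_cycles / n_cores + overhead_cycles < work_cycles

print(parallel_pays_off(400, 4, 1000))  # False: overhead swamps a small region
print(parallel_pays_off(400, 4, 10))    # True: low overhead makes it worthwhile
```

Regions like the first case are exactly the "small parallelizable parts" that today go unparallelized; drive the overhead down and they become profitable.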