Flow Computing Podcast Series: Prof. Dr. Jörg Keller on the PPU (Episodes 1–6)

The Parallel Processing Unit (PPU) represents a revolutionary step forward in computing, addressing challenges in latency, energy efficiency, and parallelization. In this special compilation of Episodes 1–6 of the Flow Computing Podcast, Dr. Jörg Keller shares expert insights into the groundbreaking advancements the PPU brings to the industry. From handling legacy software to real-world applications and hardware diversification, this series covers it all.

Here’s a quick preview of what you’ll learn in each episode:

  • Episode 1: Discover how PPU cores simplify parallelization with lower overhead and energy consumption compared to traditional CPU cores.
  • Episode 2: Learn how Flow Computing’s compiler and binary-to-binary translator unlock performance improvements for both legacy and modern software.
  • Episode 3: Explore real-world use cases, from high-definition video processing to synthetic video production, where the PPU excels.
  • Episode 4: Understand how the PPU seamlessly integrates with existing CPU architectures, simplifying adoption and enhancing performance.
  • Episode 5: Dive into the PPU’s approach to tackling latency, decoupling memory access delays from application reaction times.
  • Episode 6: See how the PPU addresses hardware diversification challenges, paving the way for efficient, scalable, and accessible computing solutions.

Watch the full video series below to dive deeper into the PPU’s groundbreaking impact!

Transcript

00:00 Introduction

Antti Mäkelä: Welcome to the Flow Computing Podcast. Today, we'll dive deeper into Flow Computing's PPU technology. Joining us is Professor Dr. Jörg Keller, professor at the Faculty of Mathematics and Computer Science at FernUniversität in Hagen. Professor Keller has done the technical due diligence on Flow's Parallel Processing Unit (PPU). Jörg, welcome to the show!

JÖRG KELLER: Hello, it's good to be here. Let's talk!

00:39 Episode 1: Introduction to Flow's PPU Technology

What is the PPU, and why is it such a game-changer for computing power?

Multi-Core Processors and Evolution of Cores

JÖRG KELLER: Today, we have multi-core processors—processors with a larger number of cores to increase computing performance.

JK: For quite some time, these computing cores have all been the same. Then, some time ago, manufacturers started to introduce processors with different types of cores, like Arm's big.LITTLE technology.

Still, these cores have remained independent of each other.

Introduction to Flow's PPU Technology

JK: Now, when we have code that can be parallelized, it would be nice to have a number of cores that basically do the same thing, and do it without the overhead caused by each core operating independently.

This is where Flow's technology and the PPU cores come in.

How PPU Cores Reduce Overhead

JK: Next to the standard CPU cores, there is a set of PPU cores that largely perform the same tasks at certain times. 

Thus, they can speed up some code without each core having to fetch the program for itself: they all execute the same code and don't need to maintain the structures that normal CPU cores require to run a program.

So, we have an easy way to parallelize with less overhead.
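
Flow's actual programming interface is not public, so as a rough sketch of the pattern Jörg describes, here is a data-parallel loop in plain C, with an OpenMP pragma standing in for PPU parallelization: every worker runs the same loop body on different data, with no per-worker program of its own.

    #include <stddef.h>

    /* Every worker executes the same code on a different slice of the
     * data: the "all cores do the same thing" pattern described above.
     * The OpenMP pragma is a conventional multi-core stand-in, not
     * Flow's API. */
    void scale_offset(float *x, size_t n, float a, float b)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            x[i] = a * x[i] + b;
    }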

Efficiency and Power Consumption of PPU Cores

JK: Moreover, PPU cores can be smaller than a typical CPU core, so their power consumption is also lower.

02:47 Episode 2: Performance Gains

How does the PPU benefit both new and legacy software, particularly when source code is available?

Recompiling Original Code for Performance Gains

JÖRG KELLER: If the original source code is available, programmers can recompile it with a compiler that is aware of Flow Computing's technology.

The compiler can then detect possibilities for performance improvements and automatically turn them into executable code. 

Fine-Tuning Source Code
JK: The programmer can also fine-tune the source code, given additional time and effort.
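
As a hedged illustration (OpenMP syntax again stands in for whatever hints Flow's toolchain accepts), the reduction below is a pattern an auto-parallelizing compiler can detect on its own; the restrict qualifiers and the pragma are examples of the kind of fine-tuning a programmer might add by hand.

    #include <stddef.h>

    /* A dot product: a compiler can auto-detect the reduction, or the
     * programmer can make it explicit. "restrict" promises the arrays
     * don't alias, a hand fine-tuning that helps any optimizer. */
    double dot(const double *restrict a, const double *restrict b, size_t n)
    {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }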

Binary-to-Binary Translation for Legacy Executables
JK: If we don’t have the source code, we can still use a binary-to-binary translator that searches the original executable for code patterns that can be transformed into faster code when we have PPU cores.

Enhancing Performance Through Libraries and Operating Systems
JK: A further benefit for legacy code can be obtained if operating systems or programming libraries are recompiled for Flow Computing.

Then, those libraries will run faster on the PPU cores, even if the application code itself remains unmodified and just calls the improved libraries.
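
A minimal two-file sketch of that library route, with vec_add as a hypothetical library routine (not a real Flow or libc API): the application below is never touched, and the speedup would come entirely from rebuilding the library side with a PPU-aware compiler and relinking.

    /* libvec.c -- library side; only this file would be rebuilt with a
     * PPU-aware compiler. */
    void vec_add(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i++)   /* parallelizable by such a compiler */
            out[i] = a[i] + b[i];
    }

    /* app.c -- unmodified application; it just calls the improved library. */
    #include <stdio.h>

    void vec_add(const float *a, const float *b, float *out, int n);

    int main(void)
    {
        float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, out[4];
        vec_add(a, b, out, 4);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }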

04:04 Episode 3: Use Cases of PPU

Can you give us examples of real-world use cases where the PPU's flexibility delivers significant performance improvements, even beyond AI applications?

The Role of Video in Modern Applications

JÖRG KELLER: Nowadays, applications quite frequently involve video.

So, the easiest thing might be video decoders, used when we play a pre-recorded video.

Parallelism in Video Decoding
JK: Those decoders typically exhibit a sufficient amount of parallelism, which is often either not exploited or not exploited very well. 

On the other hand, a growing fraction of video is moving from high definition to 4K resolution, and we are also seeing higher frame rates.

So, we need more processing performance, and this could be delivered by parallelization via PPUs without the need for energy-hungry processors.
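
For a concrete sense of where that parallelism lives: many codecs split each frame into independent tiles, so decoding is the same code run once per tile. The sketch below is hypothetical (decode_tile's placeholder body just fills pixels; a real decoder would entropy-decode and reconstruct), with an OpenMP pragma again standing in for PPU parallelization.

    typedef struct { int x, y, w, h; } Tile;

    /* Placeholder per-tile work; a real decoder would entropy-decode,
     * predict, and reconstruct here. */
    static void decode_tile(const unsigned char *bs, Tile t,
                            unsigned char *frame, int stride)
    {
        for (int r = 0; r < t.h; r++)
            for (int c = 0; c < t.w; c++)
                frame[(t.y + r) * stride + (t.x + c)] = bs[0];
    }

    /* Tiles are independent, so every worker runs the same decode code
     * on a different tile: a natural fit for PPU-style parallelism. */
    void decode_frame(const unsigned char *bs, const Tile *tiles, int ntiles,
                      unsigned char *frame, int stride)
    {
        #pragma omp parallel for
        for (int i = 0; i < ntiles; i++)
            decode_tile(bs, tiles[i], frame, stride);
    }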

Synthetic Video Production

JK: The other side is the production of synthetic video from scripts, for example, presenting avatars or explaining concepts to people without pre-recorded videos. 

So, video production also needs real-time computing performance, which PPU cores can provide while being less power-hungry than the traditional CPU cores we have today.

05:37 Episode 4: PPU Adaptability and Integration

How does the PPU's seamless integration with existing CPU architectures and tools facilitate its broad applicability?

PPUs and Front-End Integration

JÖRG KELLER: The PPUs do not exist alone—they have front-end cores, which are normal CPU cores. 

Using a well-known front-end instruction set architecture from current CPUs simplifies the move to a CPU-plus-PPU architecture.

Existing CPU code will execute immediately on the CPU cores, though without the performance boost from the PPU cores.

I already mentioned the availability of a binary-to-binary translator for existing executables. This can help speed up those applications because they can utilize both the CPU and PPU cores. 

Tools for PPU Application

JK: When we also have a compiler and IDE, we can work with existing source code. This allows for experimentation at different scales of time investment. 

We might start with automatic detection of parallelizable patterns by running the Flow compiler.

Options for Developer Investment

JK: With increasing time investment, this ranges from recoding some parts up to, at the extreme end, hand-tuned kernels written specifically for the PPUs.

Ease of Transition to PPU Technology

JK: We have a range of options here. 

The start comes with a low hurdle—you just have to switch to a new compiler and IDE, which makes the transition relatively straightforward, in my opinion. 

It’s not just a wish for the future or something we would like to have. It’s a technology that has already proven such transitions to be doable and feasible.

08:11 Episode 5: PPU Latency and Real-World Performance

How does the PPU's focus on throughput affect real-world performance, especially when latency is crucial?

Understanding Latency in Applications and Memory Access

JÖRG KELLER: When we talk about latency, we really mean two things. For a whole application, we mean whether its reaction time is real-time or not. 

For a single memory access, latency means how long that access takes.

In a classic CPU core, these two are connected—a slow memory access that cannot overlap with other computations slows down the application, so the reaction time goes up. 

Flow’s architecture is built to hide latencies through multiple fibers. Thus, a long memory access can be tolerated because its latency is hidden and does not translate into a long application reaction time.
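
A back-of-envelope model of that hiding, with illustrative numbers rather than Flow's: if a memory access takes L cycles and each fiber does w cycles of useful work between accesses, then roughly ceil(L / w) fibers keep a core busy while any one fiber waits.

    #include <stdio.h>

    int main(void)
    {
        int L = 200;                   /* assumed memory latency, cycles */
        int w = 20;                    /* assumed work per fiber between accesses */
        int fibers = (L + w - 1) / w;  /* ceil(L / w) */
        printf("~%d fibers hide a %d-cycle latency (%d cycles work each)\n",
               fibers, L, w);
        return 0;
    }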

Latency Decoupling with PPUs

JK: For latency, then, PPUs decouple memory access latency from the application’s reaction time. I think this is a good thing.

When we switch control from CPU cores to PPU cores by running fibers, there is some overhead. However, this overhead is much smaller than the overhead incurred when starting additional threads on a classic multi-core CPU. 

PPU fibers share a large part of their state, and they all execute the same code.

Efficiency of Fibers vs. CPU Threads

JK: The startup time for fibers is much smaller than for CPU threads, and thus going parallel and handing over control to the PPUs is much faster. 

Because of this, it pays off for much smaller parts of the code that can be parallelized with PPUs but would not be worth parallelizing on CPU threads, where the overhead would be too large for such a small workload.
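
Flow's fiber startup cost is not public, but the CPU-thread side of the comparison is easy to check: the snippet below times pthread create-plus-join, which typically lands in the tens of microseconds per thread, so a parallel region with less work than that loses out on classic threads.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static void *noop(void *arg) { (void)arg; return NULL; }

    /* Measures thread create+join cost: the overhead a small parallel
     * region must amortize on a classic multi-core CPU. */
    int main(void)
    {
        enum { N = 1000 };
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            pthread_t th;
            pthread_create(&th, NULL, noop, NULL);
            pthread_join(th, NULL);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = ((t1.tv_sec - t0.tv_sec) * 1e9
                     + (t1.tv_nsec - t0.tv_nsec)) / 1e3 / N;
        printf("avg thread create+join: %.1f us\n", us);
        return 0;
    }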

10:51 Episode 6: PPU's Future Role in the Industry

How does the PPU address the challenge and opportunity of hardware diversification in the computing industry?

Introduction to Processor Diversification

JÖRG KELLER: After moving to multi-core for some years, we now see processors with different types of cores. Arm's big.LITTLE is an early example, where some cores are more powerful and also more power-hungry, while others are slower but more energy-efficient.

We also see CPUs that have additional specialized cores, like crypto cores doing AES encryption or SHA-3 hashing, to provide high performance for a very special purpose.

We also see all kinds of accelerators that provide computing power for a certain class of operations; these accelerators appear both on the die and externally. The most prominent among them are FPGAs and GPUs.

Challenges in Architecture Decisions

JK: Given this diversification, it's clear that manufacturers must make a somewhat difficult decision: what to put on a die.

How many power-hungry CPU cores, how many energy-efficient CPU cores? How many specialized CPU cores? 

In this light, Flow's architecture, with some CPU cores and a larger number of PPU cores, is not exotic but fits this scenario well: it has different types of cores, and the PPU cores are very efficient, both energy-wise and in covering a large class of applications.

Beyond this, the architecture tries to avoid a number of disadvantages that other variants bring. High-performance CPUs normally target a broad range of use cases.

So some accelerator or co-processor is usually needed, and since PPU cores can replace quite a number of those, the PPU is a candidate for consolidation in this area.

Beyond consolidating to a single type of accelerator, if you want to call it that, PPU cores offer simplified programming: in the end, they are programmed with fibers, much as we are used to programming classic multi-core CPUs.

So we avoid the exotic programming models of some classic accelerators.

Advantages of PPU Cores

JK: So PPU cores show promise to take this field to the next level and make it more accessible for a broad range of applications.

Introducing a new type of core is always difficult and so success naturally cannot be guaranteed. 

However, the PPU has at least one advantage: it comes at the right time.

The industry must do something, because on the one hand, the requirements for performance keep going up.

On the other hand, the operating frequency cannot be scaled up anymore for technology and energy reasons.

Timing and Industry Evolution

JK: Also, we cannot simply replicate more and more cores on a large multi-core die, because they simply won't fit.

So there has to be some way forward, and the PPU shows a promising path without being as exotic as a very unconventional type of accelerator.

So it might now be the right time to introduce the PPU to the industry and attract more attention than it might have done 10 years ago.

These ideas have been around for quite some time. The technology has already developed quite far; there are patents, and there have been prototypes.

So it's nothing that was just invented in the lab last year. On the other hand, 10 years ago, most people said, OK, we have four or six or eight cores on our die, and that's fine.

Only in certain scenarios, like embedded systems, did energy play the prominent role that it nowadays has for everything.

Also, 10 years ago, most processors were still in computers. Today, most processors are not in classic computers anymore but in everyday devices like smartphones.

So the scenario has changed a lot in those 10 years, but the types of cores we have did not really change. Even Arm's big.LITTLE, to cite this example again, looked like a large step, although you just had some faster cores and some slower cores; they had the same instruction set and basically the same capabilities, and that was all.

So the PPU goes much beyond that, because it's about integrated parallelization. And that is something I have not seen so far.

Curious to dive deeper into Flow Computing's PPU technology?

Explore the technical details and insights from Professor Keller's analysis in our full report. To request access, contact us at info@flow-computing.com, and we’ll be delighted to share it with you!
