All systems face limitations, and as one limitation is removed, another is revealed that had remained hidden. It is highly likely that this game of Whac-A-Mole will play out in AI systems that employ high-bandwidth memory (HBM).
Most systems are limited by memory bandwidth. Memory interface performance has, in general, barely kept pace with gains in compute performance, so memory transfer rates remain the bottleneck. This, in turn, limits the amount of computation that can be performed usefully. If that bottleneck is relieved to any significant degree, the total amount of computation possible increases, power issues rise to the surface, and the heat being generated may move to the front of the problem space.
The relationship between compute and memory is so ingrained that as an industry we rarely question it. Does cache improve the power/performance aspects of a program? We expect the answer to always be yes, but that is not always correct. The benefits of cache assume certain memory access behaviors that are not universally true. Logic simulation is one example where cache has been shown to slow down performance when the cache is smaller than the in-memory representation of the design. This is certainly not the only example.
Over the years, memory development has carried forward many of these assumptions about data access patterns. For example, DDR standards have continued to increase the block size fetched with each access, because that is the only way to increase total bandwidth.
Part of this concentration on the interface is meant to overcome fundamental limitations in the memory itself. “DRAM remains an analog device, and the timing parameters that we deal with in the interior of the DRAM device are still very similar to devices that existed 20 years ago,” says Marc Greenberg, group director for product marketing in the Cadence IP Group. “What we have done is to change the physical layer to make that faster.”
If the memory itself cannot be made faster, you have to create multiple banks that are read in parallel and then transfer that data as quickly as possible. This assumes data locality and when that is not the case, these memory transfers become increasingly costly. This is one of the reasons people are talking about processing near memory, because it can sharply reduce the amount of data that needs to be transferred.
But this concept is also applicable if data can be brought closer to processing and the access sizes aligned to the problem. “The challenge is that it has to make business sense for folks who want to try and adopt it,” says Steven Woo, fellow and distinguished inventor at Rambus. “It also has to make technical sense in terms of being able to either rewrite applications or convert applications that you have today. It is not necessarily as straightforward to take existing applications and translate them to work on a new kind of architecture that would allow that to happen. In principle it is a good thing to do, but in many cases, it is processing that is the smallest piece and not the data. The challenge is that you have an infrastructure that you are already working with that doesn’t do that. There is also the question of how willing the industry is to support that kind of model?”
Many of these limitations do not exist when we are dealing with new application areas because existing solutions are less entrenched. One example of this is AI applications, where the compute/memory equation can be changed by the adoption of high-bandwidth memory (HBM).
HBM offers no fundamental change in the underlying memory technology. HBM, at its core, is DRAM. It thus suffers from all of the same limitations and problems as DRAM accessed over DDR, with a few additional negatives.
- Heat: DRAM hates heat, which causes its operation to become less predictable. With an HBM solution, DRAM is moved closer to the main heat generators – the processors. This problem is so acute that even though HBM was originally conceived as a 3D stacking technology, with the memory placed on top of the processor die, that idea had to be shelved because of the thermal problems it caused. Thus a 2.5D packaging solution became the path forward.
- Capacity: HBM capacity is very limited compared to DRAM accessed through DDR. While HBM is gaining in capacity, it can never catch up to external memory because external memory is capable of utilizing every advancement made within the package, as well.
- Cost: HBM requires an interposer or bridge, and this is still relatively new technology. An interposer requires the fabrication of what is basically a PCB in silicon.
We need to dive a little into HBM to understand some of the limitations. “The way HBM works is that it is a fairly fixed configuration,” says Brett Murdock, senior product marketing manager for Synopsys. “It is not like a standard DDR interface, where you can have multiple channels of DDR and multiple ranks and you can build your system however you want. The way that HBM is defined is cubes, which are rigid. You get one cube and it is either 4, 8, or 12 devices high, and with HBM3 it will add 16 devices high. In that cube you get a defined number of channels — either 16 data channels that are 128 bits wide, or 32 data channels that are 64 bits wide. They call it a pseudo channel when they drop down to 64-bit width. So you have a certain number of data channels.”
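As a rough sanity check on that configuration, a cube's peak bandwidth is just its total data pins times the per-pin data rate. A minimal sketch, where the helper function is purely illustrative and the 3.2Gb/s/pin figure is HBM2e's standard data rate:

```python
# Peak bandwidth of one HBM cube, following the channel configurations
# described above. The helper is illustrative, not a vendor API.

def hbm_bandwidth_gbytes(num_channels, channel_width_bits, gbps_per_pin):
    """Peak cube bandwidth in GB/s: total data pins x per-pin rate."""
    total_pins = num_channels * channel_width_bits
    return total_pins * gbps_per_pin / 8  # bits -> bytes

# 16 channels x 128 bits and 32 pseudo channels x 64 bits use the same
# 2048 data pins, so peak bandwidth is identical -- pseudo channels
# re-divide the pins, they do not add any.
full = hbm_bandwidth_gbytes(16, 128, 3.2)
pseudo = hbm_bandwidth_gbytes(32, 64, 3.2)
print(full, pseudo)  # 819.2 819.2
```

The point of the pseudo-channel mode is finer-grained accesses, not more raw bandwidth.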
Plus, you cannot add an arbitrary number of cubes. “Capacity is limited by a couple of things,” says Greenberg. “The physical layer goes through a silicon interposer, and the length of that is limited to a few millimeters. You could perhaps stretch that further if you had to. All implementations to date limit the physical distance of that interface, so that limits the number of HBM dies that you can fit around the SoC. In an extreme case that is 8, although I have not yet seen an application with more than 4 stacks of HBM. So you cannot get the same density with HBM as you could with DDR. It will be many years (or maybe never) before HBM capacities can rival the capacity of DDR.”
Applications that rely on huge amounts of data therefore have no choice but to stick with DRAM being accessed through a DDR interface.
Bandwidth and power
Memory transfer often accounts for a majority of the power consumed by a system. However, it does not get as much attention because that power is not consumed within the die itself, and so it has not received quite the same level of analysis and mitigation.
Looking at some of the numbers, GDDR5X is twice as fast as standard GDDR5 memory, and in the future it is expected to achieve speeds up to 16Gb/s, offering bandwidth up to 72GB/s.
The other memory standards are not standing still. “We have LPDDR5 that was released by JEDEC earlier this year, and DDR5 will be released shortly,” says Vadhiraj Sankaranarayanan, technical marketing manager at Synopsys. “These memories are taking the speeds to a higher level than their predecessors. For LPDDR4 and LPDDR4X on the mobile side, the top speed is 4267Mb/s, and LPDDR5 will take that to 6400Mb/s. Similarly, for the enterprise server market, DDR4, which is the de facto memory technology, runs up to 3200Mb/s, while DDR5 will have a max speed of 6400Mb/s. So both LPDDR5 and DDR5 will have a max speed of 6400Mb/s, and that is a considerable speed increase.”
How does HBM stack up? “Today, the fastest HBM systems run at 3200Mb/s, which is the HBM2e standard data rate,” says Synopsys’ Murdock. “SK Hynix has made a public announcement that it is supporting HBM2e at 3600Mb/s. The next standard, HBM3, due in 2022, will have data rates up to 6400Mb/s. There is a lot of runway for HBM to go a lot faster. It started off at a slow data rate because it was a new and unproven technology, and because it is more expensive, it is still more of a niche technology than the others.”
Huge advantage for HBM
Having said that, HBM has a huge advantage over external memory. Access times, and the power required to perform those accesses, are a fraction of the values found for external memory. AMD estimates that GDDR5 can provide 10.66GB/s of bandwidth per watt, while HBM can achieve more than 35GB/s per watt.
Today’s HBM2 products, with 4GB or 8GB capacities, provide 307GB/s of data bandwidth, compared to 85.2GB/s with four DDR4 DIMMs. That is already considerably in excess of GDDR5X expectations. The next version, HBM3, has 4Gb/s transfer rates with 512GB/s of bandwidth.
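Those per-watt estimates make for a simple worked comparison. A back-of-the-envelope sketch, using AMD's efficiency figures and the 307GB/s bandwidth cited above as an assumed target:

```python
# Rough power needed to sustain a fixed bandwidth target at the
# bandwidth-per-watt estimates quoted above (AMD's figures).
GDDR5_GB_PER_WATT = 10.66
HBM_GB_PER_WATT = 35.0

target_gbs = 307.0  # GB/s, the HBM2 bandwidth cited in the text

gddr5_watts = target_gbs / GDDR5_GB_PER_WATT
hbm_watts = target_gbs / HBM_GB_PER_WATT
print(f"GDDR5: {gddr5_watts:.1f} W  HBM: {hbm_watts:.1f} W")
# GDDR5: 28.8 W  HBM: 8.8 W
```

At the same delivered bandwidth, the interface power differs by roughly 3x in HBM's favor, which is the advantage the article is describing.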
In addition, the memory configuration is different. The different organization of accesses can be utilized for further gain. Data sizing is important for many applications. “GDDR has smaller channels – 32 bits compared to HBM that is 128 bits,” says Synopsys’ Sankaranarayanan. “For matrix multiplication and other applications where you have a large amount of streaming data, HBM would be more effective than GDDR because you can get the data in a contiguous fashion. To get the same bandwidth, comparing GDDR and HBM, you would need many GDDR DRAMs, and that translates into system-level complexity.”
That adds other issues. “The fun part, and this is the system designers challenge, is how to best use the channels or pseudo channels in the system,” adds Synopsys’ Murdock. “They have to work out how to handle the interleaving between them to maximize the efficiency of the memory.”
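A minimal sketch of what such interleaving might look like. The low-order-bit mapping and the 32-byte access size are illustrative choices for this example, not anything mandated by the HBM standard:

```python
# Sketch: spreading a byte-address stream across HBM pseudo channels.
# The mapping scheme and access granularity are illustrative only.
from collections import Counter

NUM_CHANNELS = 32   # pseudo channels, 64 bits wide each
ACCESS_BYTES = 32   # bytes handled per access (assumed for illustration)

def map_address(addr):
    """Map a byte address to (pseudo channel, offset within channel)."""
    block = addr // ACCESS_BYTES
    channel = block % NUM_CHANNELS  # low-order interleave
    offset = (block // NUM_CHANNELS) * ACCESS_BYTES + addr % ACCESS_BYTES
    return channel, offset

# A contiguous streaming read spreads evenly over every channel:
hits = Counter(map_address(a)[0] for a in range(0, 4096, ACCESS_BYTES))
print(sorted(hits.values()))  # every channel hit the same number of times
```

Low-order interleaving maximizes parallelism for streaming accesses; strided or pointer-chasing workloads can defeat it, which is exactly the efficiency question the designer has to work out.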
The new problem
How does this create a new problem? The static power draw of the DRAM core remains essentially the same and increases with capacity. HBM is more power-efficient in terms of bits per watt, but it enables much higher transfer rates, so total power and energy may increase significantly if the memory is being used at full bandwidth.
Then we look at the impact on compute. “The major challenge with most designs is that they can put in lots of compute, and they can add more compute, more parallelism, but it is actually a compute and memory problem,” says Ron Lowman, strategic marketing manager at Synopsys. “Systems have been constrained by bandwidth and fighting that bottleneck to memory.”
So what happens when that limitation goes away? “HBM provides an unprecedented amount of bandwidth between the CPU and memory,” says Cadence’s Greenberg. “HBM2E provides 2.4Tbit/s of bandwidth, with further specification enhancements on the horizon. By using an interposer-based technology, the energy-per-bit is kept low, but power — the product of energy-per-bit and number of bits transferred per second — may be relatively high at terabits-per-second transmission rates.”
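Greenberg's arithmetic can be sketched directly. The energy-per-bit figures below are assumed ballpark values for illustration, not numbers from the article:

```python
# Power is the product of energy-per-bit and bits transferred per second.
# Handy unit fact: 1 pJ/bit x 1 Tbit/s = 1 W, so the exponents cancel.
def interface_power_watts(pj_per_bit, tbits_per_s):
    """Interface power in watts from pJ/bit and Tbit/s."""
    return pj_per_bit * tbits_per_s

# Assumed ballpark energies at HBM2E's 2.4 Tbit/s bandwidth:
hbm_like = interface_power_watts(4.0, 2.4)   # assumed ~4 pJ/bit for interposer I/O
ddr_like = interface_power_watts(15.0, 2.4)  # assumed ~15 pJ/bit for comparison
print(hbm_like, ddr_like)  # 9.6 36.0
```

Even a low energy-per-bit adds up to a meaningful number of watts at terabit-per-second rates, which is Greenberg's point.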
This starts to cause some new problems. “One of these includes accounting for the power noise impact from HBM I/Os,” says Calvin Chow, senior area technical manager at ANSYS. “Even though the power per pin is lower, there are many more I/Os firing in parallel, resulting in significant increase in current consumption. Though the signal traces are shorter, there is still a noise concern due to the simultaneous switching of a large number of I/Os.”
The increase in transfer rates means that processors can be kept busy more of the time. “There is the matrix multiplication part of it, but there is also some vector processing that is needed,” says Synopsys’ Lowman. “It is a heterogenous compute environment, so there are different types of processors that are required in these chipsets. Minimizing the passing of data lowers the power. Playing around with architectural exploration is helpful.”
Most AI chips today rely on built-in SRAM, and those chips are limited by reticle size. If sufficient bandwidth could be obtained from external memory, such as that offered by HBM, SRAM could be reduced and more processing could be included in the space that is freed up.
It will be necessary to find a new balance point between compute and memory bandwidth, and that becomes a system-level design problem. Many of the problems will be similar, but the scales are different. “Doing the necessary analysis becomes a challenge,” says Karthik Srinivasan, senior product manager for ANSYS. “One of the biggest challenges will be simulation capacity. When we are talking about GDDR, a channel is 32 or 64 bits wide, whereas in HBM you are looking at a 128-bit channel for each stack. You have to simulate all the signal traces along with all of the power delivery network, and this traverses from one die through silicon vias to the interposer traces to the parent logic die. The simulation needs to have an elegant workflow in order to construct the entire channel, and then you need the capacity to actually do the necessary simulations and ensure that you have no signal integrity issues.”
HBM creates a bright future. “HBM development will continue on an evolutionary path as the technology continues to mature,” says Wendy Elsasser, distinguished engineer in Arm’s research group. “With performance and capacity improvements, HBM will be an enabler for state-of-the-art ML and analytics accelerators, as well as a contender in other markets like HPC. Managing power (optimally, power neutrality) and thermal dissipation will continue to be a focus, as well as defining a robust RAS (reliability, availability, serviceability) solution for high-data-integrity use cases.”
HBM delivers considerably more bandwidth than any previous memory system, and at power/bit levels that could be an order of magnitude better than external DDR memory systems. How systems will make use of this new capability remains unknown at this point.
It also will enable a significant increase in total compute throughput, but this will come at the expense of total power in both the memory subsystem and from compute. Keeping these systems fed with enough power, and cool enough to ensure a safe operating environment for the DRAM memory, may become a challenge.