Abstract
Compute-in-memory (CiM) has emerged as an attractive approach to reducing the high data-movement costs of von Neumann architectures. CiM can perform large-scale, highly parallel general matrix-matrix multiplication (GEMM) inside memory, and GEMM is a dominant computation in machine learning (ML) inference.
Reusing memory for computation raises three key questions: 1) Which type of CiM to use: with many analog and digital CiM variants, their suitability must be evaluated from a system perspective. 2) When to use CiM: ML inference workloads contain GEMMs with varied memory and compute characteristics, making it hard to determine when CiM outperforms conventional compute kernels. 3) Where to integrate CiM: each memory level offers different bandwidth and capacity, which affect data-transfer and locality benefits when integrating CiM.
This work explores how to answer these questions for integrating CiM to accelerate ML inference. We use Timeloop-Accelergy to perform early system-level evaluation of CiM prototypes, including analog and digital compute primitives. We integrate CiM into different levels of on-chip caches in a baseline architecture similar to an Nvidia A100 and customize dataflows for various ML workloads. Our experiments show that CiM architectures can improve energy efficiency: at INT-8 precision, energy per operation is reduced to as little as 0.12x of the baseline, and with weight interleaving and replication we observe performance improvements of up to 4x. This study provides insight into what type of CiM to use, and when and where to integrate it, to accelerate GEMM workloads.
Introduction
ML applications are now ubiquitous across domains such as automotive, healthcare, finance, and technology, driving demand for high-performance, energy-efficient ML hardware solutions.
Matrix-vector multiplication and general matrix-matrix multiplication (GEMM) are at the core of ML workloads such as convolutional neural networks and transformer networks. These computations are data intensive and incur high energy costs on von Neumann processors such as central processing units (CPUs) and graphics processing units (GPUs), because compute units are separated from storage, forcing expensive memory accesses and data movement. This separation is commonly referred to as the "memory wall" or the von Neumann bottleneck.
CiM has been proposed to address the memory wall by performing computation directly in memory to reduce expensive data transfers and provide higher energy efficiency.
Design Space: What, When, and Where
What type of CiM
CiM can be broadly classified into analog and digital approaches depending on the computation domain. Analog CiM performs multiply-accumulate (MAC) operations in the analog or mixed-signal domain inside memory arrays. Peripheral circuits such as digital-to-analog converters (DACs) and analog-to-digital converters (ADCs) are required to communicate across CiM blocks and to mitigate analog noise, and ADCs typically incur significant area, latency, and energy costs, increasing analog CiM overhead. In contrast, digital CiM performs computation in the digital domain by executing bitwise operations and integer multiplications. Multiple bitwise operations are required to produce the final MAC outputs, which can increase digital CiM latency. Other design choices—such as memory cell type (SRAM-6T/8T), the number of wordlines or bitlines enabled per access, and weight mapping schemes within memory arrays—also complicate determining the most efficient CiM primitive at the system level.
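To illustrate why digital CiM needs multiple bitwise operations per MAC, the following sketch (a simplification, not taken from any particular CiM macro; it assumes unsigned INT-8 operands) reconstructs a dot product purely from single-bit ANDs and shift-adds, the style of operation a bit-serial digital CiM array performs across its bitlines:

    # Illustrative sketch only: a dot product built from single-bit ANDs and
    # shift-adds, mimicking the bit-serial operation style of digital CiM.
    # Assumes unsigned INT-8 operands; real macros use different mappings.
    def bit_serial_dot(inputs, weights, bits=8):
        acc = 0
        for i in range(bits):                # input bit-plane
            for j in range(bits):            # weight bit-plane
                # bitwise partial products for this pair of bit-planes
                partial = sum(((x >> i) & 1) & ((w >> j) & 1)
                              for x, w in zip(inputs, weights))
                acc += partial << (i + j)    # shift-add into the accumulator
        return acc

    assert bit_serial_dot([3, 5], [7, 2]) == 3 * 7 + 5 * 2   # = 31

In this naive form, 8-bit operands imply 64 bit-plane combinations per MAC, which is the source of the latency overhead noted above.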
When to use CiM
ML models are composed of GEMMs of varying shapes and sizes. A GEMM of shape M x N x K multiplies an M x K input matrix by a K x N weight matrix to produce an M x N output matrix. Arithmetic intensity, defined as the ratio of arithmetic operations (FLOPs) to memory accesses (bytes), indicates how memory-bound a GEMM is. Figure 2 shows a roofline representation of GEMM performance versus arithmetic intensity. The figure illustrates that not all GEMMs require the full capability of a GPU, leading to underutilization of streaming multiprocessors (SMs). Offloading such GEMMs to CiM therefore has the potential to achieve performance comparable to standard compute paradigms. However, because GEMMs vary widely in their compute and memory demands, it remains unclear for which GEMM shapes CiM yields energy or performance advantages over the baseline.
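For concreteness, a minimal sketch of how this arithmetic intensity can be estimated for a GEMM of shape M x N x K, under the simplifying assumptions that each matrix is moved between main memory and the chip exactly once and that elements are stored at INT-8 (1 byte each):

    # Sketch: arithmetic intensity of an M x N x K GEMM, assuming each matrix
    # is transferred exactly once and 1 byte per element (INT-8).
    def arithmetic_intensity(M, N, K, bytes_per_elem=1):
        ops = 2 * M * N * K                                       # one multiply + one add per MAC
        bytes_moved = (M * K + K * N + M * N) * bytes_per_elem    # input, weight, output matrices
        return ops / bytes_moved

    print(arithmetic_intensity(16, 4096, 4096))    # small M: low intensity, memory-bound
    print(arithmetic_intensity(4096, 4096, 4096))  # square GEMM: high intensity, compute-bound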
Where to integrate CiM
Because GEMMs exhibit regular data-access patterns and provide both temporal and spatial locality, matrices are typically fetched from main memory into caches in tiles or smaller blocks. GPUs optimize their memory hierarchies to efficiently reuse tile data and execute GEMM in parallel across hundreds of tensor cores within SMs. CiM hardware can also exploit parallelism by enabling multiple columns and rows within memory arrays and leveraging parallel memory arrays for matrix multiplication. However, each memory level differs in bandwidth and capacity, which affects data reuse opportunities and the degree of parallelism when offloading computations to CiM. Therefore, locating CiM at a memory level that maximizes locality and CiM benefit is critical.
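A minimal sketch of the tiled access pattern described above is given below; the tile sizes TM, TN, TK are hypothetical placeholders that would, in practice, be chosen to fit the capacity of the target memory level:

    # Sketch: blocked GEMM. Each A and B tile loaded into a cache level is
    # reused across an entire TM x TN output block before being evicted.
    # Accumulation is done in float32 here purely for simplicity.
    import numpy as np

    def tiled_gemm(A, B, TM=32, TN=32, TK=32):
        M, K = A.shape
        _, N = B.shape
        C = np.zeros((M, N), dtype=np.float32)
        for m0 in range(0, M, TM):
            for n0 in range(0, N, TN):
                for k0 in range(0, K, TK):
                    C[m0:m0+TM, n0:n0+TN] += A[m0:m0+TM, k0:k0+TK] @ B[k0:k0+TK, n0:n0+TN]
        return C

These same tiles are what a CiM-enabled memory level would hold and compute on in place, so the level chosen constrains both the tile size and the available parallelism.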
Methodology
To evaluate CiM advantages relative to general-purpose processors, we consider a range of workload specifications, memory levels, and CiM characteristics. Selecting an optimal dataflow for a given specification is crucial for achieving high performance and energy efficiency. Optimal dataflows efficiently schedule and allocate GEMM work on the available hardware resources, reducing memory accesses and improving data reuse. Algorithmic data reuse for GEMM can be computed as the number of MAC operations divided by the total matrix size, but observed data reuse depends on the chosen dataflow and the actual number of memory accesses.
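In the notation of the previous section, a minimal sketch of this algorithmic upper bound and of the reuse observed under a chosen dataflow (memory_accesses stands for whatever access count the evaluation reports; the name is illustrative):

    # Sketch: algorithmic (upper-bound) data reuse vs. reuse observed under a dataflow.
    def algorithmic_reuse(M, N, K):
        macs = M * N * K
        footprint = M * K + K * N + M * N   # total elements of the three matrices
        return macs / footprint

    def observed_reuse(M, N, K, memory_accesses):
        # memory_accesses: element accesses actually made under the chosen dataflow
        return (M * N * K) / memory_accesses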
We analyze and evaluate SRAM-based analog and digital CiM primitives at the register file (RF) and shared memory (SMem) levels of a baseline architecture similar to an Nvidia A100. For each CiM architecture and GEMM shape, we search for the optimal dataflow to maximize the performance and energy improvements from CiM. From an energy and performance perspective, we then map which GEMM shapes benefit from which CiM type, when, and at which memory level.