Although there are many theoretical approaches to implementing a modern GPU, practical solutions require a clear understanding of the problem and a route to turning designs into silicon. The challenges of building modern high-performance semiconductor devices and accelerating currently programmable rasterization workflows reveal likely directions for GPU hardware development.
Key challenges in ray tracing
Real-time ray-tracing acceleration has been regarded for the past decade and a half as one of the most troublesome problems for GPU designers. DirectX Raytracing (DXR), the mainstream API, specifies an execution model that does not map naturally onto traditional GPU execution patterns, creating serious design challenges for any GPU that needs to support it.
If you follow the DXR model and consider what must be implemented in a GPU to provide accelerated performance, three major problems emerge regardless of the chosen architecture. First, you need a way to generate and manage acceleration data structures that represent scene geometry for efficient ray traversal. Second, when tracing rays the GPU must test ray/geometry intersections and provide programmable interfaces for user-defined behavior. Third, traced rays can spawn new rays. DXR specifies additional details, but these three factors are the most important at a system level.
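The second requirement, testing ray/geometry intersections during traversal, rests on a well-known primitive: the slab-method test of a ray against an axis-aligned bounding box, the node shape used in a typical BVH acceleration structure. The sketch below is illustrative only (the function name and calling convention are assumptions, not part of DXR); it shows the inner-loop work the hardware must accelerate.

```python
# Illustrative sketch: slab-method ray/AABB intersection, the core test
# performed at every node while traversing a BVH-style acceleration
# structure. Assumes axis-aligned boxes; inv_dir is the precomputed
# per-axis reciprocal of the ray direction (use float("inf") for zero
# components).

def ray_aabb_hit(origin, inv_dir, box_min, box_max):
    """Return True if the ray intersects the box for some t >= 0."""
    t_near, t_far = 0.0, float("inf")
    for o, inv_d, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0 = (lo - o) * inv_d
        t1 = (hi - o) * inv_d
        if t0 > t1:
            t0, t1 = t1, t0          # order the slab entry/exit distances
        t_near = max(t_near, t0)     # latest entry across all slabs
        t_far = min(t_far, t1)       # earliest exit across all slabs
    return t_near <= t_far
```

A real traversal unit evaluates many such tests per cycle across parallel rays, which is exactly where the memory-access patterns discussed below become critical.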
Why this is different from rasterization
Modern GPUs are designed to exploit spatial and temporal locality in DRAM accesses. Rasterization workloads, especially pixel shading, commonly benefit from locality: triangles and the pixels they cover are processed together and frequently share data, so cache lines fetched for one group of pixels are likely to be reused by an adjacent group. GPU hardware and memory systems are optimized around this locality assumption.
Ray tracing breaks these locality assumptions. Ray tracing models light propagation from all light sources, so rays may hit any surface in the scene. Surface behavior can vary: some surfaces reflect or scatter light uniformly, some absorb light entirely, and some spawn secondary rays randomly. Only when all parallel rays happen to strike similar geometry and materials does the locality model hold.
Because parallel rays can differ in where they traverse the acceleration structure, what geometry they intersect, and whether they spawn new rays, divergence in behavior is more severe than typical divergence problems in geometric or pixel processing. This disparity has deep implications for mapping ray tracing onto existing GPU execution models and often causes memory accesses to become a critical bottleneck.
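The cost of this divergence can be made concrete with a simple lockstep model: a SIMT group runs for as many traversal steps as its slowest lane, so lanes that finish early sit idle. The model below is a deliberate simplification (the function and the step counts are assumptions for illustration, not measurements of any real GPU).

```python
# Simplified SIMT lockstep model: a group of parallel rays executes for
# max(steps) iterations, and a lane is useful only while its own ray is
# still traversing. Utilization = useful lane-steps / total lane-steps.

def simt_utilization(steps_per_ray):
    """Fraction of lane-steps doing useful work in one lockstep group."""
    lanes = len(steps_per_ray)
    longest = max(steps_per_ray)
    return sum(steps_per_ray) / (lanes * longest)
```

A coherent bundle where every ray takes the same number of steps scores 1.0; a bundle in which one ray traverses far deeper than its neighbors drags utilization down sharply, which is the effect coherence gathering tries to avoid.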
Surface-related issues
Consider a typical scene: rays from light sources interact with many different surfaces, each with potentially distinct scattering, absorption, or emission properties. A rendering pipeline must handle all these cases. Only the simplest scenario, in which the rays striking a scene share similar surface types and behaviors, can benefit from the cache-friendly access patterns GPUs rely on.
Coherence gathering
One approach to reducing the divergence problem is to maintain hardware-level structures that gather rays with similar traversal state and direction so they can be processed together. Hardware can keep a hierarchical storage of rays submitted by software and select and group them by their direction and their current position in the acceleration structure. Grouping rays in this way increases the likelihood that they will access the same cached acceleration data during intersection tests and maximizes the number of ray/geometry intersection computations that can be executed in parallel.
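The grouping policy described above can be modeled in software as a binning step. The sketch below is a hypothetical model, not a description of any shipping hardware: it assumes rays are binned by coarse direction class (octant of the direction vector) together with their current node in the acceleration structure, so rays in the same bin are likely to touch the same cached data.

```python
# Hypothetical software model of coherence gathering. Assumed policy:
# group rays by (direction octant, current acceleration-structure node).
# Real hardware grouping policies are more elaborate and proprietary.
from collections import defaultdict

def gather(rays):
    """rays: iterable of (ray_id, (dx, dy, dz), node_id) tuples.

    Returns a dict mapping (octant, node_id) -> list of ray ids that
    can be intersection-tested together against the same node data.
    """
    bins = defaultdict(list)
    for ray_id, (dx, dy, dz), node_id in rays:
        octant = (dx >= 0, dy >= 0, dz >= 0)  # 8 coarse direction classes
        bins[(octant, node_id)].append(ray_id)
    return bins
```

Each bin can then be dispatched as one batch to the intersection units, so the acceleration data for that node is fetched once and reused across the whole batch.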
By analyzing and scheduling rays in hardware, the system can group them in a GPU-friendly way for subsequent processing, helping to preserve the execution patterns GPUs were optimized for in rasterization. This reduces the need for specialized memory systems and eases integration with the rest of the GPU.
The coherence gathering mechanism must perform fast traversal, sorting, and scheduling of rays submitted to the hardware without backpressuring the ray-emission scheduler and without leaving downstream intersection units idle. These requirements make the hardware subsystem relatively complex: it must balance rapid ray reordering with low-latency throughput.
Absent a hardware mechanism, coherence handling would need to be pushed to the host or implemented as an intermediate compute stage on the GPU, assuming the hardware supports such a stage. In practice, these software or compute-based approaches have not demonstrated clear efficiency gains on real-time hardware platforms.
Industry context and compatibility
Efforts to address these challenges have been underway for some time, and ray tracing has grown into a widely adopted capability across modern graphics systems.
A coherence-gathering implementation designed to be compatible with current ray-tracing APIs must handle cases where rays spawn new rays, where stacks are released, and where per-stage gathering is required. Performing coherence gathering at each stage helps realize the performance potential of hardware ray tracing while maintaining compatibility with existing API behaviors.
Common metrics used to characterize hardware ray-tracing performance include ray-bundle size, peak parallel traversal rate, empty-ray emission, and miss rate. These metrics are simple indicators but do not by themselves capture developer needs, since developers care about the utility of rays across the accelerated system, not just peak traversal numbers.
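One way to move from raw counters toward the developer-facing view is to fold them into a single ratio of useful rays. The sketch below is an assumption for illustration: the counter names and the formula are not taken from any vendor's tooling, and real analyses weight rays by the image quality they contribute.

```python
# Hedged example: combining raw ray counters into a "useful-ray" ratio.
# emitted: total ray slots dispatched; empty: slots emitted with no ray;
# missed: rays that hit nothing. The remainder produced useful hits.

def useful_ray_ratio(emitted, empty, missed):
    """Fraction of emitted ray slots that produced a useful hit."""
    return (emitted - empty - missed) / emitted
```

A system with a high peak traversal rate but a low useful-ray ratio spends much of its bandwidth on rays that contribute nothing to the image, which is why peak numbers alone are a poor guide.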
The goal of a coherent hardware ray-tracing system is to enable comprehensive ray-based workloads across the acceleration system so developers can budget ray counts for useful features. A hardware-level coherence gathering mechanism contributes to that goal by improving memory locality and parallel intersection efficiency within the GPU execution model.