Primitive Ordered Pixel Shading
===============================
Primitive Ordered Pixel Shading (POPS) is the feature available starting from
GFX9 that provides the Fragment Shader Interlock or Fragment Shader Ordering
functionality.
It allows a part of a fragment shader — an ordered section (or a critical
section) — to be executed sequentially in rasterization order for different
invocations covering the same pixel position.
This article describes how POPS is set up in shader code and the registers. The
information here is currently provided for architecture generations up to GFX11.
Note that the information in this article is **not official** and may contain
inaccuracies, as well as incomplete or incorrect assumptions. It is based on the
shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage
in Direct3D shaders, AMD's Platform Abstraction Library (PAL), ISA references,
and experimentation with the hardware.
Shader code
-----------
With POPS, a wave can dynamically execute up to one ordered section. It is
fine, however, for a wave not to enter an ordered section at all if it doesn't
need ordering on its execution path.
The setup of the ordered section consists of three parts:

1. Entering the ordered section in the current wave — awaiting the completion of
   ordered sections in overlapped waves.
2. Resolving overlap within the current wave — intrawave collisions (optional
   and GFX9-10.3 only).
3. Exiting the ordered section — resuming overlapping waves trying to enter
   their ordered sections.

GFX9-10.3: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Awaiting the completion of ordered sections in overlapped waves is performed by
setting the POPS packer hardware register, and then polling the volatile
``pops_exiting_wave_id`` ALU operand source until its value exceeds the newest
overlapped wave ID for the current wave.
The information needed for the wave to perform the waiting is provided to it via
the SGPR argument ``COLLISION_WAVEID``. Its loading needs to be enabled in the
``SPI_SHADER_PGM_RSRC2_PS`` and ``PA_SC_SHADER_CONTROL`` registers (note that,
unlike various other arguments, the POPS arguments specifically need to be
enabled not only in ``RSRC``, but in ``PA_SC_SHADER_CONTROL`` as well).
The collision wave ID argument contains the following unsigned values:

* [31]: Whether overlap has occurred.
* [29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated
  with.
* [25:16]: Newest overlapped wave ID.
* [9:0]: Current wave ID.
The 2020 RDNA and RDNA 2 ISA references contain incorrect offsets and widths of
these fields, possibly from an early development iteration, but their meanings
are described accurately there.
The wait must not be performed if the "did overlap" bit 31 is set to 0,
otherwise it will result in a hang. Also, the bit being set to 0 indicates that
there are *both* no wave overlap *and no intrawave collisions* for the current
wave — so if the bit is 0, it's safe for the wave to skip all of the POPS logic
completely and execute the contents of the ordered section simply as usual with
unordered access as a potential additional optimization. The packer hardware
register, however, may safely be set even without overlap — it's only the wait
loop itself that must not be executed if no overlap was reported.
The packer ID needs to be passed to the packer hardware register using
``s_setreg_b32`` so the wave can poll ``pops_exiting_wave_id`` on that packer.
On GFX9, the ``MODE`` (1) hardware register has two bits specifying which packer
the wave is associated with:

* [25]: The wave is associated with packer 1.
* [24]: The wave is associated with packer 0.
Initially, both of these bits are set to 0, meaning that POPS is disabled for the
wave. If the wave needs to enter the ordered section, it must set bit 24 to 1 if
the packer ID in ``COLLISION_WAVEID`` is 0, or set bit 25 to 1 if the packer ID
is 1.
Starting from GFX10, the ``POPS_PACKER`` (25) hardware register is used instead,
containing the following fields:

* [2:1]: Packer ID.
* [0]: POPS enabled for the wave.
Initially, POPS is disabled for a wave. To start entering the ordered section,
bits 2:1 must be set to the packer ID from ``COLLISION_WAVEID``, and bit 0 needs
to be set to 1.
The wave IDs, both in ``COLLISION_WAVEID`` and ``pops_exiting_wave_id``, are
10-bit values wrapping around on overflow — consecutive waves are numbered 1022,
1023, 0, 1… This wraparound needs to be taken into account when comparing the
exiting wave ID and the newest overlapped wave ID.
Specifically, until the current wave exits the ordered section, its ID can't be
smaller than the newest overlapped wave ID or the exiting wave ID. So
``current_wave_id + 1`` can be subtracted from 10-bit wave IDs to remap them to
monotonically increasing unsigned values. In this case, the largest value,
0xFFFFFFFF, will correspond to the current wave, 10-bit values up to the current
wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from
before the last wraparound will be near 0 increasing away from it. Subtracting
``current_wave_id + 1`` is equivalent to adding ``~current_wave_id``.
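
As a worked example (with purely illustrative numbers): if the current 10-bit
wave ID is 2, the remap offset is ``~2 = 0xFFFFFFFD``, i.e. ``-3``::

   // Monotonic IDs after adding the offset 0xFFFFFFFD to the 10-bit IDs:
   //   wave    2 (current)               -> 0xFFFFFFFF (the largest value)
   //   wave    1                         -> 0xFFFFFFFE
   //   wave    0                         -> 0xFFFFFFFD
   //   wave 1023 (before the wraparound) -> 0x000003FC (1020)
   //   wave 1022 (before the wraparound) -> 0x000003FB (1019)
   // A plain unsigned comparison of the remapped exiting wave ID against the
   // remapped newest overlapped wave ID now works across the wraparound.
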
GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit
newest overlapped wave ID is greater than the 10-bit current wave ID (meaning
that it's behind the last wraparound point), 1 needs to be added to the newest
overlapped wave ID before using it in the comparison. This was corrected in
GFX10.
The exiting wave ID (not to be confused with "exited" — the exiting wave ID is
the wave that will exit the ordered section next) is queried via the
``pops_exiting_wave_id`` ALU operand source, numbered 239. Normally, it will be
one of the arguments of ``s_add_i32`` that remaps it from a wrapping 10-bit wave
ID to a monotonically increasing one.
It's a volatile operand, and it needs to be read in a loop until its value
becomes greater than the newest overlapped wave ID (after remapping both to
monotonic). However, if it's too early for the current wave to enter the ordered
section, it needs to yield execution to other waves that may potentially be
overlapped — via ``s_sleep``. GFX9 requires a finite amount of delay to be
specified; AMD uses 3. Starting from GFX10, exiting the ordered section wakes up
the waiting waves, so the maximum delay of 0xFFFF can be used.
In pseudocode, the entering logic would look like this::

   bool did_overlap = collision_wave_id[31];
   if (did_overlap) {
      if (gfx_level >= GFX10) {
         uint packer_id = collision_wave_id[29:28];
         s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1));
      } else {
         uint packer_id = collision_wave_id[28];
         s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01);
      }

      uint current_10bit_wave_id = collision_wave_id[9:0];
      // Or -(current_10bit_wave_id + 1).
      uint wave_id_remap_offset = ~current_10bit_wave_id;

      uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16];
      if (gfx_level < GFX10 &&
          newest_overlapped_10bit_wave_id > current_10bit_wave_id) {
         ++newest_overlapped_10bit_wave_id;
      }
      uint newest_overlapped_wave_id =
         newest_overlapped_10bit_wave_id + wave_id_remap_offset;

      while (!(src_pops_exiting_wave_id + wave_id_remap_offset >
               newest_overlapped_wave_id)) {
         s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3);
      }
   }

The SPIR-V fragment shader interlock specification requires an invocation — an
individual invocation, not the whole subgroup — to execute
``OpBeginInvocationInterlockEXT`` exactly once. However, if there are multiple
begin instructions, or even multiple begin/end pairs, under divergent
conditions, a wave may end up waiting for the overlapped waves multiple times.
Thankfully, it's safe to set the POPS packer hardware register to the same
value, or to run the wait loop, multiple times during the wave's execution, as
long as the ordered section isn't exited in between by the wave.
GFX11: Entering the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Instead of exposing wave IDs to shaders, GFX11 uses the "export ready" wave
status flag to report that the wave may enter the ordered section. It's awaited
by the ``s_wait_event`` instruction, with bit 0 ("don't wait for
``export_ready``") of the immediate operand set to 0. On GFX11 specifically, AMD
passes 0 as the whole immediate operand.
The "export ready" wait can be done multiple times safely.

GFX9-10.3: Resolving intrawave collisions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
On GFX9-10.3, it's possible for overlapping fragment shader invocations to be
placed not only in different waves, but also in the same wave, with the shader
code making sure that the ordered section is executed for overlapping
invocations in order.
This functionality is optional — it can be activated by enabling loading of the
``INTRAWAVE_COLLISION`` SGPR argument in ``SPI_SHADER_PGM_RSRC2_PS`` and
``PA_SC_SHADER_CONTROL``.
The lower 8 or 16 (depending on the wave size) bits of ``INTRAWAVE_COLLISION``
contain a mask indicating, for each quad in the wave, whether it starts a new
layer of overlapping invocations, and thus whether its ordered section code
needs to be executed only after running it for all lanes with indices below
that quad index multiplied by 4. The rest of the bits in the argument need to
be ignored — AMD
explicitly masks them out in shader code (although this is not necessary if the
shader uses "find first 1" to obtain the start of the next set of overlapping
quads or expands this quad mask into a lane mask).
For example, if the intrawave collision mask is 0b0000001110000100, or
``(1 << 2) | (1 << 7) | (1 << 8) | (1 << 9)``, the code of the ordered section
needs to be executed first only for quads 1:0 (lanes 7:0), then only for quads
6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then for quad 8 (lanes 35:32),
and then for the remaining quads 15:9 (lanes 63:36).
This effectively causes the ordered section to be executed as smaller
"sub-subgroups" within the original subgroup.
However, this is not always compatible with the execution model of SPIR-V or
GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of
the shader in a loop may be unsafe in some cases. One particular example is when
the shader uses subgroup operations influenced by lanes outside the current
quad. In this case, the code outside and inside the ordered section may be
executed with different sets of active invocations, affecting the results of
subgroup operations. But in SPIR-V and GLSL, fragment shader interlock is not
supposed to modify the set of active invocations in any way. So the intrawave
collision loop may break the results of subgroup operations in unpredictable
ways, even outside the driver's compiler infrastructure. Even if the driver
splits the subgroup exactly at ``OpBeginInvocationInterlockEXT`` and makes the
lane subsets rejoin exactly at ``OpEndInvocationInterlockEXT``, the application
and the compilers that created the source shader are still not aware of that
happening — the input SPIR-V or GLSL shader might have already gone through
various optimizations, such as common subexpression elimination which might
have considered a subgroup operation before ``OpBeginInvocationInterlockEXT``
and one after it equivalent.
The idea behind reporting intrawave collisions to shaders is to reduce the
impact on the parallelism of the part of the shader that doesn't depend on the
ordering, to avoid wasting lanes in the wave and to allow the code outside the
ordered section in different invocations to run in parallel lanes as usual. This
may be especially helpful if the ordered section is small compared to the rest
of the shader — for instance, a custom blending equation in the end of the usual
fragment shader for a surface in the world.
However, whether handling intrawave collisions is preferred is not a question
with one universal answer. Intrawave collisions are pretty uncommon without
multisampling, or when using sample interlock with multisampling, although
they're highly frequent with pixel interlock with multisampling, when adjacent
primitives cover the same pixels along the shared edge (though that's an
extremely expensive situation in general). But resolving intrawave collisions
adds some overhead costs to the shader. If intrawave overlap is unlikely to
happen often, or even more importantly, if the majority of the shader is inside
the ordered section, handling it in the shader may cause more harm than good.
GFX11 removes this concept entirely; instead, overlapping invocations are always
placed in different waves.

GFX9-10.3: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To exit the ordered section and let overlapping waves resume execution and enter
their ordered sections, the wave needs to send the ``ORDERED_PS_DONE`` message
(7) using ``s_sendmsg``.
If the wave has enabled POPS by setting the packer hardware register, it *must
not* execute ``s_endpgm`` without having sent ``ORDERED_PS_DONE`` once, so the
message must be sent on all execution paths after the packer register setup.
However, if the wave exits before having configured the packer register, sending
the message is not required, though it's still fine to send it regardless of
that.
Note that if the shader has multiple ``OpEndInvocationInterlockEXT``
instructions executed in the same wave (depending on a divergent condition, for
example), it must still be ensured that ``ORDERED_PS_DONE`` is sent by the wave
only once, and especially not before any awaiting of overlapped waves.
Before the message is sent, all counters for memory accesses that need to be
primitive-ordered, both writes and (in case something after the ordered section
depends on the per-pixel data, for instance, the tail blending fallback in
order-independent transparency) reads, must be awaited. Those may include
``vm``, ``vs``, and in some cases ``lgkm``. Normally, primitive-ordered memory
accesses are done through VMEM with divergent addresses rather than SMEM, as
there's no synchronization between fragments at different pixel coordinates;
however, it's still technically possible, even if pointless and suboptimal, for
a shader to perform them explicitly in a waterfall loop, for instance, and that
must work correctly too. Without awaiting the counters, a race condition will
occur when the newly resumed waves start accessing memory locations to which
there are still outstanding accesses in the current wave.
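
In pseudocode, a correct exit sequence (the exact set of counters to await
depends on the accesses actually performed in the ordered section) may look
like::

   // Make sure all primitive-ordered accesses have completed before letting
   // the overlapping waves access the same per-pixel data.
   s_waitcnt(vmcnt(0), lgkmcnt(0));  // plus the store counter (vscnt) on GFX10+
   // Resume the overlapping waves.
   s_sendmsg(ORDERED_PS_DONE);  // message 7
   // ... exports, remaining unordered code, s_endpgm ...
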
Another option for exiting is the ``s_endpgm_ordered_ps_done`` instruction,
which combines waiting for all the counters, sending the ``ORDERED_PS_DONE``
message, and ending the program. Generally, however, it's desirable to resume
overlapping waves as early as possible, including before the export, as it may
stall the wave for some time too.
GFX11: Exiting the ordered section in the wave
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The overlapping waves are resumed when the wave performs the last export (with
the ``done`` flag).
The same requirements for awaiting the memory access counters as on GFX9-10.3
still apply.
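
A sketch in the same style (the export notation is illustrative)::

   // Await the primitive-ordered accesses, then perform the final export; the
   // done flag is what resumes the overlapping waves on GFX11.
   s_waitcnt(vmcnt(0), lgkmcnt(0));  // plus the store counter (vscnt)
   export(mrt0, r, g, b, a, done = true);
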
Memory access requirements
^^^^^^^^^^^^^^^^^^^^^^^^^^
The compiler needs to ensure that entering the ordered section implements
acquire semantics, and exiting it implements release semantics, in the fragment
interlock memory scope for ``UniformMemory`` and ``ImageMemory`` SPIR-V storage
classes.
A fragment interlock memory scope instance includes overlapping fragment shader
invocations executed by commands inside a single subpass. It may be considered a
subset of a queue family memory scope instance from the perspective of memory
barriers.
Fragment shader interlock doesn't perform implicit memory availability or
visibility operations. Shaders must do them by themselves for accesses requiring
primitive ordering, such as via ``coherent`` (``queuefamilycoherent``) in GLSL
or ``MakeAvailable`` and ``MakeVisible`` in at least the ``QueueFamily`` scope
in SPIR-V.
On AMD hardware, this means that the accessed memory locations must be made
available or visible between waves that may be executed on any compute unit — so
accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag
and L1$ via DLC.
However, it should be noted that memory accesses in the ordered section may be
expected by the application to be done in primitive order even if they don't
have the GLC and DLC flags. Coherent access not only bypasses, but also
invalidates the lower-level caches for the accessed memory locations. Thus,
considering that normally per-pixel data is accessed exclusively by the
invocation executing the ordered section, it's not necessary for every read or
write of a given memory location in the ordered section to be GLC/DLC — just
the first read and the last write: it doesn't matter if per-pixel data is cached
in L0/L1 in the middle of a dependency chain in the ordered section, as long as
it's invalidated in them in the beginning and flushed to L2 in the end.
Therefore, optimizations in the compiler must not simply assume that only
coherent accesses need primitive ordering — and moreover, the compiler must also
take into account that the same data may be accessed through different bindings.
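
As an illustration (a sketch with made-up helper names, not an actual intrinsic
set), a read-modify-write of per-pixel data in the ordered section only needs
the cache bypass on its outermost accesses::

   begin_ordered_section();
   // First read of the location: bypass and invalidate L0 (GLC) and, on
   // GFX10+, L1 (DLC), so the data comes from the global L2.
   color = image_load(per_pixel_data, coord, GLC | DLC);
   // Intermediate work on the value may use the lower-level caches freely.
   color = blend(color, fragment_output);
   // Last write of the location: write through to L2 so that the next
   // overlapping invocation, possibly on another compute unit, sees it.
   image_store(per_pixel_data, coord, color, GLC | DLC);
   end_ordered_section();
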
Export requirements
^^^^^^^^^^^^^^^^^^^
With POPS, on all hardware generations, the shader must have at least one
export, though it can be a null or an ``off, off, off, off`` one.
Also, even if the shader doesn't need to export any real data, the export
skipping that was added in GFX10 must not be used, and some space must be
allocated in the export buffer, such as by setting ``SPI_SHADER_COL_FORMAT`` for
some color output to ``SPI_SHADER_32_R``.
Without this, the shader will be executed without the needed synchronization on
GFX10, and will hang on GFX11.
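
A minimal sketch of satisfying this (the register field and export notation here
are illustrative)::

   // Reserve export buffer space even if no real data is exported.
   SPI_SHADER_COL_FORMAT.COL0_EXPORT_FORMAT = SPI_SHADER_32_R;
   ...
   // The shader still ends with one export carrying the done flag; the
   // channels themselves may all be disabled.
   export(mrt0, off, off, off, off, done = true);
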
Drawing context setup
---------------------
Configuring POPS
^^^^^^^^^^^^^^^^
Most of the configuration is performed via the ``DB_SHADER_CONTROL`` register.
To enable POPS for the draw,
``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` should be set to 1.
On GFX9-10.3, ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` controls which
fragment shader invocations are considered overlapping:

* For pixel interlock, it must be set to 0 (1 sample).
* If sample interlock is sufficient (only synchronizing between invocations that
  have any common sample mask bits), it may be set to
  ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES`` — the number of sample coverage mask
  bits passed to the shader which is expected to use the sample mask to
  determine whether it's allowed to access the data for each of the samples. As
  of April 2023, PAL for some reason doesn't use non-1x
  ``POPS_OVERLAP_NUM_SAMPLES`` at all, even when using Direct3D Rasterizer
  Ordered Views or ``GL_INTEL_fragment_shader_ordering`` with sample shading
  (those APIs tie the interlock granularity to the shading frequency — Vulkan
  and OpenGL fragment shader interlock, however, allows specifying the interlock
  granularity independently of it, making it possible both to ask for finer
  synchronization guarantees and to require stronger ones than Direct3D ROVs can
  provide). However, with MSAA, on AMD hardware, pixel interlock generally
  performs *massively*, sometimes prohibitively, slower than sample interlock,
  because it causes fragment shader invocations along the common edge of
  adjacent primitives to be ordered as they cover the same pixels (even though
  they don't cover any common samples). So it's highly desirable for the driver
  to provide sample interlock, and to set ``POPS_OVERLAP_NUM_SAMPLES``
  accordingly, if the shader declares that it's enough for it via the execution
  mode.
On GFX11, when POPS is enabled, ``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE`` is
used in place of ``DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES`` from the earlier
architecture generations (and has a different bit offset in the register), and
``DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE`` must be set to 1. The GFX11
blending performance workaround overriding the intrinsic rate must not be
applied if POPS is used in the draw — the intrinsic rate override must be used
solely to control the interlock granularity in this case.
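
Putting this together, a hedged configuration sketch (``sample_interlock``
meaning the shader only requires sample interlock granularity; the GFX11 rate
value is an assumption based on it replacing ``POPS_OVERLAP_NUM_SAMPLES``)::

   db_shader_control.PRIMITIVE_ORDERED_PIXEL_SHADER = 1;
   if (gfx_level < GFX11) {
      // 0 (1 sample) gives pixel interlock; MSAA_EXPOSED_SAMPLES relaxes it
      // to sample interlock.
      db_shader_control.POPS_OVERLAP_NUM_SAMPLES =
         sample_interlock ? pa_sc_aa_config.MSAA_EXPOSED_SAMPLES : 0;
   } else {
      // GFX11: the intrinsic rate override controls the interlock granularity
      // instead, and must not be reused for the blending performance
      // workaround in POPS draws.
      db_shader_control.OVERRIDE_INTRINSIC_RATE_ENABLE = 1;
      db_shader_control.OVERRIDE_INTRINSIC_RATE =
         sample_interlock ? pa_sc_aa_config.MSAA_EXPOSED_SAMPLES : 0;
   }
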
No explicit flushes/synchronization are needed when changing the pipeline state
variables that may be involved in POPS, such as the rasterization sample count.
POPS automatically keeps synchronizing invocations even between draws with
different sample counts (invocations with common coverage mask bits are
considered overlapping by the hardware, regardless of what those samples
actually are — only the indices are important).
Also, on GFX11, POPS uses ``DB_Z_INFO.NUM_SAMPLES`` to determine the coverage
sample count, and it must be equal to ``PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES``
even if there's no depth/stencil target.
Hardware bug workarounds
^^^^^^^^^^^^^^^^^^^^^^^^
Early revisions of GFX9 — ``CHIP_VEGA10`` and ``CHIP_RAVEN`` — contain a
hardware bug that may result in a hang, and need a workaround to be enabled.
Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or
more depth/stencil target samples, ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP``
must be set to 1 for draws that satisfy this condition. In PAL, this is the
``waMiscPopsMissedOverlap`` workaround. It results in lower performance in
those cases, increasing the frame time by around 1.5 to 2 times in
`nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
on the RX Vega 10, but it only applies to a pretty rare case (8x+ MSAA) and is
mandatory there to ensure stability.
Also, even though ``DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP`` is not required
on chips other than the ``CHIP_VEGA10`` and ``CHIP_RAVEN`` GFX9 revisions, if
it's enabled for some reason on GFX10.1 (``CHIP_NAVI10``, ``CHIP_NAVI12``,
``CHIP_NAVI14``), and the draw uses POPS,
``DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL`` must be set to
``PSLC_ON_HANG_ONLY`` to avoid a hang (see ``waStalledPopsMode`` in PAL).
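
As a condition sketch (chip and field names as used above)::

   // CHIP_VEGA10 / CHIP_RAVEN: drain the PS on overlap with 8x+ samples to
   // avoid the hang (waMiscPopsMissedOverlap).
   if ((chip == CHIP_VEGA10 || chip == CHIP_RAVEN) && pops_enabled &&
       (rasterization_samples >= 8 || depth_stencil_samples >= 8)) {
      db_dfsm_control.POPS_DRAIN_PS_ON_OVERLAP = 1;
   }
   // GFX10.1 only: if the draining is enabled for any reason in a POPS draw,
   // partial squad launch must be restricted as well (waStalledPopsMode).
   if (gfx_level == GFX10_1 && pops_enabled &&
       db_dfsm_control.POPS_DRAIN_PS_ON_OVERLAP) {
      db_render_override2.PARTIAL_SQUAD_LAUNCH_CONTROL = PSLC_ON_HANG_ONLY;
   }
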
Out-of-order rasterization interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is a largely unresearched topic currently. However, considering that POPS
is primarily Depth Block functionality, its interaction with out-of-order
rasterization may be expected to resemble that of depth/stencil testing.
If the shader specifies an ordered interlock execution mode, out-of-order
rasterization likely must not be enabled implicitly.
As of April 2023, PAL doesn't have any rules specifically for POPS in the logic
determining whether out-of-order rasterization can be enabled automatically.
Some of the POPS usage cases may possibly be covered by the rule that always
disables out-of-order rasterization if the shader writes to Unordered Access
Views (storage resources), though fragment shader interlock can be used for
read-only purposes too (for ordering between draws that only read per-pixel data
and draws that may write it), so that may be an oversight.
Explicitly enabled relaxed rasterization order modifies the concept of
rasterization order itself in Vulkan, so from the point of view of the
specification of fragment shader interlock, relaxed rasterization order should
still be applicable regardless of whether the shader requests ordered interlock.
PAL also doesn't make any POPS-specific exceptions here as of April 2023.
Variable-rate shading interaction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
On GFX10.3, enabling ``DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER`` forces
the shading rate to be 1x1, thus the
``fragmentShadingRateWithFragmentShaderInterlock`` Vulkan device property must
be false.
On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the
``fragmentShadingRateWithFragmentShaderInterlock`` property must be true.
However, if ``PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS`` is set,
enabling POPS will force 1x1 shading rate.
The widest interlock granularity available on GFX11 — with the lowest possible
Depth Block intrinsic rate, 1x — is per-fine-pixel, however. There's no
synchronization between coarse fragment shader invocations if they don't cover
common fine pixels, so the ``fragmentShaderShadingRateInterlock`` Vulkan device
feature is not available.
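
In terms of the exposed Vulkan capabilities, this could translate to something
like the following sketch (not the exact driver code)::

   fragmentShadingRateWithFragmentShaderInterlock =
      gfx_level >= GFX11 &&
      !PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS;
   // No synchronization between coarse invocations without common fine pixels.
   fragmentShaderShadingRateInterlock = false;
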
Additional configuration
^^^^^^^^^^^^^^^^^^^^^^^^
These are some largely unresearched options found in the register declarations.
PAL doesn't use them, so it's unknown if they make any significant difference.
No effect was found in `nvpro-samples/vk_order_independent_transparency <https://github.com/nvpro-samples/vk_order_independent_transparency>`_
during testing on GFX9 ``CHIP_RAVEN`` and GFX11 ``CHIP_NAVI31``.

* ``DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED`` on GFX9-10.3.
* ``PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS`` on GFX10+.