Name

    ARB_fragment_shader_interlock

Name Strings

    GL_ARB_fragment_shader_interlock

Contact

    Slawomir Grajewski, Intel  (slawomir.grajewski 'at' intel.com)

Contributors

    Contributors to INTEL_fragment_shader_ordering
    Contributors to NV_fragment_shader_interlock

Notice

    Copyright (c) 2015 The Khronos Group Inc. Copyright terms at
        http://www.khronos.org/registry/speccopyright.html

Status

    Complete. Approved by the ARB on June 26, 2015.
    Ratified by the Khronos Board of Promoters on August 7, 2015.

Version

    Last Modified Date:        May 7, 2015
    Revision:                  2

Number

    ARB Extension #177

Dependencies

    This extension is written against the OpenGL 4.5 (Core Profile)
    Specification.

    This extension is written against version 4.50 (revision 5) of the OpenGL
    Shading Language Specification.

    OpenGL 4.2 or ARB_shader_image_load_store is required; GLSL 4.20 is
    required.

Overview

    In unextended OpenGL 4.5, applications may produce a
    large number of fragment shader invocations that perform loads and
    stores to memory using image uniforms, atomic counter uniforms,
    buffer variables, or pointers. The order in which loads and stores
    to common addresses are performed by different fragment shader
    invocations is largely undefined.  For algorithms that use shader
    writes and touch the same pixels more than once, one or more of the
    following techniques may be required to ensure proper execution ordering:

      * inserting Finish or WaitSync commands to drain the pipeline between
        different "passes" or "layers";

      * using only atomic memory operations to write to shader memory (which
        may be relatively slow and limits how memory may be updated); or

      * injecting spin loops into shaders to prevent multiple shader
        invocations from touching the same memory concurrently.

    This extension provides new GLSL built-in functions
    beginInvocationInterlockARB() and endInvocationInterlockARB() that delimit
    a critical section of fragment shader code.  For pairs of shader
    invocations with "overlapping" coverage in a given pixel, the OpenGL
    implementation will guarantee that the critical section of the fragment
    shader will be executed for only one fragment at a time.

    There are four different interlock modes supported by this extension,
    which are identified by layout qualifiers.  The qualifiers
    "pixel_interlock_ordered" and "pixel_interlock_unordered" provide mutual
    exclusion in the critical section for any pair of fragments corresponding
    to the same pixel.  When using multisampling, the qualifiers
    "sample_interlock_ordered" and "sample_interlock_unordered" only provide
    mutual exclusion for pairs of fragments that both cover at least one
    common sample in the same pixel; these are recommended for performance if
    shaders use per-sample data structures.

    Additionally, when the "pixel_interlock_ordered" or
    "sample_interlock_ordered" layout qualifier is used, the interlock also
    guarantees that the critical section for multiple shader invocations with
    "overlapping" coverage will be executed in the order in which the
    primitives were processed by the GL.  Such a guarantee is useful for
    applications like blending in the fragment shader, where an application
    requires fragment values to be composited in the framebuffer in
    primitive order.

    This extension can be useful for algorithms that need to access per-pixel
    data structures via shader loads and stores.  Such algorithms using this
    extension can access such data structures in the critical section without
    worrying about other invocations for the same pixel accessing the data
    structures concurrently.  Additionally, the ordering guarantees are useful
    for cases where the API ordering of fragments is meaningful.  For example,
    applications may be able to execute programmable blending operations in
    the fragment shader, where the destination buffer is read via image loads
    and the final value is written via image stores.
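
    The programmable blending case described above might be sketched as
    follows.  This is a non-normative illustration, not part of the
    extension:  the names uDestination and vColor are hypothetical, the
    "source over" blend is only one possible operation, and a single-sample
    framebuffer is assumed.

      ```glsl
      #version 450
      #extension GL_ARB_fragment_shader_interlock : require

      // Critical sections for the same pixel run one at a time, in
      // primitive order.
      layout(pixel_interlock_ordered) in;

      // Hypothetical destination image, bound by the application.
      layout(binding = 0, rgba8) coherent uniform image2D uDestination;

      in vec4 vColor;   // hypothetical interpolated source color

      void main()
      {
          ivec2 pixel = ivec2(gl_FragCoord.xy);

          beginInvocationInterlockARB();
          vec4 dst = imageLoad(uDestination, pixel);
          // "Source over" blend with non-premultiplied source alpha.
          vec4 blended = vec4(mix(dst.rgb, vColor.rgb, vColor.a),
                              vColor.a + dst.a * (1.0 - vColor.a));
          imageStore(uDestination, pixel, blended);
          endInvocationInterlockARB();
      }
      ```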

New Procedures and Functions

    None.

New Tokens

    None.

Modifications to the OpenGL Shading Language Specification, Version 4.50

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_ARB_fragment_shader_interlock : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_ARB_fragment_shader_interlock           1
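
    As a non-normative sketch, a shader that prefers to fail with an explicit
    message when the extension is unavailable could combine the "enable"
    behavior with a test of the #define above.  Using "require" instead would
    make the #extension line itself produce the error.

      ```glsl
      #version 450
      #extension GL_ARB_fragment_shader_interlock : enable

      // The macro is defined only when the extension is supported.
      #ifndef GL_ARB_fragment_shader_interlock
      #error GL_ARB_fragment_shader_interlock is not supported
      #endif
      ```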


    Modify Section 4.4.1.3, Fragment Shader Inputs (p. 63)

    (add to the list of layout qualifiers containing "early_fragment_tests",
     p. 63, and modify the surrounding language to reflect that multiple
     layout qualifiers are supported on "in")

      layout-qualifier-id
        pixel_interlock_ordered
        pixel_interlock_unordered
        sample_interlock_ordered
        sample_interlock_unordered

    (add to the end of the section, p. 63)

    The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered",
    "sample_interlock_ordered", and "sample_interlock_unordered" control the
    ordering of the execution of shader invocations between calls to the
    built-in functions beginInvocationInterlockARB() and
    endInvocationInterlockARB(), as described in section 8.13.3. A
    compile or link error will be generated if more than one of these layout
    qualifiers is specified in shader code. If a program containing a
    fragment shader includes none of these layout qualifiers, it is as
    though "pixel_interlock_ordered" were specified.
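
    For illustration only, a shader selecting one interlock mode might be
    declared as sketched below.  The critical section body is left as a
    placeholder comment.

      ```glsl
      #version 450
      #extension GL_ARB_fragment_shader_interlock : require

      // Exactly one interlock mode is selected for the fragment shader.
      layout(sample_interlock_unordered) in;

      // Specifying a second mode anywhere in the shader code would be an
      // error:
      //   layout(pixel_interlock_ordered) in;

      void main()
      {
          beginInvocationInterlockARB();
          // ... accesses to shared per-sample data would go here ...
          endInvocationInterlockARB();
      }
      ```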

    Add to the end of Section 8.13, Fragment Processing Functions (p. 170)

    8.13.3, Fragment Shader Execution Ordering Functions

    By default, fragment shader invocations are generally executed in
    undefined order. Multiple fragment shader invocations may be executed
    concurrently, including multiple invocations corresponding to a single
    pixel. Additionally, fragment shader invocations for a single pixel might
    not be processed in the order in which the primitives generating the
    fragments were specified in the OpenGL API.

    The paired functions beginInvocationInterlockARB() and
    endInvocationInterlockARB() allow shaders to specify a critical section,
    inside which stronger execution ordering is guaranteed.  When using the
    "pixel_interlock_ordered" or "pixel_interlock_unordered" qualifier,
    ordering guarantees are provided for any pair of fragment shader
    invocations X and Y triggered by fragments A and B corresponding to the
    same pixel. When using the "sample_interlock_ordered" or
    "sample_interlock_unordered" qualifier, ordering guarantees are provided
    for any pair of fragment shader invocations X and Y triggered by fragments
    A and B that correspond to the same pixel, where at least one sample of
    the pixel is covered by both fragments. No ordering guarantees are
    provided for pairs of fragment shader invocations corresponding to
    different pixels. Additionally, no ordering guarantees are provided for
    pairs of fragment shader invocations corresponding to the same fragment.
    When multisampling is enabled and the framebuffer has sample buffers,
    multiple fragment shader invocations may result from a single fragment due
    to the use of the "sample" auxiliary storage qualifier, OpenGL API
    commands forcing multiple shader invocations per fragment, or for other
    implementation-dependent reasons.

    When using the "pixel_interlock_unordered" or "sample_interlock_unordered"
    qualifier, the interlock will ensure that the critical sections of
    fragment shader invocations X and Y with overlapping coverage will never
    execute concurrently. That is, invocation X is guaranteed to complete its
    call to endInvocationInterlockARB() before invocation Y completes its call
    to beginInvocationInterlockARB(), or vice versa.
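
    The following non-normative sketch shows a case where mutual exclusion
    without ordering is sufficient, because the update (a per-pixel minimum
    and maximum depth) gives the same result in any order.  The image name
    uDepthRange is hypothetical, and the application is assumed to have
    cleared it to (1.0, 0.0) beforehand.

      ```glsl
      #version 450
      #extension GL_ARB_fragment_shader_interlock : require

      layout(pixel_interlock_unordered) in;

      // Hypothetical per-pixel storage: x = nearest depth, y = farthest.
      layout(binding = 0, rg32f) coherent uniform image2D uDepthRange;

      void main()
      {
          ivec2 pixel = ivec2(gl_FragCoord.xy);
          float z = gl_FragCoord.z;

          beginInvocationInterlockARB();
          // Read-modify-write that must not interleave with another
          // invocation for the same pixel.
          vec2 range = imageLoad(uDepthRange, pixel).xy;
          range = vec2(min(range.x, z), max(range.y, z));
          imageStore(uDepthRange, pixel, vec4(range, 0.0, 0.0));
          endInvocationInterlockARB();
      }
      ```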

    When using the "pixel_interlock_ordered" or "sample_interlock_ordered"
    layout qualifier, the critical sections of invocations X and Y with
    overlapping coverage will be executed in a specific order, based on the
    relative order assigned to their fragments A and B.  If fragment A is
    considered to precede fragment B, the critical section of invocation X is
    guaranteed to complete before the critical section of invocation Y begins.
    When a pair of fragments A and B have overlapping coverage, fragment A is
    considered to precede fragment B if

      * the OpenGL API command producing fragment A was called prior to the
        command producing B, or

      * the point, line, triangle, [[compatibility profile: quadrilateral,
        polygon,]] or patch primitive producing fragment A appears earlier in
        the same strip, loop, fan, or independent primitive list producing
        fragment B.

    When [[compatibility profile: decomposing quadrilateral or polygon
    primitives or]] tessellating a single patch primitive, multiple
    primitives may be generated in an undefined implementation-dependent
    order.  When fragments A and B are generated from such unordered
    primitives, their ordering is also implementation-dependent.

    If fragment shader X completes its critical section before fragment shader
    Y begins its critical section, all stores to memory performed in the
    critical section of invocation X using a pointer, image uniform, atomic
    counter uniform, or buffer variable qualified by "coherent" are guaranteed
    to be visible to any reads of the same types of variable performed in the
    critical section of invocation Y.

    If multisampling is disabled, or if the framebuffer does not include
    sample buffers, fragment coverage is computed per-pixel. In this case,
    the "sample_interlock_ordered" or "sample_interlock_unordered" layout
    qualifiers are treated as "pixel_interlock_ordered" or
    "pixel_interlock_unordered", respectively.

      Syntax:

        void beginInvocationInterlockARB(void);
        void endInvocationInterlockARB(void);

      Description:

    The beginInvocationInterlockARB() and endInvocationInterlockARB()
    functions may only be placed inside the function main() of a fragment
    shader and may not be called within any flow control.  These functions
    may not be called after a return statement in the function main(), but
    may be called after a discard statement.  A compile- or link-time error
    will be generated if main() calls either function more than once,
    contains a call to one function without a matching call to the other, or
    calls endInvocationInterlockARB() before calling
    beginInvocationInterlockARB().
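
    The placement rules above might be read as in the following non-normative
    sketch.  The names uCount and vAlpha are hypothetical, and the
    disallowed forms appear only as comments.

      ```glsl
      #version 450
      #extension GL_ARB_fragment_shader_interlock : require

      layout(pixel_interlock_ordered) in;
      layout(binding = 0, r32ui) coherent uniform uimage2D uCount;

      in float vAlpha;

      void main()
      {
          // Permitted: a discard statement may precede the interlock calls.
          if (vAlpha < 0.5)
              discard;

          // Not permitted: a call inside flow control, such as
          //   if (vAlpha > 0.9) { beginInvocationInterlockARB(); }
          // Not permitted: a call from any function other than main().

          beginInvocationInterlockARB();   // single call, directly in main()
          ivec2 pixel = ivec2(gl_FragCoord.xy);
          uint n = imageLoad(uCount, pixel).x;
          imageStore(uCount, pixel, uvec4(n + 1u, 0u, 0u, 0u));
          endInvocationInterlockARB();     // single matching call, after begin
      }
      ```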

Additions to the AGL/GLX/WGL Specifications

    None.

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Issues

    (1) When using multisampling, the OpenGL specification permits
        multiple fragment shader invocations to be generated for a single
        fragment.  For example, the "sample" auxiliary storage qualifier or
        the MinSampleShading() OpenGL API command can be used to force
        per-sample shading.  What execution ordering guarantees are provided
        between fragment shader invocations generated from the same fragment?

      RESOLVED:  We don't provide any ordering guarantees in this extension.
      This implies that when using multisampling, there is no guarantee that
      two fragment shader invocations for the same fragment won't be executing
      their critical sections concurrently.  This could cause problems for
      algorithms sharing data structures between all the samples of a pixel
      unless accesses to these data structures are performed atomically.

      When using per-sample shading, the interlock we provide *does* guarantee
      that no two invocations corresponding to the same sample execute the
      critical section concurrently.  If a separate set of data structures is
      provided for each sample, no conflicts should occur within the critical
      section.

      Note that in addition to the per-sample shading options in the shading
      language and API, implementations may provide multisample antialiasing
      modes where the implementation can't simply run the fragment shader once
      and broadcast results to a large set of covered samples.
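
      As a non-normative sketch of the "separate set of data structures for
      each sample" approach, the shader below keeps one image layer per
      sample and touches only the samples covered by the current invocation.
      The names uSampleColor and vColor and the layer-per-sample layout are
      assumptions made for illustration, and no more than 32 samples are
      assumed.

        ```glsl
        #version 450
        #extension GL_ARB_fragment_shader_interlock : require

        layout(sample_interlock_ordered) in;

        // Hypothetical layout: layer s holds the color of sample s.
        layout(binding = 0, rgba8) coherent uniform image2DArray uSampleColor;

        in vec4 vColor;

        void main()
        {
            ivec2 pixel = ivec2(gl_FragCoord.xy);

            beginInvocationInterlockARB();
            // The interlock calls are outside flow control; the loop
            // between them visits only samples covered by this invocation.
            for (int s = 0; s < gl_NumSamples; ++s) {
                if ((gl_SampleMaskIn[0] & (1 << s)) != 0) {
                    ivec3 texel = ivec3(pixel, s);
                    vec4 dst = imageLoad(uSampleColor, texel);
                    imageStore(uSampleColor, texel,
                               mix(dst, vColor, vColor.a));
                }
            }
            endInvocationInterlockARB();
        }
        ```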

    (2) What performance differences are expected between shaders using the
       "pixel" and "sample" layout qualifier variants in this extension (e.g.,
       "pixel_interlock_ordered" and "sample_interlock_ordered")?

      RESOLVED:  We expect that shaders using "sample" qualifiers may have
      higher performance, since the implementation need not order pairs of
      fragments that touch the same pixel with "complementary" coverage.  Such
      situations are fairly common:  when two adjacent triangles combine to
      cover a given pixel, two fragments will be generated for the pixel but
      no sample will be covered by both.  When using "sample" qualifiers, the
      invocations for both fragments can run concurrently.  When using "pixel"
      qualifiers, the critical section for one fragment must wait until the
      critical section for the other fragment completes.

    (3) What performance differences are expected between shaders using the
       "ordered" and "unordered" layout qualifier variants in this extension
       (e.g., "pixel_interlock_ordered" and "pixel_interlock_unordered")?

      RESOLVED:  We expect that shaders using "unordered" may have higher
      performance, since the critical section implementation doesn't need to
      ensure that all previous invocations with overlapping coverage have
      completed their critical sections.  Some algorithms (e.g., building data
      structures in order-independent transparency algorithms) will require
      mutual exclusion when updating per-pixel data structures, but do not
      require that shaders execute in a specific ordering.
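
      A data-structure-building pass of this kind might look like the
      following non-normative sketch, which appends each fragment to a
      fixed-size per-pixel list.  The names uCount, uFragments, vColor, and
      MAX_FRAGMENTS are hypothetical, and a later pass (not shown) is
      assumed to sort and composite the recorded fragments.

        ```glsl
        #version 450
        #extension GL_ARB_fragment_shader_interlock : require

        layout(pixel_interlock_unordered) in;

        // Hypothetical storage: uCount holds the number of fragments
        // recorded per pixel; layer k of uFragments holds the k-th one.
        layout(binding = 0, r32ui)  coherent uniform uimage2D      uCount;
        layout(binding = 1, rg32ui) coherent uniform uimage2DArray uFragments;

        const uint MAX_FRAGMENTS = 8u;  // assumed layer count of uFragments

        in vec4 vColor;

        void main()
        {
            ivec2 pixel = ivec2(gl_FragCoord.xy);
            // Pack color and depth outside the critical section.
            uvec4 entry = uvec4(packUnorm4x8(vColor),
                                floatBitsToUint(gl_FragCoord.z), 0u, 0u);

            beginInvocationInterlockARB();
            uint n = imageLoad(uCount, pixel).x;
            if (n < MAX_FRAGMENTS) {
                imageStore(uFragments, ivec3(pixel, int(n)), entry);
                imageStore(uCount, pixel, uvec4(n + 1u, 0u, 0u, 0u));
            }
            endInvocationInterlockARB();
        }
        ```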

    (4) Are fragment shaders using this extension allowed to write outputs?
        If so, is there any guarantee on the order in which such outputs are
        written to the framebuffer?

      RESOLVED:  Yes, fragment shaders with critical sections may still write
      outputs.  If fragment shader outputs are written, they are stored or
      blended into the framebuffer in API order, as is the case for fragment
      shaders not using this extension.

    (5) What considerations apply when using this extension to implement a
        programmable form of conventional blending using image stores?

      RESOLVED:  Per-fragment operations performed in the pipeline following
      fragment shader execution obviously have no effect on image stores
      executing during fragment shader execution.  In particular, multisample
      operations such as broadcasting a single fragment output to multiple
      samples or modifying the coverage with alpha-to-coverage or a shader
      coverage mask output value have no effect.  Fragments cannot be killed
      before fragment shader blending using the fixed-function alpha test or
      using the depth test with a Z value produced by the shader.  Fragments
      will normally not be killed by fixed-function depth or stencil tests,
      but those tests can be enabled before fragment shader invocations using
      the layout qualifier "early_fragment_tests".  Any required
      fixed-function features that need to be handled before programmable
      blending that aren't enabled by "early_fragment_tests" would need to be
      emulated in the shader.

      Note also that blend computations performed in the shader are not
      guaranteed to produce results that are bit-identical to those produced
      by fixed-function blending hardware, even if mathematically equivalent
      algorithms are used.
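
      The considerations above might be combined as in the following
      non-normative sketch, which enables early depth and stencil tests and
      emulates an alpha test in the shader.  The names uDestination, vColor,
      and uAlphaRef and the additive blend are hypothetical.  Because the
      depth test runs before the shader here, any depth write would still
      occur for fragments that fail the emulated alpha test.

        ```glsl
        #version 450
        #extension GL_ARB_fragment_shader_interlock : require

        // Depth and stencil tests run before the shader, so fragments
        // failing them never reach the image stores below.
        layout(early_fragment_tests) in;
        layout(pixel_interlock_ordered) in;

        layout(binding = 0, rgba16f) coherent uniform image2D uDestination;

        uniform float uAlphaRef;   // hypothetical alpha test reference
        in vec4 vColor;

        void main()
        {
            ivec2 pixel = ivec2(gl_FragCoord.xy);
            bool passAlpha = vColor.a > uAlphaRef;   // emulated alpha test

            beginInvocationInterlockARB();
            // Only the stores are conditional; the interlock calls
            // themselves stay outside flow control.
            if (passAlpha) {
                vec4 dst = imageLoad(uDestination, pixel);
                imageStore(uDestination, pixel, dst + vColor * vColor.a);
            }
            endInvocationInterlockARB();
        }
        ```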

    (6) For operations accessing shared per-pixel data structures in the
        critical section, what operations (if any) must be performed in shader
        code to ensure that stores from one shader invocation are visible to
        the next?

      RESOLVED:  The "coherent" qualifier is required in the declaration of
      the shared data structures to ensure that writes performed by one
      invocation are visible to reads performed by another invocation.

      In shaders that don't use the interlock, "coherent" is not sufficient as
      there is no guarantee of the ordering of fragment shader invocations --
      even if invocation A can see the values written by another invocation B,
      there is no general guarantee that invocation A's read will be performed
      before invocation B's write.  The built-in function memoryBarrier() can
      be used to generate a weak ordering by which threads can communicate,
      but it doesn't order memory transactions between two separate
      invocations.  With the interlock, execution ordering between two threads
      from the same pixel is well-defined as long as the loads and stores are
      performed inside the critical section, and the use of "coherent" ensures
      that stores done by one invocation are visible to other invocations.
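
      As a non-normative sketch, the same requirement applied to a buffer
      variable might look as follows.  The block name PixelData, its member
      accum, the row-major per-pixel layout, and the uniform uWidth are all
      assumptions made for illustration.

        ```glsl
        #version 450
        #extension GL_ARB_fragment_shader_interlock : require

        layout(pixel_interlock_ordered) in;

        // "coherent" makes stores from one invocation visible to loads in
        // another; the interlock supplies the execution ordering.
        layout(std430, binding = 0) coherent buffer PixelData {
            vec4 accum[];   // hypothetical: one entry per pixel
        };

        uniform int uWidth;   // hypothetical framebuffer width in pixels
        in vec4 vColor;

        void main()
        {
            int index = int(gl_FragCoord.y) * uWidth + int(gl_FragCoord.x);

            beginInvocationInterlockARB();
            // Both the load and the store are inside the critical section.
            accum[index] = accum[index] * (1.0 - vColor.a)
                         + vColor * vColor.a;
            endInvocationInterlockARB();
        }
        ```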

    (7) Should we provide an explicit mechanism for shaders to indicate a
        critical section?  Or should we just automatically infer a critical
        section by analyzing shader code?  Or should we just wrap the entire
        fragment shader in a critical section?

      RESOLVED:  Provide an explicit critical section.

      We definitely don't want to wrap the entire shader in a critical section
      when a smaller section will suffice.  Doing so would hold off the
      execution of any other fragment shader invocation with the same (x,y)
      for the entire (potentially long) life of the fragment shader.  Hardware
      would need to track a large number of fragments awaiting execution, and
      may be so backed up that further fragments will be blocked even if they
      don't overlap with any fragments currently executing.  Providing a
      smaller critical section reduces the amount of time other fragments are
      blocked and allows implementations to perform useful work for
      conflicting fragments before they hit the critical section.

      While a compiler could analyze the code and wrap a critical section
      around all memory accesses, it may be difficult to determine which
      accesses actually require mutual exclusion and ordering, and which
      accesses are safe to do with no protection.  Requiring shaders to
      explicitly identify a critical section doesn't seem overwhelmingly
      burdensome, and allows applications to exclude memory accesses that
      they know to be "safe".
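
      A small critical section of this kind might be sketched as follows
      (non-normative).  Texturing and lighting run before the interlock, and
      only the read-modify-write of the shared image is protected.  All
      variable names and the simple lighting term are hypothetical.

        ```glsl
        #version 450
        #extension GL_ARB_fragment_shader_interlock : require

        layout(pixel_interlock_ordered) in;
        layout(binding = 0, rgba8) coherent uniform image2D uDestination;

        uniform sampler2D uAlbedo;

        in vec2 vTexCoord;
        in vec3 vNormal;

        void main()
        {
            // Work that needs no mutual exclusion can overlap with other
            // invocations for the same pixel.
            vec4 src = texture(uAlbedo, vTexCoord);
            float nDotL = max(dot(normalize(vNormal),
                                  vec3(0.0, 0.0, 1.0)), 0.0);
            src.rgb *= nDotL;
            ivec2 pixel = ivec2(gl_FragCoord.xy);

            // Only the access to shared per-pixel memory is serialized.
            beginInvocationInterlockARB();
            vec4 dst = imageLoad(uDestination, pixel);
            imageStore(uDestination, pixel, mix(dst, src, src.a));
            endInvocationInterlockARB();
        }
        ```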

    (8) What restrictions should be imposed on the use of the
        beginInvocationInterlockARB() and endInvocationInterlockARB() functions
        delimiting a critical section?

      RESOLVED:  We impose restrictions similar to those on the barrier()
      built-in function in tessellation control shaders to ensure that any
      shader using this functionality has a single critical section that can
      be easily identified during compilation.  In particular, we require that
      these functions be called in main() and don't permit them to be called
      in conditional flow control.

      These restrictions ensure that there is always exactly one call to the
      "begin" and "end" functions in a predictable location in the compiled
      shader code, and ensure that the compiler and hardware don't have to
      deal with unusual cases (like entering a critical section and never
      leaving, leaving a critical section without entering it, or trying to
      enter a critical section more than once).

Revision History

    Rev.    Date    Author        Changes
    ----  --------  --------     -----------------------------------------
     1    04/01/15  S.Grajewski  Initial version merging
                                 INTEL_fragment_shader_ordering with
                                 NV_fragment_shader_interlock

     2    05/07/15  S.Grajewski  Built-in functions
                                 beginInvocationInterlockARB() and
                                 endInvocationInterlockARB() now have ARB
                                 suffixes.
