Name

    NV_gpu_program5_mem_extended

Name Strings

    GL_NV_gpu_program5_mem_extended

Contact

    Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:         October 30, 2012
    NVIDIA Revision:            1

Number

    OpenGL Extension #434

Dependencies

    NV_gpu_program5 is required.

    This extension is written against the NV_gpu_program5 extension
    specification, which itself is written against the NV_gpu_program4 and
    OpenGL 2.0 Specifications.

    This extension interacts trivially with EXT_shader_image_load_store,
    NV_shader_storage_buffer_object, and NV_compute_program5.

Overview

    This extension provides a new set of storage modifiers that can be used by
    NV_gpu_program5 assembly program instructions loading from or storing to
    various forms of GPU memory.  In particular, we provide support for loads
    and stores using the storage modifiers:

        .F16X2  .F16X4  .F16    (for 16-bit floating-point scalars/vectors)
        .S8X2   .S8X4           (for 8-bit signed integer vectors)
        .S16X2  .S16X4          (for 16-bit signed integer vectors)
        .U8X2   .U8X4           (for 8-bit unsigned integer vectors)
        .U16X2  .U16X4          (for 16-bit unsigned integer vectors)

    These modifiers are allowed for the following load/store instructions:

        LDC             Load from constant buffer

        LOAD            Global load
        STORE           Global store

        LOADIM          Image load (via EXT_shader_image_load_store)
        STOREIM         Image store (via EXT_shader_image_load_store)

        LDB             Load from storage buffer (via 
                          NV_shader_storage_buffer_object) 
        STB             Store to storage buffer (via 
                          NV_shader_storage_buffer_object) 

        LDS             Load from shared memory (via NV_compute_program5)
        STS             Store to shared memory (via NV_compute_program5)

    For assembly programs prior to this extension, it was necessary to access
    memory using packed types and then unpack with additional shader
    instructions.

    Similar capabilities have already been provided in the OpenGL Shading
    Language (GLSL) via the NV_gpu_shader5 extension, using the extended data
    types provided there (e.g., "float16_t", "u8vec4", "s16vec2").

New Procedures and Functions

    None.

New Tokens

    None.

Additions to Chapter 2 of the OpenGL 2.0 Specification (OpenGL Operation)

    (All modifications are relative to Section 2.X, GPU Programs, from the
     NV_gpu_program4 specification.)

    Modify Section 2.X.2, Program Grammar

    (add after the long list of grammar rules) If a program specifies the
    NV_gpu_program5_mem_extended program option, the following rules are added
    to the NV_gpu_program5 base program grammar:

    <opModifier>            ::= "F16X2"
                              | "F16X4"
                              | "S8X2"
                              | "S8X4"
                              | "S16X2"
                              | "S16X4"
                              | "U8X2"
                              | "U8X4"
                              | "U16X2"
                              | "U16X4"

    (Note:  This extension also provides new capabilities for the "F16"
     modifier.  Since it was already supported in NV_gpu_program5, it isn't
     being added to the grammar here.)


    Modify Section 2.X.4.1, Program Instruction Modifiers

    (add to Table X.14 of the NV_gpu_program4 specification.)

      Modifier  Description
      --------  ---------------------------------------------------
      F16       Convert to or from one 16-bit floating-point value, 
                or access one 16-bit floating-point value

      F16X2     Access two 16-bit floating-point values
      F16X4     Access four 16-bit floating-point values
      S8X2      Access two 8-bit signed integer values
      S8X4      Access four 8-bit signed integer values
      S16X2     Access two 16-bit signed integer values
      S16X4     Access four 16-bit signed integer values
      U8X2      Access two 8-bit unsigned integer values
      U8X4      Access four 8-bit unsigned integer values
      U16X2     Access two 16-bit unsigned integer values
      U16X4     Access four 16-bit unsigned integer values

    (modify discussion of storage modifiers for load and store operations,
     adding the entries added to the table above)

    For load and store operations, the "F32", "F32X2", "F32X4", "F64",
    "F64X2", "F64X4", "S8", "S8X2", "S8X4", "S16", "S16X2", "S16X4", "S32",
    "S32X2", "S32X4", "S64", "S64X2", "S64X4", "U8", "U8X2", "U8X4", "U16",
    "U16X2", "U16X4", "U32", "U32X2", "U32X4", "U64", "U64X2", "U64X4", "F16",
    "F16X2", and "F16X4" storage modifiers control how data are loaded from or
    stored to memory. ...


    Modify Section 2.X.4.5, Program Memory Access, from NV_gpu_program5

    (update pseudocode for BufferMemoryLoad)

      result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
      {
        result_t_vec result = { 0, 0, 0, 0 };
        switch (modifier) {
        
        /* Existing cases and code from NV_gpu_program5 unchanged. */

        case F16:
            result.x = ((float16_t *)address)[0];
            break;
        case F16X2:
            result.x = ((float16_t *)address)[0];
            result.y = ((float16_t *)address)[1];
            break;
        case S8X2:
            result.x = ((int8_t *)address)[0];
            result.y = ((int8_t *)address)[1];
            break;
        case S8X4:
            result.x = ((int8_t *)address)[0];
            result.y = ((int8_t *)address)[1];
            result.z = ((int8_t *)address)[2];
            result.w = ((int8_t *)address)[3];
            break;
        case S16X2:
            result.x = ((int16_t *)address)[0];
            result.y = ((int16_t *)address)[1];
            break;
        case S16X4:
            result.x = ((int16_t *)address)[0];
            result.y = ((int16_t *)address)[1];
            result.z = ((int16_t *)address)[2];
            result.w = ((int16_t *)address)[3];
            break;
        case U8X2:
            result.x = ((uint8_t *)address)[0];
            result.y = ((uint8_t *)address)[1];
            break;
        case U8X4:
            result.x = ((uint8_t *)address)[0];
            result.y = ((uint8_t *)address)[1];
            result.z = ((uint8_t *)address)[2];
            result.w = ((uint8_t *)address)[3];
            break;
        case U16X2:
            result.x = ((uint16_t *)address)[0];
            result.y = ((uint16_t *)address)[1];
            break;
        case U16X4:
            result.x = ((uint16_t *)address)[0];
            result.y = ((uint16_t *)address)[1];
            result.z = ((uint16_t *)address)[2];
            result.w = ((uint16_t *)address)[3];
            break;
        }
        return result;
      }

    (update pseudocode for BufferMemoryStore)

      void BufferMemoryStore(char *address, operand_t_vec operand, 
                             OpModifier modifier)
      {
        switch (modifier) {

        /* Existing cases and code from NV_gpu_program5 unchanged. */

        case F16:
            ((float16_t *)address)[0] = operand.x;
            break;
        case F16X2:
            ((float16_t *)address)[0] = operand.x;
            ((float16_t *)address)[1] = operand.y;
            break;
        case S8X2:
            ((int8_t *)address)[0] = operand.x;
            ((int8_t *)address)[1] = operand.y;
            break;
        case S8X4:
            ((int8_t *)address)[0] = operand.x;
            ((int8_t *)address)[1] = operand.y;
            ((int8_t *)address)[2] = operand.z;
            ((int8_t *)address)[3] = operand.w;
            break;
        case S16X2:
            ((int16_t *)address)[0] = operand.x;
            ((int16_t *)address)[1] = operand.y;
            break;
        case S16X4:
            ((int16_t *)address)[0] = operand.x;
            ((int16_t *)address)[1] = operand.y;
            ((int16_t *)address)[2] = operand.z;
            ((int16_t *)address)[3] = operand.w;
            break;
        case U8X2:
            ((uint8_t *)address)[0] = operand.x;
            ((uint8_t *)address)[1] = operand.y;
            break;
        case U8X4:
            ((uint8_t *)address)[0] = operand.x;
            ((uint8_t *)address)[1] = operand.y;
            ((uint8_t *)address)[2] = operand.z;
            ((uint8_t *)address)[3] = operand.w;
            break;
        case U16X2:
            ((uint16_t *)address)[0] = operand.x;
            ((uint16_t *)address)[1] = operand.y;
            break;
        case U16X4:
            ((uint16_t *)address)[0] = operand.x;
            ((uint16_t *)address)[1] = operand.y;
            ((uint16_t *)address)[2] = operand.z;
            ((uint16_t *)address)[3] = operand.w;
            break;
        }
      }

    (modify paragraph to indicate the alignment requirement for new storage
    modifiers) The address used for global memory loads or stores or offset
    used for constant buffer loads must be aligned to the fetch size
    corresponding to the storage opcode modifier.  For S8 and U8, the offset
    has no alignment requirements.  For F16, S8X2, S16, U8X2, and U16, the
    offset must be a multiple of two basic machine units.  For F32, S32, and
    U32, F16X2, S16X2, U16X2, S8X4, and U8X4, the offset must be a multiple of
    four.  For F32X2, F64, S32X2, S64, U32X2, U64, S16X4, and U16X4, the
    offset must be a multiple of eight.  ...  If an offset is not correctly
    aligned, the values returned by a buffer memory load will be undefined,
    and the effects of a buffer memory store will also be undefined.


    Modify Section 2.X.6, Program Options

    + Extended Memory Format Support (NV_gpu_program5_mem_extended)

    If a program specifies the "NV_gpu_program5_mem_extended" option, it may
    use the "F16", "F16X2", "F16X4", "S8X2", "S8X4", "S16X2", "S16X4", "U8X2",
    "U8X4", "U16X2", and "U16X4" storage modifiers on instructions loading
    values from memory or storing values to memory (LDC, LOAD, STORE, LOADIM,
    STOREIM, LDB, STB, LDS, STS).


Additions to Chapter 3 of the OpenGL 2.0 Specification (Rasterization)

    None.

Additions to Chapter 4 of the OpenGL 2.0 Specification (Per-Fragment
Operations and the Frame Buffer)

    None.

Additions to Chapter 5 of the OpenGL 2.0 Specification (Special Functions)

    None.

Additions to Chapter 6 of the OpenGL 2.0 Specification (State and
State Requests)

    None.

Additions to Appendix A of the OpenGL 2.0 Specification (Invariance)

    None.

Additions to the AGL/GLX/WGL Specifications

    None.

Dependencies on EXT_shader_image_load_store, NV_shader_storage_buffer_object,
and NV_compute_program5

    If EXT_shader_image_load_store is not supported, references to the LOADIM
    and STOREIM opcodes should be removed.

    If NV_shader_storage_buffer_object is not supported, references to the LDB
    and STB opcodes should be removed.

    If NV_compute_program5 is not supported, references to the LDS and STS
    opcodes should be removed.

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Issues

    (1) Should this extension have its own extension string entry, or should
        its existence be inferred from the NV_gpu_program5 extension or some
        other extension?

      RESOLVED:  Provide a separate extension string entry, since this
      functionality was added after NV_gpu_program5 was published and may not
      be available on older drivers supporting NV_gpu_program5.

Revision History

    Revision 1, October 30, 2012 (pbrown):  Initial revision.
