Name

    NV_shader_thread_shuffle

Name Strings

    GL_NV_shader_thread_shuffle

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:         2/14/2014
    NVIDIA Revision:            3

Number

    OpenGL Extension #448

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the OpenGL 
    Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5

Overview

    Implementations of the OpenGL Shading Language may, but are not required, 
    to run multiple shader threads for a single stage as a SIMD thread group, 
    where individual execution threads are assigned to thread groups in an 
    undefined, implementation-dependent order.  This extension provides a set 
    of new features to the OpenGL Shading Language to share data between 
    multiple threads within a thread group. 
    
    Shaders using the new functionalities provided by this extension should 
    enable this functionality via the construct
    
        #extension GL_NV_shader_thread_shuffle : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread data sharing functionalities.

New Procedures and Functions

    None


New Tokens

    None
          

Modifications to The OpenGL Shading Language Specification, Version 4.30 
(Revision 07)

    Including the following line in a shader can be used to control the 
    language features described in this extension:

      #extension GL_NV_shader_thread_shuffle : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_shader_thread_shuffle         1


    Modify Section 8.3, Common Functions, p. 133
    
    (add a function to share data between threads in a thread group)

    Syntax:
    
        int    shuffleDownNV(int data,   uint index, uint width, 
                            [out bool threadIdValid])
        ivec2  shuffleDownNV(ivec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        ivec3  shuffleDownNV(ivec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        ivec4  shuffleDownNV(ivec4 data, uint index, uint width, 
                            [out bool threadIdValid])

        uint   shuffleDownNV(uint  data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec2  shuffleDownNV(uvec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec3  shuffleDownNV(uvec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec4  shuffleDownNV(uvec4 data, uint index, uint width, 
                            [out bool threadIdValid])

        float  shuffleDownNV(float data, uint index, uint width, 
                            [out bool threadIdValid])
        vec2   shuffleDownNV(vec2  data, uint index, uint width, 
                            [out bool threadIdValid])
        vec3   shuffleDownNV(vec3  data, uint index, uint width, 
                            [out bool threadIdValid])
        vec4   shuffleDownNV(vec4  data, uint index, uint width, 
                            [out bool threadIdValid])

        bool   shuffleDownNV(bool data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec2  shuffleDownNV(bvec2  data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec3  shuffleDownNV(bvec3  data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec4  shuffleDownNV(bvec4  data, uint index, uint width, 
                            [out bool threadIdValid])


        int    shuffleUpNV(int data,   uint index, uint width, 
                            [out bool threadIdValid])
        ivec2  shuffleUpNV(ivec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        ivec3  shuffleUpNV(ivec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        ivec4  shuffleUpNV(ivec4 data, uint index, uint width, 
                            [out bool threadIdValid])

        uint   shuffleUpNV(uint  data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec2  shuffleUpNV(uvec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec3  shuffleUpNV(uvec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec4  shuffleUpNV(uvec4 data, uint index, uint width, 
                            [out bool threadIdValid])

        float  shuffleUpNV(float data, uint index, uint width, 
                            [out bool threadIdValid])
        vec2   shuffleUpNV(vec2  data, uint index, uint width, 
                            [out bool threadIdValid])
        vec3   shuffleUpNV(vec3  data, uint index, uint width, 
                            [out bool threadIdValid])
        vec4   shuffleUpNV(vec4  data, uint index, uint width, 
                            [out bool threadIdValid])

        bool   shuffleUpNV(bool  data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec2  shuffleUpNV(bvec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec3  shuffleUpNV(bvec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec4  shuffleUpNV(bvec4 data, uint index, uint width, 
                            [out bool threadIdValid])


        int    shuffleXorNV(int data,   uint index, uint width, 
                            [out bool threadIdValid])
        ivec2  shuffleXorNV(ivec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        ivec3  shuffleXorNV(ivec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        ivec4  shuffleXorNV(ivec4 data, uint index, uint width, 
                            [out bool threadIdValid])

        uint   shuffleXorNV(uint  data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec2  shuffleXorNV(uvec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec3  shuffleXorNV(uvec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec4  shuffleXorNV(uvec4 data, uint index, uint width, 
                            [out bool threadIdValid])

        float  shuffleXorNV(float data, uint index, uint width, 
                            [out bool threadIdValid])
        vec2   shuffleXorNV(vec2  data, uint index, uint width, 
                            [out bool threadIdValid])
        vec3   shuffleXorNV(vec3  data, uint index, uint width, 
                            [out bool threadIdValid])
        vec4   shuffleXorNV(vec4  data, uint index, uint width, 
                            [out bool threadIdValid])

        bool   shuffleXorNV(bool  data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec2  shuffleXorNV(bvec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec3  shuffleXorNV(bvec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec4  shuffleXorNV(bvec4 data, uint index, uint width, 
                            [out bool threadIdValid])


        int    shuffleNV(int data,   uint index, uint width, 
                            [out bool threadIdValid])
        ivec2  shuffleNV(ivec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        ivec3  shuffleNV(ivec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        ivec4  shuffleNV(ivec4 data, uint index, uint width, 
                            [out bool threadIdValid])

        uint   shuffleNV(uint  data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec2  shuffleNV(uvec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec3  shuffleNV(uvec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        uvec4  shuffleNV(uvec4 data, uint index, uint width, 
                            [out bool threadIdValid])

        float  shuffleNV(float data, uint index, uint width, 
                            [out bool threadIdValid])
        vec2   shuffleNV(vec2  data, uint index, uint width, 
                            [out bool threadIdValid])
        vec3   shuffleNV(vec3  data, uint index, uint width, 
                            [out bool threadIdValid])
        vec4   shuffleNV(vec4  data, uint index, uint width, 
                            [out bool threadIdValid])

        bool   shuffleNV(bool  data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec2  shuffleNV(bvec2 data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec3  shuffleNV(bvec3 data, uint index, uint width, 
                            [out bool threadIdValid])
        bvec4  shuffleNV(bvec4 data, uint index, uint width, 
                            [out bool threadIdValid])

    Shuffle functions allow active threads within a thread group to exchange
    data using 4 different modes (up, down, xor, indexed).  They all load
    the operand <data> which can be different per thread and return a value
    read from the source thread at an address computed with the <index> and
    the <width> operands.

    <index> is a 5 bits value in the range 0 to 31, MSBs are ignored.
    <threadIdValid> is an optional operand.  It hold the value of the predicate
    that specifies if the source thread from which the current thread reads
    data is in range or not.  
      
    <width> is used for segmenting the thread group in multiple segments.  The
    segments need to be subdivided equally, so <width> needs to be a power of 2
    in the range 2 to 32.  Using a <width> of 32 would divide the thread
    group in a single segment.  A <width> of 8 would divide the thread group in
    4 segments of size 8.  Using a <width> that is not a power of 2, that is 
    lower than 2 or larger than 32 will return an undefined value.

    Threads can only share data within their own segment.  Each thread
    executing the built-in shuffle function will determine the ID of another
    thread by combining its value of gl_ThreadInWarpNV with its value of
    <index> as described below.  Such threads will attempt to read the value of
    <data> in the computed other thread and return that value to the caller.

    When a shuffle function attempts to access the value of <data> from another
    thread, it determines whether the other thread is in accessible range or
    not.  If it is in range, true will be returned in the optional 
    <threadIdValid> parameter, if provided by the caller.  If it's out of
    range, false will be returned in <threadIdValid>, if provided by the
    caller, and the value returned by the function will come from the current 
    thread.


    The 4 modes use the following logic to compute the source thread index and 
    the <threadIdValid> value:
    
    shuffleNV computes the source index using <index> as an absolute address
    within the thread group segment.
    
        srcThreadId = <index>
        <threadIdValid> = <index> < <width>
    
      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 2

                        -----------------
       src thread Id    |2|2|2|2|2|2|2|2|
                        -----------------
       <threadIdValid>  |1|1|1|1|1|1|1|1|
                        -----------------
       result           |b|b|b|b|b|b|b|b|
                        -----------------

      If <index> is 9

                        -----------------
       src thread Id    |9|9|9|9|9|9|9|9|
                        -----------------
       <threadIdValid>  |0|0|0|0|0|0|0|0|
                        -----------------
       result           |a|b|c|d|e|f|g|h|
                        -----------------       

       
    shuffleUpNV subtracts <index> from the current thread id to get the source
    thread id.  This have the effect of shifting up the segment by <index> 
    threads.  Source thread id do not wrap around, so lower thread id
    will be left unchanged.
    
        srcThreadId = currentThreadId - <index>
        <threadIdValid> = srcThreadId >= 0
    
      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 1

                        ------------------
       src thread Id    |-1|0|1|2|3|4|5|6|
                        ------------------
       <threadIdValid>  |0 |1|1|1|1|1|1|1|
                        ------------------
       result           |a |a|b|c|d|e|f|g|
                        ------------------     

        
    shuffleDownNV adds <index> to the current thread id to get the source 
    thread id.  This have the effect of shifting down the segment by 
    <index> threads. Source thread id do not wrap around, so higher thread id
    will be left unchanged.
    
        srcThreadId = currentThreadId + <index>
        <threadIdValid> = srcThreadId < <width>

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 2

                        -----------------
       src thread Id    |2|3|4|5|6|7|8|9|
                        -----------------
       <threadIdValid>  |1|1|1|1|1|1|0|0|
                        -----------------
       result           |c|d|e|f|g|h|g|h|
                        -----------------  

        
    shuffleXorNv does a bitwise xor between the <index> and the current 
    thread id to get the src thread id:
    
        srcThreadId = currentThreadId ^ <index>
        <threadIdValid> = srcThreadId < <width>

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 0x1

                        -----------------
       src thread Id    |1|0|3|2|5|4|7|6|
                        -----------------
       <threadIdValid>  |1|1|1|1|1|1|1|1|
                        -----------------
       result           |b|a|d|c|f|e|h|g|
                        -----------------  

Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NV_shader_thread_shuffle" is 
    specified in an assembly program, the following edits are made to extend 
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if 
    "OPTION NV_shader_thread_shuffle" is not specified in an assembly program,
    the contents of this dependencies section should be ignored.

    Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

    <VECTORop>              ::= "SHFDOWN"
                              | "SHFIDX"
                              | "SHFUP"
                              | "SHFXOR"


    Modify Section 2.X.4, Program Execution Environment  

    (Add the table entries and relevant text describing the program 
     instructions to exchange data between threads.)
    
      Instr-      Modifiers 
      uction  V  F I C S H D  Out   Inputs    Description
      ------- -- - - - - - -  ---   --------  --------------------------------      
      ...
      SHFDOWN 50 X X - - - - F  v   v,vu,vu   warp shuffle with added index
      SHFIDX  50 X X - - - - F  v   v,vu,vu   warp shuffle with absolute index
      SHFUP   50 X X - - - - F  v   v,vu,vu   warp shuffle with subtracted index
      SHFXOR  50 X X - - - - F  v   v,vu,vu   warp shuffle with XORed index    
      ...


    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension, 
     as extended by NV_gpu_program5)

    + Shader thread shuffle (NV_shader_thread_shuffle)

    If a program specifies the "NV_shader_thread_shuffle" option, it may use
    the "SHFXOR", "SHFDOWN", "SHFIDX" and "SHFUP" instructions.  If this option
    is not specified, a program will fail to compile if it uses those 
    instructions.

    
    Section 2.X.8.Z, SHFDOWN:  warp shuffle with added index

    The SHFDOWN instruction allows a 32-bit scalar value to be exchanged
    between multiple thread within a thread group.  The instruction has 3
    operands as input.  The first operand is a 32-bit scalar.  This value will
    be shared between thread, it can be a float, a signed or an unsigned
    integer.  The second operand is an unsigned integer index in the range 0 to
    31.  It is used to compute from which thread the current thread will read
    the 32-bit scalar value.  For the SHFDOWN instruction this source thread is
    the id of the current thread added with the index operand.

    The last operand is an unsigned integer mask.  The mask is used for 
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and 
    bits 8 to 12 a segmentation mask used to segment the thread group in 
    multiple smaller groups.  Together the clamp value and the segmentation
    mask will generate 2 internal values, the minThreadId and the maxThreadId,
    using the following logic:
    
      minThreadId = current thread id & segmentationMask
      
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFDOWN returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when 
    it's out of bounds.  For SHFDOWN, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source 
    thread.  When the source thread id is out of range, it read the value from
    the current thread.  If the source thread id reference to an inactive
    thread, the returned result will be undefined.

    SHFDOWN supports all data type modifiers.  For floating-point data types, 
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones) 
    and the FALSE value is zero.
    
    
    Section 2.X.8.Z, SHFIDX:  warp shuffle with absolute index

    The SHFIDX instruction allows a 32-bit scalar value to be exchanged between
    multiple thread within a thread group.  The instruction has 3 operands as
    input.  The first operand is a 32-bit scalar.  This value will be shared
    between thread, it can be a float, a signed or an unsigned integer.  The 
    second operand is an unsigned integer index in the range 0 to 31.  It is
    used to compute from which thread the current thread will read the
    32-bit scalar value.  For the SHFIDX instruction, this source thread id is
    computed using the following operation:

      source thread id =( index operand & ~segmentationMask) | minThreadId

    The last operand is an unsigned integer mask.  The mask is used for 
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and 
    bits 8 to 12 a segmentation mask used to segment the thread group in 
    multiple smaller groups.  Together the clamp value and the segmentation
    mask will generate 2 internal values, the minThreadId and the maxThreadId,
    using the following logic:
    
      minThreadId = current thread id & segmentationMask
      
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFIDX returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when 
    it's out of bounds.  For SHFIDX, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source 
    thread. When the source thread id is out of range, it read the value from 
    the current thread.  If the source thread id reference to an inactive 
    thread, the returned result will be undefined.

    SHFIDX supports all data type modifiers.  For floating-point data types, 
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones) 
    and the FALSE value is zero.


    Section 2.X.8.Z, SHFUP:  warp shuffle with subtracted index

    The SHFUP instruction allows a 32-bit scalar value to be exchanged between
    multiple thread within a thread group.  The instruction has 3 operands as
    input.  The first operand is a 32-bit scalar.  This value will be shared 
    between thread, it can be a float, a signed or an unsigned integer.  The 
    second operand is an unsigned integer index in the range 0 to 31.  It is
    used to compute from which thread the current thread will read the 32-bit
    scalar value.  For the SHFUP instruction this source thread is the id of
    the current thread subtracted with the index operand.

    The last operand is an unsigned integer mask.  The mask is used for 
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and 
    bits 8 to 12 a segmentation mask used to segment the thread group in 
    multiple smaller groups.  Together the clamp value and the segmentation
    mask will generate 2 internal values, the minThreadId and the maxThreadId,
    using the following logic:
    
      minThreadId = current thread id & segmentationMask
      
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFUP returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when 
    it's out of bounds.  For SHFUP, the source thread id is in range when it
    is greater than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source 
    thread.  When the source thread id is out of range, it read the value from
    the current thread.  If the source thread id reference to an inactive 
    thread, the returned result will be undefined.

    SHFUP supports all data type modifiers.  For floating-point data types, 
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.

    
    Section 2.X.8.Z, SHFXOR:  warp shuffle with XORed index

    The SHFXOR instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group.  The instruction has 3
    operands as input.  The first operand is a 32-bit scalar.  This value will
    be shared between threads, it can be a float, a signed or an unsigned 
    integer.  The second operand is an unsigned integer index in the range 0 to
    31.  It is used to compute from which thread the current thread will read
    the 32-bit scalar value.  For the SHFXOR instruction this source thread is
    the id of the current thread XORed with the index operand.

    The last operand is an unsigned integer mask.  The mask is used for 
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and 
    bits 8 to 12 a segmentation mask used to segment the thread group in 
    multiple smaller groups.  Together the clamp value and the segmentation
    mask will generate 2 internal values, the minThreadId and the maxThreadId,
    using the following logic:
    
      minThreadId = current thread id & segmentationMask
      
      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFXOR returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when 
    it's out of bounds.  For SHFXOR, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source 
    thread.  When the source thread id is out of range, it read the value from
    the current thread.  If the source thread id reference to an inactive 
    thread, the returned result will be undefined.

    SHFXOR supports all data type modifiers.  For floating-point data types, 
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones) 
    and the FALSE value is zero.
  
Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Issues

    None


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     3     2/14/14  jbreton    Rename the extension from NVX to NV.
     2      9/4/13  jbreton    Replace mask by width in the shuffle functions.
     1    11/27/12  jbreton    Internal revisions.
