@@ -962,6 +962,104 @@ The ``griddepcontrol`` intrinsics allows the dependent grids and prerequisite gr
962
962
For more information, refer
963
963
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-griddepcontrol >`__.
964
964
965
+ TCGEN05 family of Intrinsics
966
+ ----------------------------
967
+
968
+ The llvm.nvvm.tcgen05.* intrinsics model the TCGEN05 family of instructions
969
+ exposed by PTX. These intrinsics use 'Tensor Memory' (henceforth ``tmem ``).
970
+ NVPTX represents this memory using ``addrspace(6) `` and is always 32-bits.
971
+
972
+ For more information, refer to the PTX ISA
973
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory >`_.
974
+
975
+ The tensor-memory pointers may only be used with the tcgen05 intrinsics.
976
+ There are specialized load/store instructions provided (tcgen05.ld/st) to
977
+ work with tensor-memory.
978
+
979
+ See the PTX ISA for more information on tensor-memory load/store instructions
980
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-and-register-load-store-instructions >`_.
981
+
982
+ '``llvm.nvvm.tcgen05.alloc ``'
983
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
984
+
985
+ Syntax:
986
+ """""""
987
+
988
+ .. code-block :: llvm
989
+
990
+ declare void @llvm.nvvm.tcgen05.alloc.cg1(ptr %dst, i32 %ncols)
991
+ declare void @llvm.nvvm.tcgen05.alloc.cg2(ptr %dst, i32 %ncols)
992
+ declare void @llvm.nvvm.tcgen05.alloc.shared.cg1(ptr addrspace(3) %dst, i32 %ncols)
993
+ declare void @llvm.nvvm.tcgen05.alloc.shared.cg2(ptr addrspace(3) %dst, i32 %ncols)
994
+
995
+ Overview:
996
+ """""""""
997
+
998
+ The '``@llvm.nvvm.tcgen05.alloc.* ``' intrinsics correspond to the
999
+ ``tcgen05.alloc.cta_group*.sync.aligned.b32 `` family of PTX instructions.
1000
+ The ``tcgen05.alloc `` is a potentially blocking instruction which dynamically
1001
+ allocates the specified number of columns in the Tensor Memory and writes
1002
+ the address of the allocated Tensor Memory into shared memory at the
1003
+ location specified by ``%dst ``. The 32-bit operand ``%ncols `` specifies
1004
+ the number of columns to be allocated and it must be a power-of-two.
1005
+ The ``.shared `` variant explicitly uses shared memory address space for
1006
+ the ``%dst `` operand. The ``.cg1 `` and ``.cg2 `` variants generate
1007
+ ``cta_group::1 `` and ``cta_group::2 `` variants of the instruction respectively.
1008
+
1009
+ For more information, refer to the PTX ISA
1010
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions >`_.
1011
+
1012
+ '``llvm.nvvm.tcgen05.dealloc ``'
1013
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1014
+
1015
+ Syntax:
1016
+ """""""
1017
+
1018
+ .. code-block :: llvm
1019
+
1020
+ declare void @llvm.nvvm.tcgen05.dealloc.cg1(ptr addrspace(6) %tmem_addr, i32 %ncols)
1021
+ declare void @llvm.nvvm.tcgen05.dealloc.cg2(ptr addrspace(6) %tmem_addr, i32 %ncols)
1022
+
1023
+ Overview:
1024
+ """""""""
1025
+
1026
+ The '``@llvm.nvvm.tcgen05.dealloc.* ``' intrinsics correspond to the
1027
+ ``tcgen05.dealloc.* `` set of PTX instructions. The ``tcgen05.dealloc ``
1028
+ instructions deallocates the Tensor Memory specified by the Tensor Memory
1029
+ address ``%tmem_addr ``. The operand ``%tmem_addr `` must point to a previous
1030
+ Tensor Memory allocation. The 32-bit operand ``%ncols `` specifies the number
1031
+ of columns to be de-allocated. The ``.cg1 `` and ``.cg2 `` variants generate
1032
+ ``cta_group::1 `` and ``cta_group::2 `` variants of the instruction respectively.
1033
+
1034
+ For more information, refer to the PTX ISA
1035
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions >`_.
1036
+
1037
+ '``llvm.nvvm.tcgen05.relinq.alloc.permit ``'
1038
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1039
+
1040
+ Syntax:
1041
+ """""""
1042
+
1043
+ .. code-block :: llvm
1044
+
1045
+ declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg1()
1046
+ declare void @llvm.nvvm.tcgen05.relinq.alloc.permit.cg2()
1047
+
1048
+ Overview:
1049
+ """""""""
1050
+
1051
+ The '``@llvm.nvvm.tcgen05.relinq.alloc.permit.* ``' intrinsics correspond
1052
+ to the ``tcgen05.relinquish_alloc_permit.* `` set of PTX instructions.
1053
+ This instruction specifies that the CTA of the executing thread is
1054
+ relinquishing the right to allocate Tensor Memory. So, it is illegal
1055
+ for a CTA to perform ``tcgen05.alloc `` after any of its constituent
1056
+ threads execute ``tcgen05.relinquish_alloc_permit ``. The ``.cg1 ``
1057
+ and ``.cg2 `` variants generate ``cta_group::1 `` and ``cta_group::2 ``
1058
+ flavors of the instruction respectively.
1059
+
1060
+ For more information, refer to the PTX ISA
1061
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-allocation-and-management-instructions >`_.
1062
+
965
1063
Other Intrinsics
966
1064
----------------
967
1065
0 commit comments