Vulkan

when we first start working with gpus, we quickly become concerned with
    how do i get some code to run on it?
    how is this code interacting with the gpu?

suppose we want to execute some code path on the gpu
    we refer to this action as an invocation
    this invocation may not necessarily be independent
    this invocation may be mapped to multiple lanes
    it may not guarantee forward progress at every time step

we must be aware of these possible constraints before making assumptions about its behavior

let's consider the monolithic gpu

┌───────────┐                  
│Scalar Unit│ for control flow, pointer/scalar arithmetic, shared operations, etc.
└───────────┘                  
┌───────────┐
│Scalar Reg.│ register file for the scalar unit, 12KB
└───────────┘
we can consider the scalar unit and scalar unit register file as a shared resource within a compute unit

┌───────────┐
│   SIMDx   │ vector processor with 16 Lanes
└───────────┘
┌───────────┐
│Vector Reg.│ vector register file, 64KB :: 256 registers, each 64 lanes x 4 bytes
└───────────┘
a single simd processor in a compute unit has 16 lanes, 
-- each of these lanes executes a synchronized vector operation (say a multiply-accumulate) on some 32 bit value
each lane processes one work-item at a time, thus, 1 simd unit in a compute unit operates on 16 work-items per cycle
we call a collection of 64 work items a wavefront
we consider the wavefront as the atomic scheduling unit
each SIMD unit may buffer instructions for 10 wavefronts -- more on this later
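
the register file and the 10 wavefront buffer interact: a rough sketch of that arithmetic is below
it ASSUMES that resident wavefronts on one simd unit split the 256 vector registers between them,
and it ignores allocation granularity, scalar registers and local data share usage
the function names are made up for these notes, they are not part of any api

```c
#include <assert.h>
#include <stdint.h>

/* figures taken from the notes above */
#define WAVEFRONT_SIZE       64u  /* work-items per wavefront */
#define BYTES_PER_VALUE       4u  /* one 32 bit value per work-item */
#define VGPRS_PER_WORK_ITEM 256u  /* vector registers a work-item can address */
#define MAX_WAVEFRONTS       10u  /* wavefronts one simd unit can buffer */

/* size of one vector register file in bytes: 256 * 64 * 4 */
uint32_t vector_register_file_bytes(void) {
    return VGPRS_PER_WORK_ITEM * WAVEFRONT_SIZE * BYTES_PER_VALUE;
}

/* ASSUMPTION: wavefronts sharing a simd unit split its 256 registers,
 * so a kernel that uses more registers leaves room for fewer wavefronts.
 * the result is capped by the 10 wavefront instruction buffer. */
uint32_t resident_wavefronts(uint32_t vgprs_used) {
    assert(vgprs_used > 0 && vgprs_used <= VGPRS_PER_WORK_ITEM);
    uint32_t by_registers = VGPRS_PER_WORK_ITEM / vgprs_used;
    return by_registers < MAX_WAVEFRONTS ? by_registers : MAX_WAVEFRONTS;
}
```

under these assumptions a kernel using 24 registers still fits all 10 wavefronts, one using 128 fits only 2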

┌───────────┐
│ L1 Cache  │ L1 vector data cache, ~16KB
└───────────┘
┌─────────────────────────────────────────────────────┐
│                 Local Data Share                    │
└─────────────────────────────────────────────────────┘
Shared local data store, 32 banks with conflict resolution, totaling 64KB

┌───────────────────────────────────────────────────────────────────┐
│                            Scheduler                              │
└───────────────────────────────────────────────────────────────────┘
buffers up to 40 wavefronts == 2560 work-items

-- we can imagine our gpu dispatching 1 wavefront to 1 compute unit which is pinned on 1 simd unit --

it is NOT the case that we map a wavefront's 64 work-items across 4 simd units,
a wavefront is only mapped to ONE simd unit
thus, it takes a 16-lane simd unit 4 cycles to execute one vector instruction for a whole wavefront (64 / 16)
only in aggregate can we consider the compute unit operating at 64 FP32 ops/cycle
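
the numbers above can be checked with a small sketch
it ASSUMES that consecutive work-items of a wavefront fill the 16 lanes in order, pass after pass;
the real lane ordering is a hardware detail these notes do not cover
the struct and function names are made up for illustration

```c
#include <assert.h>
#include <stdint.h>

/* figures taken from the notes above */
#define WAVEFRONT_SIZE      64u  /* work-items per wavefront */
#define SIMD_LANES          16u  /* lanes per simd unit */
#define SIMDS_PER_CU         4u  /* simd units per compute unit */
#define WAVEFRONTS_PER_SIMD 10u  /* wavefronts buffered per simd unit */

typedef struct {
    uint32_t wavefront; /* wavefront the work-item belongs to */
    uint32_t pass;      /* which of the 4 passes over the simd unit (0..3) */
    uint32_t lane;      /* physical lane used in that pass (0..15) */
} work_item_slot;

/* ASSUMPTION: work-items are laid out linearly over passes and lanes */
work_item_slot locate_work_item(uint32_t flat_index) {
    uint32_t within = flat_index % WAVEFRONT_SIZE;
    work_item_slot slot;
    slot.wavefront = flat_index / WAVEFRONT_SIZE;
    slot.pass = within / SIMD_LANES;
    slot.lane = within % SIMD_LANES;
    return slot;
}

/* cycles for one simd unit to run one vector instruction of a wavefront */
uint32_t cycles_per_vector_instruction(void) {
    assert(WAVEFRONT_SIZE % SIMD_LANES == 0);
    return WAVEFRONT_SIZE / SIMD_LANES;
}

/* 32 bit lane operations per cycle over the whole compute unit */
uint32_t cu_lane_ops_per_cycle(void) {
    return SIMDS_PER_CU * SIMD_LANES;
}

/* work-items the scheduler can keep buffered on one compute unit */
uint32_t cu_max_work_items(void) {
    return SIMDS_PER_CU * WAVEFRONTS_PER_SIMD * WAVEFRONT_SIZE;
}
```

for example, work-item 100 lands in wavefront 1, pass 2, lane 4 under this layout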

thus we have the Compute Unit
┌───────────────────────────────────────────────────────────────────┐
│                           Scheduler                               │
└───────────────────────────────────────────────────────────────────┘
┌───────────┐ ┌─────────────────────────────────────────────────────┐
│ L1 Cache  │ │                 Local Data Share                    │
└───────────┘ └─────────────────────────────────────────────────────┘
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│Scalar Unit│ │   SIMD0   │ │   SIMD1   │ │   SIMD2   │ │   SIMD3   │
└───────────┘ └───────────┘ └───────────┘ └───────────┘ └───────────┘
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│Scalar Reg.│ │Vector Reg.│ │Vector Reg.│ │Vector Reg.│ │Vector Reg.│
└───────────┘ └───────────┘ └───────────┘ └───────────┘ └───────────┘

*we will assume userspace, and use an existing API*

When we start writing some Vulkan code,
The first thing that we do is make a couple of requests to our operating system                                   
We go ahead and request a surface that we will render on                                                          
 This varies by OS, but is encapsulated in a VkSurfaceKHR
 ┌──                       ──┐     ┌────────────┐
 │XCB, wl_surface, HWND, view├────►│VkSurfaceKHR│
 └──                       ──┘     └────────────┘
From the VkSurfaceKHR that we receive, we can query
┌────────────────────┐                                                                                            
│Surface Capabilities│ - Image Count, Image Extent (x,y), transformations, ...                                    
└────────────────────┘                                                                                            
┌───────────────┐                                                                                                 
│Surface Formats│      - Color ordering and Range                                                                 
└───────────────┘                                                                                                 
┌─────────────┐                                                                                                   
│Present Modes│        - Immediate presentation, Mailbox, FIFO, ...                                               
└─────────────┘                                                                                                   
Although not all of this is immediately useful, and these details are not our focus here,
 we will keep them in the back of our mind until later
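
As a rough sketch of how these queried values tend to be used, here is one possible selection policy
It prefers Mailbox and falls back to FIFO, and asks for one image more than the minimum
The enum is a stand-in for VkPresentModeKHR (its numeric values mirror the Vulkan headers),
 and the function names are made up for these notes
In real code the inputs would come from vkGetPhysicalDeviceSurfacePresentModesKHR
 and vkGetPhysicalDeviceSurfaceCapabilitiesKHR

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* stand-in for VkPresentModeKHR, values mirror the Vulkan headers */
typedef enum {
    PRESENT_MODE_IMMEDIATE    = 0,
    PRESENT_MODE_MAILBOX      = 1,
    PRESENT_MODE_FIFO         = 2,
    PRESENT_MODE_FIFO_RELAXED = 3
} present_mode;

/* policy (an assumption, not a rule): take Mailbox when offered,
 * otherwise FIFO, which the Vulkan spec requires every surface to support */
present_mode choose_present_mode(const present_mode *modes, size_t count) {
    assert(modes != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (modes[i] == PRESENT_MODE_MAILBOX) {
            return PRESENT_MODE_MAILBOX;
        }
    }
    return PRESENT_MODE_FIFO;
}

/* ask for one image above the minimum, clamped to the maximum;
 * a maximum of 0 means the surface reports no upper limit */
uint32_t choose_image_count(uint32_t min_count, uint32_t max_count) {
    uint32_t wanted = min_count + 1;
    if (max_count != 0 && wanted > max_count) {
        wanted = max_count;
    }
    return wanted;
}
```

Other policies are just as valid, for instance always taking FIFO to limit power use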
                                                                                                                  
We then need to query the graphics hardware present on the system
On Linux, this may be through a Mesa3D driver (RADV for AMD, NVK for NVIDIA on top of the nouveau kernel driver), or a vendor's proprietary driver
This provides us context for the device that we are running on
When making this request, we also make sure that any extra features we might need are supported
These are specified through Vulkan Extensions and Vulkan Layers                                                   
                                                                                                                  
This is all encapsulated in a VkInstance, which exposes the physical devices on the system
                                                 ┌─────────────────┐                                              
┌───────────────────────────┐                 ┌─►│Physical Device 0│                                              
│VK_LAYER_KHRONOS_validation│                 │  └─────────────────┘                                              
│VK_KHR_surface             │    ┌──────────┐ │  ┌─────────────────┐                                              
│VK_KHR_xcb_surface         ├───►│VkInstance├─┼─►│Physical Device 1│                                              
│VK_EXT_debug_utils         │    └──────────┘ │  └─────────────────┘                                              
│...                        │                 │  ┌─────────────────┐                                              
└───────────────────────────┘                 └─►│      ....       │                                              
                                                 └─────────────────┘                                              
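
Before creating the instance, it is worth checking that every name we plan to request is actually available
The sketch below does that comparison on plain strings; the function names are made up for these notes
In real code the available names would come from vkEnumerateInstanceExtensionProperties (extensionName)
 and vkEnumerateInstanceLayerProperties (layerName), checked as two separate lists

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* true when `wanted` appears in the list of available names */
bool has_name(const char *const *available, size_t count, const char *wanted) {
    assert(available != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(available[i], wanted) == 0) {
            return true;
        }
    }
    return false;
}

/* true only when every required name is available */
bool all_supported(const char *const *available, size_t n_available,
                   const char *const *required, size_t n_required) {
    assert(required != NULL || n_required == 0);
    for (size_t i = 0; i < n_required; ++i) {
        if (!has_name(available, n_available, required[i])) {
            return false;
        }
    }
    return true;
}
```

If the check fails we can report the missing name up front, instead of finding out when instance creation fails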
┌───────────────┐                                                                                                 
│Physical Device│ A physical device exposes various interfaces                                                    
└┬──────────────┘ These interfaces can be used to understand the capabilities of the hardware available,
 │                  and decide which device to move forward with
 │                                                                                                                
 │ ┌──────────────────────────┐                                                                                   
 ├►│Physical Device Properties│ - Exposes the device Name, Type, ID, Max Resource Limits, ...
 │ └──────────────────────────┘                                                                                   
 │ ┌────────────────────────┐                                                                                     
 ├►│Physical Device Features│   - Exposes features such as Tessellation Shaders, Multiple Viewports, Sparse Memory ...
 │ └────────────────────────┘                                                                                     
 │ ┌───────────────────────┐                                                                                      
 └►│Queue Family Properties│    - Exposes the queue families: supported operations (graphics, compute, transfer, ...) and queue count
   └───────────────────────┘                                                                                      
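
One way to turn the properties and features into a decision is to score each device
The sketch below is a made-up policy: devices missing a required feature are rejected,
 then discrete GPUs are preferred over integrated ones, and those over everything else
The enum is a stand-in for VkPhysicalDeviceType (values mirror the Vulkan headers);
 the struct, its has_required_features field and the function names are hypothetical
In real code the inputs would come from vkGetPhysicalDeviceProperties and vkGetPhysicalDeviceFeatures

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* stand-in for VkPhysicalDeviceType, values mirror the Vulkan headers */
typedef enum {
    DEVICE_TYPE_OTHER          = 0,
    DEVICE_TYPE_INTEGRATED_GPU = 1,
    DEVICE_TYPE_DISCRETE_GPU   = 2,
    DEVICE_TYPE_VIRTUAL_GPU    = 3,
    DEVICE_TYPE_CPU            = 4
} device_type;

/* hypothetical summary of what we learned about one physical device */
typedef struct {
    const char *name;
    device_type type;
    bool has_required_features; /* result of our own feature check */
} device_info;

/* 0 means unusable, higher is better */
int score_device(const device_info *device) {
    if (!device->has_required_features) {
        return 0;
    }
    switch (device->type) {
    case DEVICE_TYPE_DISCRETE_GPU:   return 3;
    case DEVICE_TYPE_INTEGRATED_GPU: return 2;
    default:                         return 1;
    }
}

/* index of the best scoring device, or -1 when none is usable */
int pick_device(const device_info *devices, size_t count) {
    assert(devices != NULL || count == 0);
    int best = -1;
    int best_score = 0;
    for (size_t i = 0; i < count; ++i) {
        int score = score_device(&devices[i]);
        if (score > best_score) {
            best_score = score;
            best = (int)i;
        }
    }
    return best;
}
```

Ties go to the device that was enumerated first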
                                                                                                                  
VkQueues are the interface that we will be using to submit work to the GPU and present images to the surface
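
To get a queue we first pick a queue family that supports the operations we need
The sketch below searches for the first family whose flags contain every required bit
The struct is a stand-in for VkQueueFamilyProperties and the bit values mirror VkQueueFlagBits;
 the function name is made up for these notes
In real code the list would come from vkGetPhysicalDeviceQueueFamilyProperties,
 and presentation support is a separate per-family query (vkGetPhysicalDeviceSurfaceSupportKHR)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* bit values mirror VkQueueFlagBits in the Vulkan headers */
#define QUEUE_GRAPHICS_BIT 0x1u
#define QUEUE_COMPUTE_BIT  0x2u
#define QUEUE_TRANSFER_BIT 0x4u

/* stand-in for the two VkQueueFamilyProperties fields used here */
typedef struct {
    uint32_t queue_flags; /* operations the family supports */
    uint32_t queue_count; /* number of queues in the family */
} queue_family;

/* index of the first family that has at least one queue and supports
 * every bit in required_flags, or -1 when there is none */
int find_queue_family(const queue_family *families, size_t count,
                      uint32_t required_flags) {
    assert(families != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        uint32_t supported = families[i].queue_flags & required_flags;
        if (families[i].queue_count > 0 && supported == required_flags) {
            return (int)i;
        }
    }
    return -1;
}
```

Calling it with QUEUE_GRAPHICS_BIT gives the family to render with; a dedicated transfer or compute family can be found the same way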