*** Research Notes for the Project ***

+---------------------------------------------------------------+
| NOTE: INCLUDES ONLY THE ABSOLUTE BASICS OF A GPU ARCHITECTURE |
+---------------------------------------------------------------+

* GPU Parts:
  1. Compute Cores
  2. Memory Controllers/Partitions
  3. Driver

* High-Level Architecture:
  1. Compute Cores
     + threads grouped into warps (e.g. 32 threads)
     + warps executed separately
     + when a warp's threads branch-diverge -> [Thread Reconvergence]
     + for selecting which warp to execute -> [Warp Scheduling]
     + each warp can execute multiple instructions before switching -> [Instruction Flow]
  2. Memory Controllers
     +

* Thread Reconvergence
  Two Options:

  1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
     + Executes the divergent paths one after another (serially)
     + Has an "Active Mask" -> which threads in the warp should execute the current instruction
     + Implemented via a stack whose rows store | Reconv. PC | Next PC | Active Mask | columns
     + Top of the stack (TOS) -> provides the active mask to be used
     + On branch divergence -> change the next PC of the current TOS to the reconvergence PC ;
       push two new rows, one per branch target
     + TOS should be the branch with the most active threads when diverging
     + Note: seems like some post-compilation analysis might be needed to determine the
       reconvergence PCs (the usual choice is the immediate post-dominator of the branch),
       or maybe it can be done at runtime -- TBD
     + Example (a small code sketch of this stack follows the walkthrough):
       ```
       Assume the following kernel:

       A;
       if (id % 2 == 0) {
           B;
           if (id == 0) {
               C;
           } else {
               D;
           }
           E;
       } else {
           F;
       }
       G;

       Assume 4 threads in a warp (ids from 0 to 3)

       Initially the stack looks like this:

       reconv. PC | next PC | Active Mask
       null       | A       | 1 1 1 1      <- TOS

       When the 1st branch divergence occurs:

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1      <- change A to G since that's where the threads will reconverge
       G          | F       | 0 1 0 1
       G          | B       | 1 0 1 0      <- TOS

       When B finishes executing, another divergence:

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1
       G          | F       | 0 1 0 1
       G          | E       | 1 0 1 0      <- change B to E
       E          | D       | 0 0 1 0
       E          | C       | 1 0 0 0      <- TOS

       After C finishes and the PC reaches E -> pop the stack:

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1
       G          | F       | 0 1 0 1
       G          | E       | 1 0 1 0
       E          | D       | 0 0 1 0      <- TOS

       D finishes; pop the stack (D's PC reaches E):

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1
       G          | F       | 0 1 0 1
       G          | E       | 1 0 1 0      <- TOS

       ... and so on, until the PC reaches G
       ```
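     + Toy code sketch of the stack above (my own sketch, not from the book; names like
       Warp, StackEntry, on_divergence are made up, and PCs are plain ints standing in
       for the labels A..G):
       ```
       // Toy model of the SIMT reconvergence stack (illustration only).
       #include <bitset>
       #include <cstdio>
       #include <utility>
       #include <vector>

       constexpr int WARP_SIZE = 4;
       using Mask = std::bitset<WARP_SIZE>;   // one bit per thread, written like the tables above

       struct StackEntry {
           int  reconv_pc;   // where the diverged paths meet again (-1 = none)
           int  next_pc;     // PC this path runs next
           Mask active;      // which threads execute on this path
       };

       struct Warp {
           std::vector<StackEntry> stack;

           StackEntry& tos() { return stack.back(); }

           // Branch divergence: retarget the current TOS to the reconvergence PC and
           // push one new entry per branch target. The group with more active threads
           // is pushed last so it ends up at the TOS (see the note above).
           void on_divergence(int reconv_pc, int pc_a, Mask m_a, int pc_b, Mask m_b) {
               tos().next_pc = reconv_pc;
               if (m_a.count() > m_b.count()) { std::swap(pc_a, pc_b); std::swap(m_a, m_b); }
               stack.push_back({reconv_pc, pc_a, m_a});
               stack.push_back({reconv_pc, pc_b, m_b});
           }

           // The executing path reached the TOS reconvergence PC: pop that entry and
           // continue at whatever the new TOS wants to run next.
           int on_reconvergence() {
               stack.pop_back();
               return tos().next_pc;
           }

           void dump() const {
               for (const StackEntry& e : stack)
                   std::printf("reconv=%2d  next=%2d  mask=%s\n",
                               e.reconv_pc, e.next_pc, e.active.to_string().c_str());
               std::printf("---\n");
           }
       };

       int main() {
           enum { A, B, C, D, E, F, G };                 // PCs stand in for the kernel labels
           Warp w;
           w.stack.push_back({-1, A, Mask("1111")});     // initially all 4 threads at A
           w.dump();

           // 1st divergence at `id % 2`: odd ids -> F, even ids -> B, reconverge at G.
           w.on_divergence(G, F, Mask("0101"), B, Mask("1010"));
           w.dump();

           // 2nd divergence at `id == 0`: thread 2 -> D, thread 0 -> C, reconverge at E.
           w.on_divergence(E, D, Mask("0010"), C, Mask("1000"));
           w.dump();

           // C's path reaches E -> pop; the new TOS says D runs next.
           std::printf("next PC after pop: %d\n", w.on_reconvergence());
           w.dump();
           return 0;
       }
       ```
       Pushing the larger group last is just one way to honor the "most active threads
       at the TOS" note; with equal-sized groups (as in this example) the push order
       does not matter.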
  2) [SIMT] Stack-less: (harder to implement ; better performance ; no deadlocks once yielding is added)
     + Add a barrier participation mask per barrier
     + Thread active bits -> track whether each thread is currently being executed or not
       => gives the groups of threads the scheduler can pick between (threads grouped by having the same PC)
     + Thread states -> ready to execute or blocked (no yield state yet, so the deadlock-avoidance part is not covered here)
     + When an "add_barrier [barrier]" instruction is reached -> create a new barrier participation mask
       that includes all the currently active threads
     + When a "wait [barrier]" instruction is reached -> execute NOPs until all threads in the barrier
       have reached this point as well (tracked using a barrier state field)
     + Example:
       ```
       Assume the following kernel:

       A;
       add_barrier b0;
       if (id % 2 == 0) {
           B;
       } else {
           C;
           add_barrier b1;
           if (id == 1) {
               D;
           } else {
               E;
           }
           wait b1;
           F;
       }
       wait b0;
       G;

       Imagine 8 threads per warp

       During A: all threads are active:

       0 1 2 3 4 5 6 7
       ---------------
       1 1 1 1 1 1 1 1  <- active threads

       Adding barrier b0 creates a new barrier participation mask:

       0 1 2 3 4 5 6 7
       ---------------
       1 1 1 1 1 1 1 1  <- b0 barrier mask
       0 0 0 0 0 0 0 0  <- b0 barrier state [initial]

       Half of the threads go to B and the other half to C => two thread groups:

       0 1 2 3 4 5 6 7
       ---------------
       1 0 1 0 1 0 1 0  <- warp split into two groups (1 = even ids at B, 0 = odd ids at C)

       Execute one thread group at a time

       Adding the new barrier b1: the active threads are the odd-id half executing C:

       0 1 2 3 4 5 6 7
       ---------------
       0 1 0 1 0 1 0 1  <- b1 barrier mask
       0 0 0 0 0 0 0 0  <- b1 barrier state [initial]

       Thread 1 jumps to D and the other odd-id threads jump to E
       => now there are three thread groups to schedule and execute

       Reaching `wait b1`:
       Each thread that reaches it sets its bit in the b1 barrier state and becomes blocked (not executable)
       => so when executing a thread group, the blocked threads get "ignored"

       When the b1 barrier state == the b1 barrier mask,
       make those threads executable again
       => two thread groups remain (since the threads are now at two different PCs)

       Similar when waiting on the b0 barrier
       ```

* Warp Scheduling
  + Round-Robin (RR) scheduling
    > same # of instructions executed per warp => all warps get equal attention
  + Greedy-then-Oldest (GTO) scheduling
    > keep issuing from the same warp until it stalls -> then switch to the oldest ready warp
  + Choose whichever one for now, upgrade later if needed
    ------ Ex: execute a warp's instructions until an async mem request is issued (w/ a cache miss)

* Instruction Flow
  + Each warp has access to an "Instruction Buffer" (I-Buf)
  + Each warp has a "Scoreboard" assigned to it with 3 or 4 entries, each entry holding the id of a register
    Ex: containing 3 entries R1, R5, R8
  + Instructions get fetched from the Instruction Cache (I-Cache) into the I-Buf
  + When an instruction is issued, its destination register id gets added to the scoreboard
    (that is how entries end up there in the first place)
  + If any of the source or destination registers of a fetched instruction matches a scoreboard entry,
    then hold the instruction inside the I-Buf but do not execute it yet
  + When the in-flight instruction finishes executing, it clears its destination register(s) from the
    scoreboard, making some instructions in the buffer eligible for execution
  + When the I-Buf is full, stall / drop the fetched instruction
  + (a toy code sketch of the scoreboard + I-Buf is at the very end, after the references)

===================================
References:
  + Main one as of now:
    General-Purpose Graphics Processor Architecture (2018) book
    https://link.springer.com/book/10.1007/978-3-031-01759-9
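===================================
Extra: toy code sketch (my own, not from the book) of the per-warp scoreboard + I-Buf
idea from the Instruction Flow section. Names (Scoreboard, Instr, try_issue, ...) are
made up; issue is kept in order per warp, and a real design would switch to another
warp instead of just printing "hold".
```
// Toy per-warp scoreboard + instruction buffer (illustration only).
#include <algorithm>
#include <cstdio>
#include <deque>
#include <vector>

struct Instr {
    const char*      name;
    int              dst;    // destination register id
    std::vector<int> srcs;   // source register ids
};

struct Scoreboard {
    static constexpr int kMaxEntries = 3;   // the notes above say 3 or 4 entries per warp
    std::vector<int> pending;               // register ids with writes still in flight

    bool full() const { return (int)pending.size() >= kMaxEntries; }
    bool blocks(const Instr& i) const {     // any src/dst register matching a pending one?
        for (int r : pending) {
            if (r == i.dst) return true;
            for (int s : i.srcs) if (s == r) return true;
        }
        return false;
    }
    void reserve(int reg) { pending.push_back(reg); }   // on issue
    void release(int reg) {                             // on writeback
        pending.erase(std::remove(pending.begin(), pending.end(), reg), pending.end());
    }
};

int main() {
    Scoreboard sb;
    std::deque<Instr> ibuf = {              // pretend these were already fetched from the I-Cache
        {"LD  R1, [addr]", 1, {}},
        {"ADD R5, R1, R2", 5, {1, 2}},      // depends on the load into R1
        {"MUL R8, R3, R4", 8, {3, 4}},      // independent of the load
    };

    // Try to issue the oldest buffered instruction; if the scoreboard blocks it,
    // it stays in the I-Buf (a real scheduler would pick another warp meanwhile).
    auto try_issue = [&]() {
        if (ibuf.empty()) return;
        const Instr& i = ibuf.front();
        if (sb.full() || sb.blocks(i)) {
            std::printf("hold : %s\n", i.name);
            return;
        }
        std::printf("issue: %s\n", i.name);
        sb.reserve(i.dst);                  // dst stays pending until the writeback
        ibuf.pop_front();
    };

    try_issue();     // LD issues; R1 becomes pending in the scoreboard
    try_issue();     // ADD reads R1 -> held in the I-Buf
    sb.release(1);   // the load writes back; R1 gets cleared from the scoreboard
    try_issue();     // ADD is eligible now and issues
    try_issue();     // MUL was never blocked; issues as well
    return 0;
}
```
Running it shows the ADD being held in the buffer until the load's R1 entry is
released from the scoreboard, while the independent MUL is never blocked.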