author    Davit Grigoryan <[email protected]>  2024-09-26 21:48:12 -0700
committer Davit Grigoryan <[email protected]>  2024-09-26 21:48:12 -0700
commit    6daa51bb42c6f26a8159a1b5dfd2cdca820e24de (patch)
tree      08a04a67563ba9dbc9f34bae285816464f0b0028 /spec/notes.txt
parent    9c727894188c93b3e6292f97064ae30d95af2848 (diff)
add notes for thread divergence, warp scheduling, instruction flow
Diffstat (limited to 'spec/notes.txt')
-rw-r--r--  spec/notes.txt | 201
 1 file changed, 201 insertions(+), 0 deletions(-)
diff --git a/spec/notes.txt b/spec/notes.txt
index 6c5043d..c7e10ca 100644
--- a/spec/notes.txt
+++ b/spec/notes.txt
@@ -1 +1,202 @@
 *** Research Notes for the Project ***
+
+
++---------------------------------------------------------------+
+| NOTE: INCLUDES ONLY THE ABSOLUTE BASICS OF A GPU ARCHITECTURE |
++---------------------------------------------------------------+
+
+
+* GPU Parts:
+  1. Compute Cores
+  2. Memory Controllers/Partitions
+  3. Driver
+
+* High-Level Architecture:
+  1. Compute Cores
+   + threads grouped into warps (e.g. 32 threads)
+   + warps executed separately
+   + on branch divergence among a warp's threads -> [Thread Reconvergence]
+   + for selecting which warp to execute -> [Warp Scheduling]
+   + each warp can execute multiple instructions before switching -> [Instruction Flow]
+  2. Memory Controllers
+
+
+* Thread Reconvergence
+  Two options:
+  1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
+   + runs through all the instructions serially
+   + has an "Active Mask" -> which threads in the warp should execute the instruction
+   + implemented via a stack that stores | Reconv. PC | Next PC | Active Mask | columns
+   + top of the stack (TOS) -> the active mask to be used
+   + on branch divergence -> change the next PC of the current TOS ; push two new rows for the new branches
+   + TOS should be the branch with the most active threads when diverging
+   + Note: seems like this might need some post-compilation inference work ??
+     (for determining return / reconv. PCs, idk yet, or maybe do it at runtime)
+
+   Example:
+   ```
+   Assume the following kernel:
+
+   A;
+   if (id % 2 == 0) {
+     B;
+     if (id == 0) {
+       C;
+     } else {
+       D;
+     }
+     E;
+   } else {
+     F;
+   }
+   G;
+
+   Assume 4 threads in a warp (ids from 0 to 3)
+
+   Initially the stack looks like this:
+   reconv. PC | next PC | Active Mask
+   null       | A       | 1 1 1 1     <- TOS
+
+   When the 1st branch divergence occurs:
+   reconv. PC | next PC | Active Mask
+   null       | G       | 1 1 1 1     <- change A to G since that's where threads will reconv.
+   G          | F       | 0 1 0 1
+   G          | B       | 1 0 1 0     <- TOS
+
+   When B finishes executing, another divergence:
+   reconv. PC | next PC | Active Mask
+   null       | G       | 1 1 1 1
+   G          | F       | 0 1 0 1
+   G          | E       | 1 0 1 0     <- change B to E
+   E          | D       | 0 0 1 0
+   E          | C       | 1 0 0 0     <- TOS
+
+   After C finishes and its PC reaches E -> pop the stack:
+   reconv. PC | next PC | Active Mask
+   null       | G       | 1 1 1 1
+   G          | F       | 0 1 0 1
+   G          | E       | 1 0 1 0
+   E          | D       | 0 0 1 0     <- TOS
+
+   D finishes; pop the stack (D's PC reaches E):
+   reconv. PC | next PC | Active Mask
+   null       | G       | 1 1 1 1
+   G          | F       | 0 1 0 1
+   G          | E       | 1 0 1 0     <- TOS
+
+   ...
+
+   and so on, until reaching PC G
+   ```
+
+  2) [SIMT] Stack-less: (harder to implement ; better performance ; no deadlocks w/ yielding)
+   + add a barrier participation mask
+   + thread active bits -> track whether each thread is being executed or not
+     => gives the scheduler groups of threads to decide between for execution
+     (threads grouped by same PC)
+   + thread states -> ready to execute or blocked (no yielding yet for deadlock avoidance)
+   + when reaching an "add_barrier [barrier]" instruction ->
+     -> create a new barrier mask that includes all the currently active threads
+   + when reaching "wait [barrier]" ->
+     -> execute NOPs until all threads in the barrier reach this point as well
+     (tracked using a barrier state field)
+
+   Example:
+   ```
+   Assume the following kernel:
+
+   A;
+   add_barrier b0;
+   if (id % 2 == 0) {
+     B;
+   } else {
+     C;
+     add_barrier b1;
+     if (id == 1) {
+       D;
+     } else {
+       E;
+     }
+     wait b1;
+     F;
+   }
+   wait b0;
+   G;
+
+   Assume 8 threads per warp
+
+   During A:
+   All threads are active:
+
+   0 1 2 3 4 5 6 7
+   ---------------
+   1 1 1 1 1 1 1 1  <- active threads
+
+   Adding barrier b0:
+   Create a new barrier participation mask:
+
+   0 1 2 3 4 5 6 7
+   ---------------
+   1 1 1 1 1 1 1 1  <- b0 barrier mask
+   0 0 0 0 0 0 0 0  <- b0 barrier state [initial]
+
+   Half of the threads go to B and the other half to C => two thread groups
+
+   0 1 2 3 4 5 6 7
+   ---------------
+   1 0 1 0 1 0 1 0  <- warp split (two groups)
+
+   Execute one thread group at a time
+
+   Adding new barrier b1:
+   Active threads are the half executing C, with odd thread ids:
+
+   0 1 2 3 4 5 6 7
+   ---------------
+   0 1 0 1 0 1 0 1  <- b1 barrier mask
+   0 0 0 0 0 0 0 0  <- b1 barrier state [initial]
+
+   One thread jumps to D and the other odd-id threads jump to E
+   => now there are three thread groups to schedule and execute
+
+   Reaching `wait b1`:
+   Threads that reach this get the state "blocked" (not executable) =>
+   => thus, when executing a thread group, the blocked threads get "ignored"
+
+   When the b1 barrier state == the b1 barrier mask,
+   make all of its threads executable again => two thread groups remain
+   (since there are two different PCs for threads to execute at)
+
+   Similar when waiting for the b0 barrier
+   ```
+
+
+* Warp Scheduling
+ + Round-Robin (RR) scheduling
+   > same # of instructions executed per warp => all warps get equal attention
+ + Greedy-then-Oldest (GTO) scheduling
+   > when a warp stalls -> offload it and load in the oldest warp
+ + Choose either one for now, upgrade later if needed
+   ------
+   Ex: execute a warp's instructions until an async mem request is issued (w/ a cache miss)
+
+
+* Instruction Flow
+ + each warp has access to an "Instruction Buffer" (I-Buf)
+ + each warp has a "Scoreboard" assigned to it
+   with 3 or 4 entries, each entry holding the id of a register
+   Ex: containing 3 entries R1, R5, R8
+ + instructions get fetched from the Instruction Cache (I-Cache)
+ + if any of the source or destination registers of the instruction
+   matches one of the scoreboard entries, then hold the instr.
+   inside the buf, but do not yet execute it
+ + when the current instruction finishes executing, it clears some of the
+   registers in the scoreboard, making some instructions in the buffer
+   eligible for execution
+ + when the I-Buf is full, stall / drop the fetched instruction
+
+
+===================================
+
+References:
+ + Main one as of now: General-Purpose Graphics Processor Architecture (2018) book
+   https://link.springer.com/book/10.1007/978-3-031-01759-9
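
The stack-based walk-through in the notes can be sketched as a toy simulator. This is an illustrative model, not the book's implementation: the dict-based CFG encoding, the function name `run_kernel`, and the mask-as-tuple representation are my own assumptions; the block labels, branch predicates, and 4-thread masks match the example kernel above.

```python
# Toy model of stack-based SIMT reconvergence for the example kernel.
# Each stack entry is [reconv PC, next PC, active mask]; thread 0 is the
# leftmost element of the mask, matching the tables in the notes.

def run_kernel(num_threads=4):
    # Fall-through successors for the non-branching blocks.
    succ = {"C": "E", "D": "E", "E": "G", "F": "G", "G": None}
    # Branch blocks: (per-thread predicate, taken PC, not-taken PC, reconv PC).
    branches = {
        "A": (lambda i: i % 2 == 0, "B", "F", "G"),
        "B": (lambda i: i == 0, "C", "D", "E"),
    }
    stack = [[None, "A", (1,) * num_threads]]
    trace = []                            # (block, active mask) in issue order
    while stack:
        top = stack[-1]
        pc, mask = top[1], top[2]
        trace.append((pc, mask))
        if pc in branches:                # divergence: rewrite TOS, push 2 rows
            pred, t_pc, nt_pc, reconv = branches[pc]
            taken = tuple(m if pred(i) else 0 for i, m in enumerate(mask))
            not_taken = tuple(m - t for m, t in zip(mask, taken))
            top[1] = reconv               # TOS next PC -> reconvergence point
            stack.append([reconv, nt_pc, not_taken])
            stack.append([reconv, t_pc, taken])
        elif succ[pc] == top[0]:          # reached this entry's reconv PC -> pop
            stack.pop()
        else:
            top[1] = succ[pc]
    return trace
```

Running it reproduces the execution order A, B, C, D, E, F, G with the same masks as the tables (e.g. the first trace entry is `("A", (1, 1, 1, 1))`).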
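
The barrier bookkeeping of the stack-less scheme can likewise be sketched. This is a simplified software model under my own naming (`Barrier`, `wait`), not hardware detail from the book: each barrier holds a participation mask fixed at `add_barrier` time and an arrival-state field, and the last arriving participant releases the whole group.

```python
# Simplified model of the stack-less scheme's barriers: a participation
# mask fixed at add_barrier time, plus a state set tracking arrivals.

class Barrier:
    def __init__(self, participants):
        self.mask = frozenset(participants)   # barrier participation mask
        self.arrived = set()                  # barrier state field

    def wait(self, tid):
        """Thread tid arrives and blocks; returns True once barrier state
        == barrier mask, i.e. the blocked threads may resume."""
        self.arrived.add(tid)
        return self.arrived == set(self.mask)

# b1 from the example: the odd-id half of an 8-thread warp.
odd = [t for t in range(8) if t % 2 == 1]
b1 = Barrier(odd)
releases = [b1.wait(t) for t in odd]          # only the last wait releases
```

Here `releases` is `[False, False, False, True]`: threads 1, 3, and 5 block at the `wait`, and thread 7's arrival makes the state equal the mask.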
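
The two scheduling policies from the Warp Scheduling section can be contrasted in a few lines. The `stalled` predicate (standing in for e.g. an outstanding async memory request after a cache miss) and the list-position-as-age convention are assumptions for illustration only.

```python
# Warp scheduling policy sketches. `warps` is ordered oldest-first, so a
# warp's list position stands in for its age.

def round_robin(warps, last):
    # RR: every warp gets equal attention, one after another.
    return warps[(warps.index(last) + 1) % len(warps)]

def greedy_then_oldest(warps, current, stalled):
    # GTO: keep issuing from the current warp until it stalls, then
    # switch to the oldest warp that is ready.
    if not stalled(current):
        return current
    for w in warps:
        if not stalled(w):
            return w
    return None  # all warps stalled -> idle cycle
```

E.g. with `warps = ["w0", "w1"]`, GTO keeps returning the current warp until `stalled` reports it blocked, then falls back to `"w0"` as the oldest ready warp.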
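
The scoreboard check from the Instruction Flow section, again as an assumption-level sketch (class and method names are mine): a fetched instruction held in the I-Buf may issue only if none of its source or destination registers appear in the scoreboard, and a completing instruction clears its entry, waking up held instructions.

```python
# Per-warp scoreboard with a small fixed number of entries (3-4 in the
# notes), each holding the id of a register with an in-flight write.

class Scoreboard:
    def __init__(self, size=4):
        self.size = size
        self.entries = set()      # register ids, e.g. {1, 5, 8} for R1, R5, R8

    def can_issue(self, srcs, dst):
        # Hold the instruction in the I-Buf if any source or destination
        # register matches a scoreboard entry.
        return not (set(srcs) | {dst}) & self.entries

    def issue(self, dst):
        assert len(self.entries) < self.size, "scoreboard full"
        self.entries.add(dst)     # reserve the destination register

    def retire(self, dst):
        self.entries.discard(dst) # finished write frees held instructions
```

For example, after `issue(1)` an instruction reading R1 is held (`can_issue([1, 2], 3)` is false) until `retire(1)` clears the entry.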