Diffstat (limited to 'spec')
-rw-r--r--  spec/notes.txt  201
1 file changed, 201 insertions, 0 deletions
diff --git a/spec/notes.txt b/spec/notes.txt
index 6c5043d..c7e10ca 100644
--- a/spec/notes.txt
+++ b/spec/notes.txt
@@ -1 +1,202 @@
*** Research Notes for the Project ***
+
+
++---------------------------------------------------------------+
+| NOTE: INCLUDES ONLY THE ABSOLUTE BASICS OF A GPU ARCHITECTURE |
++---------------------------------------------------------------+
+
+
+* GPU Parts:
+ 1. Compute Cores
+ 2. Memory Controllers/Partitions
+ 3. Driver
+
+* High-Level Architecture:
+  1. Compute Cores
+     + threads grouped into warps (e.g. 32 threads)
+     + warps executed separately
+     + when a warp's threads diverge at a branch -> [Thread Reconvergence]
+ + for selecting which warp to execute -> [Warp Scheduling]
+ + each warp can execute multiple instructions before switching -> [Instruction Flow]
+ 2. Memory Controllers
+ +
+
+
+* Thread Reconvergence
+ Two Options:
+ 1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
+     + Executes the divergent paths serially
+     + Has an "Active Mask" -> which threads in the warp should execute the instruction
+     + Implemented via a stack that stores | Reconv. PC | Next PC | Active Mask | columns
+     + Top of the stack (TOS) -> the active mask to be used
+     + When branch divergence occurs -> set the next PC of the current TOS to the
+       reconvergence PC ; push two new rows for the branch targets
+     + The TOS should be the branch with the most active threads when diverging
+     + Note: determining the reconvergence PCs (immediate post-dominators of the
+       branches) likely needs some post-compilation analysis, or maybe can be done at runtime; idk yet
+ + Example:
+ ```
+ Assuming the following kernel:
+
+ A;
+ if (id % 2 == 0) {
+ B;
+ if (id == 0) {
+ C;
+ } else {
+ D;
+ }
+ E;
+ } else {
+ F;
+ }
+ G;
+
+ Assume 4 threads in a warp (ids from 0 to 3)
+
+ Initially the stack looks like this:
+ reconv. PC | next PC | Active Mask
+ null | A | 1 1 1 1 <- TOS
+
+ When 1st branch divergence occurs:
+ reconv. PC | next PC | Active Mask
+ null | G | 1 1 1 1 <- change A to G since that's where threads will reconv.
+ G | F | 0 1 0 1
+ G | B | 1 0 1 0 <- TOS
+
+ When B finishes executing, another divergence
+ reconv. PC | next PC | Active Mask
+ null | G | 1 1 1 1
+ G | F | 0 1 0 1
+ G | E | 1 0 1 0 <- change to E
+ E | D | 0 0 1 0
+ E | C | 1 0 0 0 <- TOS
+
+ After C finishes and PC goes to E -> pop the stack
+ reconv. PC | next PC | Active Mask
+ null | G | 1 1 1 1
+ G | F | 0 1 0 1
+ G | E | 1 0 1 0
+ E | D | 0 0 1 0 <- TOS
+
+       D finishes; PC reaches E -> pop the stack
+ reconv. PC | next PC | Active Mask
+ null | G | 1 1 1 1
+ G | F | 0 1 0 1
+ G | E | 1 0 1 0 <- TOS
+
+ ...
+
+       and so on, until the remaining rows are popped and all threads reconverge at G
+
+ ```
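The stack walkthrough above can be replayed in code. Below is my own minimal Python sketch (not hardware-accurate): block names A..G come from the example kernel, and the reconvergence PCs are hardcoded as the post-dominators of the two branches.

```python
# Toy SIMT-stack simulation of the 4-thread example kernel above.

def next_block(block, tid):
    """Successor of each basic block for a given thread id."""
    if block == "A":
        return "B" if tid % 2 == 0 else "F"
    if block == "B":
        return "C" if tid == 0 else "D"
    if block in ("C", "D"):
        return "E"
    if block in ("E", "F"):
        return "G"
    return None  # G is the exit block

RECONV = {"A": "G", "B": "E"}  # branch block -> reconvergence PC

def run(num_threads=4):
    trace = []  # (block, active thread ids) in execution order
    # stack rows: (reconv PC, next PC, active mask as a frozenset of tids)
    stack = [(None, "A", frozenset(range(num_threads)))]
    while stack:
        reconv, pc, mask = stack[-1]
        if pc == reconv:  # reached the reconvergence point -> pop
            stack.pop()
            continue
        trace.append((pc, sorted(mask)))
        targets = {}
        for tid in sorted(mask):
            targets.setdefault(next_block(pc, tid), set()).add(tid)
        if None in targets:         # exit block executed
            stack.pop()
        elif len(targets) == 1:     # no divergence: advance the TOS next PC
            (nxt,) = targets
            stack[-1] = (reconv, nxt, mask)
        else:                       # divergence: retarget TOS to reconv PC,
            r = RECONV[pc]          # then push one row per branch target
            stack[-1] = (reconv, r, mask)
            # push the taken path last so it ends up on top, matching the
            # walkthrough (a real GPU would order by active-thread count)
            for nxt, tids in reversed(list(targets.items())):
                stack.append((r, nxt, frozenset(tids)))
    return trace
```

Running `run()` reproduces the walkthrough: blocks execute in the order A, B, C, D, E, F, G with the same active masks as the tables above.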
+ 2) [SIMT] Stack-less: (harder to implement ; better performance ; no deadlocks w/ yielding)
+     + Add a barrier participation mask
+     + Thread active bits -> track whether each thread is currently executing
+       => gives the groups of threads that the scheduler can choose between to execute
+          (threads grouped by common PC)
+     + Thread states -> ready to execute or blocked (no yielding yet, so deadlocks are still possible)
+     + When reaching an "add_barrier [barrier]" instruction ->
+       -> create a new barrier participation mask containing all the active threads
+     + When reaching a "wait [barrier]" instruction ->
+       -> execute NOPs until all threads in the barrier have reached this point as well
+          (tracked using a barrier state field)
+ + Example:
+ ```
+ Assume the following kernel:
+
+ A;
+ add_barrier b0;
+ if (id % 2 == 0) {
+ B;
+ } else {
+ C;
+ add_barrier b1;
+ if (id == 1) {
+ D;
+ } else {
+ E;
+ }
+ wait b1;
+ F;
+ }
+ wait b0;
+ G;
+
+ Imagine 8 threads per warp
+ During A:
+       All threads are active:
+
+ 0 1 2 3 4 5 6 7
+ ---------------
+ 1 1 1 1 1 1 1 1 <- active threads
+
+       Adding barrier b0:
+       Creating new barrier participation mask:
+
+       0 1 2 3 4 5 6 7
+       ---------------
+       1 1 1 1 1 1 1 1 <- b0 barrier mask
+       0 0 0 0 0 0 0 0 <- b0 barrier state [initial]
+
+
+       Half of the threads go to B and the other half to C => two thread groups
+
+ 0 1 2 3 4 5 6 7
+ ---------------
+ 1 0 1 0 1 0 1 0 <- warp split (two groups)
+
+ Execute one thread group at a time
+
+       Adding new barrier b1:
+       Active threads are the half executing C, with odd thread ids
+
+       0 1 2 3 4 5 6 7
+       ---------------
+       0 1 0 1 0 1 0 1 <- b1 barrier mask
+       0 0 0 0 0 0 0 0 <- b1 barrier state [initial]
+
+       One thread (id 1) jumps to D and the other odd-id threads jump to E
+       => now there are three thread groups to schedule and execute
+
+ Reaching `wait b1`
+       Threads that reach this -> get state blocked (not executable) =>
+       => thus, when picking a thread group to execute, the blocked threads get "ignored"
+
+       When the b1 barrier state == the b1 barrier mask,
+       make all of its threads executable again => two thread groups remaining
+       (since the threads are at two different PCs)
+
+       Similar when waiting for the b0 barrier
+ ```
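The barrier mechanics above can likewise be sketched. This is a rough Python model of my own (not a real GPU's implementation): each thread's program is unrolled into a straight-line trace, and thread groups are formed by common next operation, standing in for "group threads by same PC".

```python
# Toy stack-less reconvergence model: 8 threads, per-thread PCs,
# barrier participation masks and barrier states, blocked-thread set.
from collections import defaultdict

def program(tid):
    """Straight-line trace of ops this thread executes in the kernel above."""
    ops = ["A", ("add_barrier", "b0")]
    if tid % 2 == 0:
        ops += ["B"]
    else:
        ops += ["C", ("add_barrier", "b1")]
        ops += ["D"] if tid == 1 else ["E"]
        ops += [("wait", "b1"), "F"]
    ops += [("wait", "b0"), "G"]
    return ops

def run(num_threads=8):
    progs = {t: program(t) for t in range(num_threads)}
    pc = {t: 0 for t in progs}
    blocked = set()           # threads waiting on a barrier
    masks, states = {}, {}    # barrier name -> set of thread ids
    trace = []
    while any(pc[t] < len(progs[t]) for t in progs):
        # group the runnable threads by their next op; pick the largest group
        groups = defaultdict(list)
        for t in progs:
            if t not in blocked and pc[t] < len(progs[t]):
                groups[progs[t][pc[t]]].append(t)
        if not groups:
            raise RuntimeError("deadlock: every remaining thread is blocked")
        op, tids = max(groups.items(), key=lambda kv: len(kv[1]))
        if isinstance(op, str):              # ordinary block: "execute" it
            trace.append((op, tids))
        elif op[0] == "add_barrier":         # new participation mask
            masks[op[1]] = set(tids)
            states[op[1]] = set()
        else:                                # ("wait", b)
            b = op[1]
            states[b] |= set(tids)
            if states[b] == masks[b]:        # barrier full -> release all
                blocked -= masks[b]
            else:                            # NOP/block until it fills
                blocked |= set(tids)
        for t in tids:
            pc[t] += 1
    return trace
```

As in the walkthrough, the evens run B and then block on b0, the odds run C/D/E/F gated by b1, and all eight threads reconverge to execute G together.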
+
+
+* Warp Scheduling
+ + Round-Robin (RR) scheduling
+ > same # of instructions executed per warp => all warps get equal attention
+ + Greedy-then-Oldest (GTO) scheduling
+    > keep running the same warp (greedy); when it stalls -> switch to the oldest ready warp
+ + Choose whichever one for now, upgrade later if needed
+  ------
+  Ex: (GTO) keep executing a warp's instructions until it issues an async mem request that misses in the cache
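The two policies boil down to different "pick the next warp" functions. A hypothetical sketch (warps here are just dicts with id/age/stalled fields; any real scheduler works on hardware state):

```python
# Toy warp-selection policies; assumes warps[i]["id"] == i.

def pick_rr(warps, last):
    """Round-robin: the next ready warp after the one scheduled last."""
    n = len(warps)
    for i in range(1, n + 1):
        w = warps[(last + i) % n]
        if not w["stalled"]:
            return w["id"]
    return None  # every warp is stalled

def pick_gto(warps, current):
    """Greedy-then-oldest: stay on the current warp until it stalls,
    then fall back to the oldest (lowest age value) ready warp."""
    if current is not None and not warps[current]["stalled"]:
        return current
    ready = [w for w in warps if not w["stalled"]]
    return min(ready, key=lambda w: w["age"])["id"] if ready else None
```

So with RR every ready warp gets a turn in id order, while GTO only leaves the current warp on a stall.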
+
+
+* Instruction Flow
+  + Each warp has access to an "Instruction Buffer" (I-Buf)
+  + Each warp has a "Scoreboard" assigned to it
+    with 3 or 4 entries, each entry holding the id of a register
+    (the destination registers of in-flight instructions)
+    Ex: containing 3 entries R1, R5, R8
+  + Instructions get fetched from the Instruction Cache (I-Cache)
+  + If any of the source or destination registers of the instruction
+    matches a scoreboard entry, then hold the instr. inside the buf
+    but do not yet execute it
+  + When the current instruction finishes executing, it clears its
+    destination register from the scoreboard, which can make some
+    instructions in the buffer eligible for execution
+  + When the I-Buf is full, stall / drop the fetched instruction
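A small sketch of the scoreboard check described above (my assumed behavior, not a specific GPU's design): an instruction is a (dest, sources) pair, and issue is blocked while it touches any pending register or the scoreboard is full.

```python
# Toy per-warp scoreboard for tracking in-flight destination registers.

class Scoreboard:
    def __init__(self, size=3):
        self.size = size
        self.pending = set()   # destination regs of in-flight instructions

    def can_issue(self, instr):
        dst, srcs = instr
        # hold the instruction in the I-Buf if it reads or writes a
        # pending register, or if there is no free scoreboard entry
        if self.pending & ({dst} | set(srcs)):
            return False
        return len(self.pending) < self.size

    def issue(self, instr):
        assert self.can_issue(instr)
        self.pending.add(instr[0])

    def complete(self, instr):
        self.pending.discard(instr[0])  # may free dependents in the I-Buf
```

Example: an instruction reading R1 waits in the buffer until the instruction writing R1 completes and clears its scoreboard entry.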
+
+
+===================================
+
+References:
+ + Main one as of now: General-Purpose Graphics Processor Architecture (2018) Book
+ https://link.springer.com/book/10.1007/978-3-031-01759-9