author    Davit Grigoryan <[email protected]>  2024-09-26 21:48:12 -0700
committer Davit Grigoryan <[email protected]>  2024-09-26 21:48:12 -0700
commit    6daa51bb42c6f26a8159a1b5dfd2cdca820e24de (patch)
tree      08a04a67563ba9dbc9f34bae285816464f0b0028 /spec/notes.txt
parent    9c727894188c93b3e6292f97064ae30d95af2848 (diff)
add notes for thread divergence, warp scheduling, instruction flow
Diffstat (limited to 'spec/notes.txt')
-rw-r--r--  spec/notes.txt | 201
 1 file changed, 201 insertions(+), 0 deletions(-)
diff --git a/spec/notes.txt b/spec/notes.txt
index 6c5043d..c7e10ca 100644
--- a/spec/notes.txt
+++ b/spec/notes.txt
@@ -1 +1,202 @@
 *** Research Notes for the Project ***
+
+
++---------------------------------------------------------------+
+| NOTE: INCLUDES ONLY THE ABSOLUTE BASICS OF A GPU ARCHITECTURE |
++---------------------------------------------------------------+
+
+
+* GPU Parts:
+  1. Compute Cores
+  2. Memory Controllers/Partitions
+  3. Driver
+
+* High-Level Architecture:
+  1. Compute Cores
+   + threads grouped into warps (e.g. 32 threads)
+   + warps executed separately
+   + on branch divergence among a warp's threads -> [Thread Reconvergence]
+   + for selecting which warp to execute -> [Warp Scheduling]
+   + each warp can execute multiple instructions before switching -> [Instruction Flow]
+  2. Memory Controllers
+
+
+* Thread Reconvergence
+  Two options:
+  1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
+   + runs through all the instructions serially
+   + has an "Active Mask" -> which threads in the warp should execute the instruction
+   + implemented via a stack that stores | Reconv. PC | Next PC | Active Mask | columns
+   + top of the stack (TOS) -> the active mask to be used
+   + on branch divergence -> change the next PC of the current TOS ; push two new rows for the new branches
+   + TOS should be the branch with the most active threads when diverging
+   + Note: seems like this might need some post-compilation inference work ??
+     (for determining return / reconv. PCs, idk yet, or maybe do it at runtime)
+
+   Example:
+   ```
+   Assume the following kernel:
+
+   A;
+   if (id % 2 == 0) {
+     B;
+     if (id == 0) {
+       C;
+     } else {
+       D;
+     }
+     E;
+   } else {
+     F;
+   }
+   G;
+
+   Assume 4 threads in a warp (ids from 0 to 3)
+
+   Initially the stack looks like this:
+   reconv. PC | next PC | Active Mask
+   null       | A       | 1 1 1 1     <- TOS
+
+   When the 1st branch divergence occurs:
+   reconv. PC | next PC | Active Mask
+   null       | G       | 1 1 1 1     <- change A to G since that's where threads will reconv.
+   G          | F       | 0 1 0 1
+   G          | B       | 1 0 1 0     <- TOS
+
+   When B finishes executing, another divergence:
+   reconv. PC | next PC | Active Mask
+   null       | G       | 1 1 1 1
+   G          | F       | 0 1 0 1
+   G          | E       | 1 0 1 0     <- change B to E
+   E          | D       | 0 0 1 0
+   E          | C       | 1 0 0 0     <- TOS
+
+   After C finishes and its PC reaches E -> pop the stack:
+   reconv. PC | next PC | Active Mask
+   null       | G       | 1 1 1 1
+   G          | F       | 0 1 0 1
+   G          | E       | 1 0 1 0
+   E          | D       | 0 0 1 0     <- TOS
+
+   D finishes; pop the stack (D's PC reaches E):
+   reconv. PC | next PC | Active Mask
+   null       | G       | 1 1 1 1
+   G          | F       | 0 1 0 1
+   G          | E       | 1 0 1 0     <- TOS
+
+   ...
+
+   and so on, until reaching PC G
+   ```
+
+  2) [SIMT] Stack-less: (harder to implement ; better performance ; no deadlocks w/ yielding)
+   + add a barrier participation mask
+   + thread active bits -> track whether each thread is being executed or not
+     => gives the scheduler groups of threads to decide between for execution
+     (threads grouped by same PC)
+   + thread states -> ready to execute or blocked (no yielding yet for deadlock avoidance)
+   + when reaching an "add_barrier [barrier]" instruction ->
+     -> create a new barrier mask that includes all the currently active threads
+   + when reaching "wait [barrier]" ->
+     -> execute NOPs until all threads in the barrier reach this point as well
+     (tracked using a barrier state field)
+
+   Example:
+   ```
+   Assume the following kernel:
+
+   A;
+   add_barrier b0;
+   if (id % 2 == 0) {
+     B;
+   } else {
+     C;
+     add_barrier b1;
+     if (id == 1) {
+       D;
+     } else {
+       E;
+     }
+     wait b1;
+     F;
+   }
+   wait b0;
+   G;
+
+   Assume 8 threads per warp
+
+   During A:
+   All threads are active:
+
+   0 1 2 3 4 5 6 7
+   ---------------
+   1 1 1 1 1 1 1 1  <- active threads
+
+   Adding barrier b0:
+   Create a new barrier participation mask:
+
+   0 1 2 3 4 5 6 7
+   ---------------
+   1 1 1 1 1 1 1 1  <- b0 barrier mask
+   0 0 0 0 0 0 0 0  <- b0 barrier state [initial]
+
+   Half of the threads go to B and the other half to C => two thread groups
+
+   0 1 2 3 4 5 6 7
+   ---------------
+   1 0 1 0 1 0 1 0  <- warp split (two groups)
+
+   Execute one thread group at a time
+
+   Adding new barrier b1:
+   Active threads are the half executing C, with odd thread ids:
+
+   0 1 2 3 4 5 6 7
+   ---------------
+   0 1 0 1 0 1 0 1  <- b1 barrier mask
+   0 0 0 0 0 0 0 0  <- b1 barrier state [initial]
+
+   One thread jumps to D and the other odd-id threads jump to E
+   => now there are three thread groups to schedule and execute
+
+   Reaching `wait b1`:
+   Threads that reach this get the state "blocked" (not executable) =>
+   => thus, when executing a thread group, the blocked threads get "ignored"
+
+   When the b1 barrier state == the b1 barrier mask,
+   make all of its threads executable again => two thread groups remain
+   (since there are two different PCs for threads to execute at)
+
+   Similar when waiting for the b0 barrier
+   ```
+
+
+* Warp Scheduling
+ + Round-Robin (RR) scheduling
+   > same # of instructions executed per warp => all warps get equal attention
+ + Greedy-then-Oldest (GTO) scheduling
+   > when a warp stalls -> offload it and load in the oldest warp
+ + Choose either one for now, upgrade later if needed
+   ------
+   Ex: execute a warp's instructions until an async mem request is issued (w/ a cache miss)
+
+
+* Instruction Flow
+ + each warp has access to an "Instruction Buffer" (I-Buf)
+ + each warp has a "Scoreboard" assigned to it
+   with 3 or 4 entries, each entry holding the id of a register
+   Ex: containing 3 entries R1, R5, R8
+ + instructions get fetched from the Instruction Cache (I-Cache)
+ + if any of the source or destination registers of the instruction
+   matches one of the scoreboard entries, then hold the instr.
+   inside the buf, but do not yet execute it
+ + when the current instruction finishes executing, it clears some of the
+   registers in the scoreboard, making some instructions in the buffer
+   eligible for execution
+ + when the I-Buf is full, stall / drop the fetched instruction
+
+
+===================================
+
+References:
+ + Main one as of now: General-Purpose Graphics Processor Architecture (2018) book
+   https://link.springer.com/book/10.1007/978-3-031-01759-9
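
The stack-based walk-through in the notes can be sketched as a toy simulator. This is an illustrative model, not the book's implementation: the dict-based CFG encoding, the function name `run_kernel`, and the mask-as-tuple representation are my own assumptions; the block labels, branch predicates, and 4-thread masks match the example kernel above.

```python
# Toy model of stack-based SIMT reconvergence for the example kernel.
# Each stack entry is [reconv PC, next PC, active mask]; thread 0 is the
# leftmost element of the mask, matching the tables in the notes.

def run_kernel(num_threads=4):
    # Fall-through successors for the non-branching blocks.
    succ = {"C": "E", "D": "E", "E": "G", "F": "G", "G": None}
    # Branch blocks: (per-thread predicate, taken PC, not-taken PC, reconv PC).
    branches = {
        "A": (lambda i: i % 2 == 0, "B", "F", "G"),
        "B": (lambda i: i == 0, "C", "D", "E"),
    }
    stack = [[None, "A", (1,) * num_threads]]
    trace = []                            # (block, active mask) in issue order
    while stack:
        top = stack[-1]
        pc, mask = top[1], top[2]
        trace.append((pc, mask))
        if pc in branches:                # divergence: rewrite TOS, push 2 rows
            pred, t_pc, nt_pc, reconv = branches[pc]
            taken = tuple(m if pred(i) else 0 for i, m in enumerate(mask))
            not_taken = tuple(m - t for m, t in zip(mask, taken))
            top[1] = reconv               # TOS next PC -> reconvergence point
            stack.append([reconv, nt_pc, not_taken])
            stack.append([reconv, t_pc, taken])
        elif succ[pc] == top[0]:          # reached this entry's reconv PC -> pop
            stack.pop()
        else:
            top[1] = succ[pc]
    return trace
```

Running it reproduces the execution order A, B, C, D, E, F, G with the same masks as the tables (e.g. the first trace entry is `("A", (1, 1, 1, 1))`).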
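
The barrier bookkeeping of the stack-less scheme can likewise be sketched. This is a simplified software model under my own naming (`Barrier`, `wait`), not hardware detail from the book: each barrier holds a participation mask fixed at `add_barrier` time and an arrival-state field, and the last arriving participant releases the whole group.

```python
# Simplified model of the stack-less scheme's barriers: a participation
# mask fixed at add_barrier time, plus a state set tracking arrivals.

class Barrier:
    def __init__(self, participants):
        self.mask = frozenset(participants)   # barrier participation mask
        self.arrived = set()                  # barrier state field

    def wait(self, tid):
        """Thread tid arrives and blocks; returns True once barrier state
        == barrier mask, i.e. the blocked threads may resume."""
        self.arrived.add(tid)
        return self.arrived == set(self.mask)

# b1 from the example: the odd-id half of an 8-thread warp.
odd = [t for t in range(8) if t % 2 == 1]
b1 = Barrier(odd)
releases = [b1.wait(t) for t in odd]          # only the last wait releases
```

Here `releases` is `[False, False, False, True]`: threads 1, 3, and 5 block at the `wait`, and thread 7's arrival makes the state equal the mask.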
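
The two scheduling policies from the Warp Scheduling section can be contrasted in a few lines. The `stalled` predicate (standing in for e.g. an outstanding async memory request after a cache miss) and the list-position-as-age convention are assumptions for illustration only.

```python
# Warp scheduling policy sketches. `warps` is ordered oldest-first, so a
# warp's list position stands in for its age.

def round_robin(warps, last):
    # RR: every warp gets equal attention, one after another.
    return warps[(warps.index(last) + 1) % len(warps)]

def greedy_then_oldest(warps, current, stalled):
    # GTO: keep issuing from the current warp until it stalls, then
    # switch to the oldest warp that is ready.
    if not stalled(current):
        return current
    for w in warps:
        if not stalled(w):
            return w
    return None  # all warps stalled -> idle cycle
```

E.g. with `warps = ["w0", "w1"]`, GTO keeps returning the current warp until `stalled` reports it blocked, then falls back to `"w0"` as the oldest ready warp.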
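
The scoreboard check from the Instruction Flow section, again as an assumption-level sketch (class and method names are mine): a fetched instruction held in the I-Buf may issue only if none of its source or destination registers appear in the scoreboard, and a completing instruction clears its entry, waking up held instructions.

```python
# Per-warp scoreboard with a small fixed number of entries (3-4 in the
# notes), each holding the id of a register with an in-flight write.

class Scoreboard:
    def __init__(self, size=4):
        self.size = size
        self.entries = set()      # register ids, e.g. {1, 5, 8} for R1, R5, R8

    def can_issue(self, srcs, dst):
        # Hold the instruction in the I-Buf if any source or destination
        # register matches a scoreboard entry.
        return not (set(srcs) | {dst}) & self.entries

    def issue(self, dst):
        assert len(self.entries) < self.size, "scoreboard full"
        self.entries.add(dst)     # reserve the destination register

    def retire(self, dst):
        self.entries.discard(dst) # finished write frees held instructions
```

For example, after `issue(1)` an instruction reading R1 is held (`can_issue([1, 2], 3)` is false) until `retire(1)` clears the entry.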