*** Research Notes for the Project ***

+---------------------------------------------------------------+
| NOTE: INCLUDES ONLY THE ABSOLUTE BASICS OF A GPU ARCHITECTURE |
+---------------------------------------------------------------+

* GPU Parts:
  1. Compute Cores
  2. Memory Controllers/Partitions
  3. Driver

* High-Level Architecture:
  1. Compute Cores
     + threads grouped into warps (e.g. 32 threads)
     + warps executed separately
     + when a warp's threads branch-diverge -> [Thread Reconvergence]
     + for selecting which warp to execute -> [Warp Scheduling]
     + each warp can execute multiple instructions before switching -> [Instruction Flow]
  2. Memory Controllers
     +

* Thread Reconvergence
  Two Options:

  1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
     + Executes the divergent paths one after another (serially)
     + Has an "Active Mask" -> which threads in the warp should execute the current instruction
     + Implemented via a stack whose rows store | Reconv. PC | Next PC | Active Mask | columns
     + Top of the stack (TOS) -> provides the active mask to be used
     + On branch divergence -> change the next PC of the current TOS to the reconvergence PC ;
       push two new rows, one per branch target
     + TOS should be the branch with the most active threads when diverging
     + Note: seems like some post-compilation analysis might be needed to determine the
       reconvergence PCs (the usual choice is the immediate post-dominator of the branch),
       or maybe it can be done at runtime -- TBD
     + Example (a small code sketch of this stack follows the walkthrough):
       ```
       Assume the following kernel:

       A;
       if (id % 2 == 0) {
           B;
           if (id == 0) {
               C;
           } else {
               D;
           }
           E;
       } else {
           F;
       }
       G;

       Assume 4 threads in a warp (ids from 0 to 3)

       Initially the stack looks like this:

       reconv. PC | next PC | Active Mask
       null       | A       | 1 1 1 1      <- TOS

       When the 1st branch divergence occurs:

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1      <- change A to G since that's where the threads will reconverge
       G          | F       | 0 1 0 1
       G          | B       | 1 0 1 0      <- TOS

       When B finishes executing, another divergence:

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1
       G          | F       | 0 1 0 1
       G          | E       | 1 0 1 0      <- change B to E
       E          | D       | 0 0 1 0
       E          | C       | 1 0 0 0      <- TOS

       After C finishes and the PC reaches E -> pop the stack:

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1
       G          | F       | 0 1 0 1
       G          | E       | 1 0 1 0
       E          | D       | 0 0 1 0      <- TOS

       D finishes; pop the stack (D's PC reaches E):

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1
       G          | F       | 0 1 0 1
       G          | E       | 1 0 1 0      <- TOS

       ... and so on, until the PC reaches G
       ```
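     + Toy code sketch of the stack above (my own sketch, not from the book; names like
       Warp, StackEntry, on_divergence are made up, and PCs are plain ints standing in
       for the labels A..G):
       ```
       // Toy model of the SIMT reconvergence stack (illustration only).
       #include <bitset>
       #include <cstdio>
       #include <utility>
       #include <vector>

       constexpr int WARP_SIZE = 4;
       using Mask = std::bitset<WARP_SIZE>;   // one bit per thread, written like the tables above

       struct StackEntry {
           int  reconv_pc;   // where the diverged paths meet again (-1 = none)
           int  next_pc;     // PC this path runs next
           Mask active;      // which threads execute on this path
       };

       struct Warp {
           std::vector<StackEntry> stack;

           StackEntry& tos() { return stack.back(); }

           // Branch divergence: retarget the current TOS to the reconvergence PC and
           // push one new entry per branch target. The group with more active threads
           // is pushed last so it ends up at the TOS (see the note above).
           void on_divergence(int reconv_pc, int pc_a, Mask m_a, int pc_b, Mask m_b) {
               tos().next_pc = reconv_pc;
               if (m_a.count() > m_b.count()) { std::swap(pc_a, pc_b); std::swap(m_a, m_b); }
               stack.push_back({reconv_pc, pc_a, m_a});
               stack.push_back({reconv_pc, pc_b, m_b});
           }

           // The executing path reached the TOS reconvergence PC: pop that entry and
           // continue at whatever the new TOS wants to run next.
           int on_reconvergence() {
               stack.pop_back();
               return tos().next_pc;
           }

           void dump() const {
               for (const StackEntry& e : stack)
                   std::printf("reconv=%2d  next=%2d  mask=%s\n",
                               e.reconv_pc, e.next_pc, e.active.to_string().c_str());
               std::printf("---\n");
           }
       };

       int main() {
           enum { A, B, C, D, E, F, G };                 // PCs stand in for the kernel labels
           Warp w;
           w.stack.push_back({-1, A, Mask("1111")});     // initially all 4 threads at A
           w.dump();

           // 1st divergence at `id % 2`: odd ids -> F, even ids -> B, reconverge at G.
           w.on_divergence(G, F, Mask("0101"), B, Mask("1010"));
           w.dump();

           // 2nd divergence at `id == 0`: thread 2 -> D, thread 0 -> C, reconverge at E.
           w.on_divergence(E, D, Mask("0010"), C, Mask("1000"));
           w.dump();

           // C's path reaches E -> pop; the new TOS says D runs next.
           std::printf("next PC after pop: %d\n", w.on_reconvergence());
           w.dump();
           return 0;
       }
       ```
       Pushing the larger group last is just one way to honor the "most active threads
       at the TOS" note; with equal-sized groups (as in this example) the push order
       does not matter.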
  2) [SIMT] Stack-less: (harder to implement ; better performance ; no deadlocks once yielding is added)
     + Add a barrier participation mask per barrier
     + Thread active bits -> track whether each thread is currently being executed or not
       => gives the groups of threads the scheduler can pick between (threads grouped by having the same PC)
     + Thread states -> ready to execute or blocked (no yield state yet, so the deadlock-avoidance part is not covered here)
     + When an "add_barrier [barrier]" instruction is reached -> create a new barrier participation mask
       that includes all the currently active threads
     + When a "wait [barrier]" instruction is reached -> execute NOPs until all threads in the barrier
       have reached this point as well (tracked using a barrier state field)
     + Example:
       ```
       Assume the following kernel:

       A;
       add_barrier b0;
       if (id % 2 == 0) {
           B;
       } else {
           C;
           add_barrier b1;
           if (id == 1) {
               D;
           } else {
               E;
           }
           wait b1;
           F;
       }
       wait b0;
       G;

       Imagine 8 threads per warp

       During A: all threads are active:

       0 1 2 3 4 5 6 7
       ---------------
       1 1 1 1 1 1 1 1  <- active threads

       Adding barrier b0 creates a new barrier participation mask:

       0 1 2 3 4 5 6 7
       ---------------
       1 1 1 1 1 1 1 1  <- b0 barrier mask
       0 0 0 0 0 0 0 0  <- b0 barrier state [initial]

       Half of the threads go to B and the other half to C => two thread groups:

       0 1 2 3 4 5 6 7
       ---------------
       1 0 1 0 1 0 1 0  <- warp split into two groups (1 = even ids at B, 0 = odd ids at C)

       Execute one thread group at a time

       Adding the new barrier b1: the active threads are the odd-id half executing C:

       0 1 2 3 4 5 6 7
       ---------------
       0 1 0 1 0 1 0 1  <- b1 barrier mask
       0 0 0 0 0 0 0 0  <- b1 barrier state [initial]

       Thread 1 jumps to D and the other odd-id threads jump to E
       => now there are three thread groups to schedule and execute

       Reaching `wait b1`:
       Each thread that reaches it sets its bit in the b1 barrier state and becomes blocked (not executable)
       => so when executing a thread group, the blocked threads get "ignored"

       When the b1 barrier state == the b1 barrier mask,
       make those threads executable again
       => two thread groups remain (since the threads are now at two different PCs)

       Similar when waiting on the b0 barrier
       ```

* Warp Scheduling
  + Round-Robin (RR) scheduling
    > same # of instructions executed per warp => all warps get equal attention
  + Greedy-then-Oldest (GTO) scheduling
    > keep issuing from the same warp until it stalls -> then switch to the oldest ready warp
  + Choose whichever one for now, upgrade later if needed
    ------ Ex: execute a warp's instructions until an async mem request is issued (w/ a cache miss)

* Instruction Flow
  + Each warp has access to an "Instruction Buffer" (I-Buf)
  + Each warp has a "Scoreboard" assigned to it with 3 or 4 entries, each entry holding the id of a register
    Ex: containing 3 entries R1, R5, R8
  + Instructions get fetched from the Instruction Cache (I-Cache) into the I-Buf
  + When an instruction is issued, its destination register id gets added to the scoreboard
    (that is how entries end up there in the first place)
  + If any of the source or destination registers of a fetched instruction matches a scoreboard entry,
    then hold the instruction inside the I-Buf but do not execute it yet
  + When the in-flight instruction finishes executing, it clears its destination register(s) from the
    scoreboard, making some instructions in the buffer eligible for execution
  + When the I-Buf is full, stall / drop the fetched instruction
  + (a toy code sketch of the scoreboard + I-Buf is at the very end, after the references)

===================================
References:
  + Main one as of now:
    General-Purpose Graphics Processor Architecture (2018) book
    https://link.springer.com/book/10.1007/978-3-031-01759-9
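===================================
Extra: toy code sketch (my own, not from the book) of the per-warp scoreboard + I-Buf
idea from the Instruction Flow section. Names (Scoreboard, Instr, try_issue, ...) are
made up; issue is kept in order per warp, and a real design would switch to another
warp instead of just printing "hold".
```
// Toy per-warp scoreboard + instruction buffer (illustration only).
#include <algorithm>
#include <cstdio>
#include <deque>
#include <vector>

struct Instr {
    const char*      name;
    int              dst;    // destination register id
    std::vector<int> srcs;   // source register ids
};

struct Scoreboard {
    static constexpr int kMaxEntries = 3;   // the notes above say 3 or 4 entries per warp
    std::vector<int> pending;               // register ids with writes still in flight

    bool full() const { return (int)pending.size() >= kMaxEntries; }
    bool blocks(const Instr& i) const {     // any src/dst register matching a pending one?
        for (int r : pending) {
            if (r == i.dst) return true;
            for (int s : i.srcs) if (s == r) return true;
        }
        return false;
    }
    void reserve(int reg) { pending.push_back(reg); }   // on issue
    void release(int reg) {                             // on writeback
        pending.erase(std::remove(pending.begin(), pending.end(), reg), pending.end());
    }
};

int main() {
    Scoreboard sb;
    std::deque<Instr> ibuf = {              // pretend these were already fetched from the I-Cache
        {"LD  R1, [addr]", 1, {}},
        {"ADD R5, R1, R2", 5, {1, 2}},      // depends on the load into R1
        {"MUL R8, R3, R4", 8, {3, 4}},      // independent of the load
    };

    // Try to issue the oldest buffered instruction; if the scoreboard blocks it,
    // it stays in the I-Buf (a real scheduler would pick another warp meanwhile).
    auto try_issue = [&]() {
        if (ibuf.empty()) return;
        const Instr& i = ibuf.front();
        if (sb.full() || sb.blocks(i)) {
            std::printf("hold : %s\n", i.name);
            return;
        }
        std::printf("issue: %s\n", i.name);
        sb.reserve(i.dst);                  // dst stays pending until the writeback
        ibuf.pop_front();
    };

    try_issue();     // LD issues; R1 becomes pending in the scoreboard
    try_issue();     // ADD reads R1 -> held in the I-Buf
    sb.release(1);   // the load writes back; R1 gets cleared from the scoreboard
    try_issue();     // ADD is eligible now and issues
    try_issue();     // MUL was never blocked; issues as well
    return 0;
}
```
Running it shows the ADD being held in the buffer until the load's R1 entry is
released from the scoreboard, while the independent MUL is never blocked.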