*** Research Notes for the Project ***

+---------------------------------------------------------------+
| NOTE: INCLUDES ONLY THE ABSOLUTE BASICS OF A GPU ARCHITECTURE |
+---------------------------------------------------------------+

* GPU Parts:
  1. Compute Cores
  2. Memory Controllers/Partitions
  3. Driver

* High-Level Architecture:
  1. Computing Cores
     + threads grouped into warps (e.g. 32 threads)
     + warps executed separately
     + when branch divergence occurs among a warp's threads -> [Thread Reconvergence]
     + for selecting which warp to execute next -> [Warp Scheduling]
     + each warp can execute multiple instructions before switching -> [Instruction Flow]
  2. Memory Controllers
     +

===================================

+-----------------------+
|       SIMT Core       |
+-----------------------+

* Thread Reconvergence
  Two Options:

  1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
     + Runs through all the instructions serially
     + Has an "Active Mask" -> which threads in the warp should execute the instruction
     + Implemented via a stack that stores | Return PC | Next PC | Active Mask | columns
     + Top of the stack (TOS) -> the active mask to be used
     + When branch divergence occurs -> change the next PC of the current TOS ; add two new rows for the new branches
     + TOS should be the branch with the most active threads when diverging
     + Note: seems like some post-compilation analysis might be needed (for determining the return / reconvergence PCs; not sure yet, or maybe do it at runtime)
     + For warp-splits, implement a Multi-Path Stack, so nothing is left idling
     + Example (a small code sketch follows it):

       ```
       Assuming the following kernel:

       A;
       if (id % 2 == 0) {
           B;
           if (id == 0) {
               C;
           } else {
               D;
           }
           E;
       } else {
           F;
       }
       G;

       Assume 4 threads in a warp (ids from 0 to 3)

       Initially the stack looks like this:

       reconv. PC | next PC | Active Mask
       null       | A       | 1 1 1 1      <- TOS

       When the 1st branch divergence occurs:

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1      <- change A to G since that's where the threads will reconverge
       G          | F       | 0 1 0 1
       G          | B       | 1 0 1 0      <- TOS

       When B finishes executing, another divergence:

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1
       G          | F       | 0 1 0 1
       G          | E       | 1 0 1 0      <- change to E
       E          | D       | 0 0 1 0
       E          | C       | 1 0 0 0      <- TOS

       After C finishes and its PC reaches E -> pop the stack:

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1
       G          | F       | 0 1 0 1
       G          | E       | 1 0 1 0
       E          | D       | 0 0 1 0      <- TOS

       D finishes; pop the stack (its PC reaches E):

       reconv. PC | next PC | Active Mask
       null       | G       | 1 1 1 1
       G          | F       | 0 1 0 1
       G          | E       | 1 0 1 0      <- TOS

       ... and so on, until the PC reaches G
       ```
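     A minimal sketch of the stack mechanics above (Python, illustrative only; the class and
     method names are made up, and the reconvergence PCs are assumed to be known already):

       ```
       class SimtStack:
           def __init__(self, warp_size, entry_pc):
               # Each entry: [reconv_pc, next_pc, active_mask]
               self.entries = [[None, entry_pc, [1] * warp_size]]

           def top(self):
               return self.entries[-1]

           def diverge(self, reconv_pc, splits):
               # splits: list of (pc, mask); the current TOS now waits at the reconvergence point
               self.top()[1] = reconv_pc
               # Push splits with fewer active threads first, so the split with the most
               # active threads ends up at the top of the stack
               for pc, mask in sorted(splits, key=lambda s: sum(s[1])):
                   if any(mask):
                       self.entries.append([reconv_pc, pc, list(mask)])

           def advance(self, new_pc):
               # Pop when the executing split reaches its reconvergence PC,
               # otherwise just update the next PC of the TOS
               if new_pc == self.top()[0]:
                   self.entries.pop()
               else:
                   self.top()[1] = new_pc

       # Usage mirroring the example above: 4 threads diverging at the first branch
       stack = SimtStack(4, "A")
       stack.diverge("G", [("F", [0, 1, 0, 1]), ("B", [1, 0, 1, 0])])
       print(stack.top())   # ['G', 'B', [1, 0, 1, 0]] -> B runs with threads 0 and 2 active
       ```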
  2) [SIMT] Stack-less: (harder to implement ; better performance ; no deadlocks w/ yielding)
     + Add a barrier participation mask
     + Thread active bits -> track whether each thread is currently being executed or not
       => gives the groups of threads that the scheduler can decide between (group threads by the same PC)
     + Thread States -> ready to execute or blocked (no yielding yet for deadlock avoidance)
     + Reaching an "add_barrier [barrier]" instruction creates a new barrier mask that includes all the currently active threads
     + Reaching a "wait [barrier]" instruction -> execute NOPs until all threads in the barrier have reached this point as well (tracked using a barrier state field)
     + Example (a small code sketch follows it):

       ```
       Assume the following kernel:

       A;
       add_barrier b0;
       if (id % 2 == 0) {
           B;
       } else {
           C;
           add_barrier b1;
           if (id == 1) {
               D;
           } else {
               E;
           }
           wait b1;
           F;
       }
       wait b0;
       G;

       Imagine 8 threads per warp

       During A, all threads are active:

       0 1 2 3 4 5 6 7
       ---------------
       1 1 1 1 1 1 1 1   <- active threads

       Adding barrier b0 creates a new barrier participation mask:

       0 1 2 3 4 5 6 7
       ---------------
       1 1 1 1 1 1 1 1   <- b0 barrier mask
       0 0 0 0 0 0 0 0   <- b0 barrier state [initial]

       Half of the threads go to B and the other half to C => two thread groups:

       0 1 2 3 4 5 6 7
       ---------------
       1 0 1 0 1 0 1 0   <- warp split (two groups)

       Execute one thread group at a time

       Adding the new barrier b1: the active threads are the half executing C (the odd thread ids):

       0 1 2 3 4 5 6 7
       ---------------
       0 1 0 1 0 1 0 1   <- b1 barrier mask
       0 0 0 0 0 0 0 0   <- b1 barrier state [initial]

       One thread jumps to D and the other odd-id threads jump to E
       => now there are three thread groups to schedule and execute

       Reaching `wait b1`:
       threads that reach it get the state "blocked" (not executable)
       => thus, when executing a thread group, the blocked threads get "ignored"

       When the b1 barrier state == the b1 barrier mask, make all of its threads executable again
       => two thread groups remain (since there are two different PCs that threads are at)

       Similar when waiting on the b0 barrier
       ```
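     A minimal sketch of the barrier participation mask bookkeeping above (Python, illustrative;
     the Warp fields and method names are my own assumptions, not from any real design):

       ```
       class Warp:
           def __init__(self, size):
               self.size = size
               self.pc = {t: 0 for t in range(size)}           # per-thread PC
               self.blocked = {t: False for t in range(size)}  # thread state: ready / blocked
               self.barriers = {}                              # name -> (participation mask, arrived state)

           def add_barrier(self, name, active_threads):
               # New participation mask includes all currently active threads
               self.barriers[name] = (set(active_threads), set())

           def wait(self, name, thread):
               # Thread arrives at the barrier and blocks until all participants arrive
               mask, state = self.barriers[name]
               state.add(thread)
               self.blocked[thread] = True
               if state == mask:
                   # Everyone arrived: release all participants and reset the arrived state
                   for t in mask:
                       self.blocked[t] = False
                   self.barriers[name] = (mask, set())

           def schedulable_groups(self):
               # Group ready threads by PC; the scheduler picks one group per cycle
               groups = {}
               for t in range(self.size):
                   if not self.blocked[t]:
                       groups.setdefault(self.pc[t], set()).add(t)
               return groups

       # Usage mirroring the example above (8 threads, barrier b1 over the odd ids)
       w = Warp(8)
       w.add_barrier("b1", [1, 3, 5, 7])
       for t in (3, 5, 7):
           w.wait("b1", t)
       print(w.blocked[3])   # True  -> still waiting on thread 1
       w.wait("b1", 1)
       print(w.blocked[3])   # False -> barrier released
       ```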
  3) Custom Control Flow Unit (a mix of the two?):
     + branch => do another warp split
     + when the barrier mask is empty / all threads have waited at bsync ->
       -> OR (|) together the masks of the warp-split table entries whose PC == bsync+1,
          add a new entry to the warp-split table, and remove the old ones
          => no need for an rPC column ??
     + Examples:

       One branch:
       `
       A;
       binit b0;
       P0 = (id % 2 == 0);
       @P0 bra Y;
       X: B;
          jmp Z;
       Y: C;
       Z: bsync b0;
       D;
       `
       b0 = mask 1 1 1 1 and pending mask 0 0 0 0
       one entry w/ mask 1 1 1 1 (PC = the binit instruction)
       after the bra -> two entries: 0 1 0 1 (PC=X) and 1 0 1 0 (PC=Y)
       assume PC=Y executes and then reaches PC=Z (bsync) =>
       => the pending mask becomes 1 0 1 0 and that entry halts
       PC=X then executes, jumps to Z, and bsyncs b0 =>
       => the pending mask of b0 becomes 1 1 1 1
       mask 1 1 1 1 == pending mask 1 1 1 1 =>
       => all entries with PC=Z get merged into one with mask 1 0 1 0 | 0 1 0 1 = 1 1 1 1

       Nested branches:
       `
       A;
       binit b0;
       P0 = (id % 2 == 0);
       @P0 bra Y;
       X: B;
          binit b1;
          P0 = (id == 1);
          @P0 bra J;
       U: E;
          jmp H;
       J: F;
       H: bsync b1;
          jmp Z;
       Y: C;
       Z: bsync b0;
       D;
       `

       While loop:
       `
       binit b0;
       L: A;
          P0 = sth;
          @P0 bra L;
       bsync b0;
       `
       results in many splits and many 'blocked' warp entries, but works

       Spinlock:
       `
       mov mutex, 0;
       binit b0;
       Y: yield;
          P0 = atomic_take(mutex);
          @!P0 bra Y;
       X: A;
          atomic_release(mutex);
       bsync b0;
       `
       b0 -> initialized to 1 1 1 1
       yield -> just picks the same warp entry again; atomic_take gives P0 = 1 for one thread and P0 = 0 for the others
       => two warp entries: 0 1 0 0 (PC = X) and 1 0 1 1 (PC = Y)
       if PC=Y gets chosen, it will yield and become PC=Y+1
       PC=X will do A, release the lock, and wait at bsync
       eventually, all threads end up waiting at bsync

       note: if a branch targets a PC that some warp entry already has, just move the thread into that entry

* Warp Scheduling
  + Round-Robin (RR) scheduling
    > the same # of instructions executed per warp => all warps get equal attention
  + Greedy-then-Oldest (GTO) scheduling
    > when a warp stalls -> deschedule it and switch to the oldest warp
  + Choose whichever one for now, upgrade later if needed (a small sketch of both follows)
    ------ Ex: execute a warp's instructions until an async mem req is issued (w/ a cache miss)
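  A minimal sketch of the two scheduling policies (Python, illustrative; stall detection is just an
  input here, and "oldest" is approximated by warp creation order):

    ```
    from collections import deque

    class Scheduler:
        def __init__(self, num_warps, policy="gto"):
            self.policy = policy
            self.order = deque(range(num_warps))  # creation order == age order
            self.current = None                   # warp the greedy policy sticks to

        def pick(self, ready, stalled):
            # ready: set of warp ids with an issuable instruction
            # stalled: set of warp ids waiting (e.g. on an outstanding memory request)
            if self.policy == "rr":
                # Round-robin: rotate through warps, skipping ones that cannot issue
                for _ in range(len(self.order)):
                    w = self.order[0]
                    self.order.rotate(-1)
                    if w in ready and w not in stalled:
                        return w
                return None
            # GTO: stay greedy on the current warp until it stalls,
            # then fall back to the oldest warp that can issue
            if self.current in ready and self.current not in stalled:
                return self.current
            for w in self.order:  # deque order == oldest first
                if w in ready and w not in stalled:
                    self.current = w
                    return w
            return None

    # Usage: 4 warps, warp 0 runs greedily until it stalls on a memory request
    sched = Scheduler(4, policy="gto")
    print(sched.pick(ready={0, 1, 2, 3}, stalled=set()))   # 0
    print(sched.pick(ready={0, 1, 2, 3}, stalled={0}))     # 1 (oldest ready warp)
    ```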
* Instruction Flow
  + Each warp has access to an "Instruction Buffer" (I-Buf)
  + Each warp has a "Scoreboard" assigned to it with 3 or 4 entries, each entry holding the id of a register
    Ex: containing the 3 entries R1, R5, R8
  + Instructions get fetched from the Instruction Cache (I-Cache)
  + If any of the source or destination registers of an instruction matches a scoreboard entry,
    then hold the instruction inside the I-Buf but do not execute it yet
  + When the current instruction finishes executing, it clears some of the registers in the scoreboard,
    making some instructions from the buffer eligible for execution
  + When the I-Buf is full, stall / drop the fetched instruction
  + Note: to avoid structural hazards, implement instruction replay

* Operand Collector
  + When an instruction is issued, it gets decoded and its operands get collected
  + Ex: add r1 r2 r3 ->
    -> 2 regs, r2 and r3, to be read, added, and then stored inside r1
  + When the warp issues the instruction, it goes to a "collector unit"
  + For the example, assume warp w0 issues the add instruction => we get a collector unit
    [Note: operand data field = 32 (warp size) x 4 bytes (reg size: 32-bit) = 128 bytes]

    +----------------------------------------+
    |                   w0                   |
    +-----------+-----+-------+--------------+
    | valid bit | reg | ready | operand data |
    +-----------+-----+-------+--------------+
    |     1     | r2  |   1   | 1, 9, ..., 4 |
    |     1     | r3  |   0   | 3, -, ..., - |
    |     0     |  -  |   -   |      -       |
    +-----------+-----+-------+--------------+

  + When all operands are ready, the instruction is issued to the SIMD exec unit
  + To avoid WAR hazards, just allow at most one instruction per warp in the collector units
    (or add a separate Bloom-filter-based "bloom board" for less slowdown but more complexity)

* Warp Compaction
  + Idea: warps execute "in a similar manner"
    Ex: assume 10 warps executing and reaching a branch at the same time
    => each warp will have 16 threads going to branch B1 and the other half to B2
    => can rearrange the static threads into different warps s.t. 5 warps' threads execute B1 and the other 5 warps' threads execute B2
    [Called Dynamic Warp Formation]
    Issue: non-coalesced memory accesses + it can starve some threads!
  + Other methods mostly focus on thread-block-level compaction
    => sync-ing warps with each other instead of the threads within them
  + Look into software compaction (later though)
  + Warp Scalarization?
    If multiple threads operate on the same data, their output is the same => just run it on one thread

* Register File
  + Thousands of in-flight registers; can consume ~10% of total power
  + Implement a register file cache (RFC); a small sketch follows below
  + The RFC allocates a new entry (FIFO replacement) for the destination operand of each instruction
  + After an entry is kicked out, write its value back to the reg file
  + Mark registers in the RFC that are only read (dead bit) => no need to write them back to the reg file
  + Categorize warps into 'active' and 'non-active' warps; allow access to the RFC only for 'active' warps
    if a warp becomes 'inactive', remove its entries from the RFC
  + More efficient to do this with the compiler (operand register file), s.t. the compiler decides which values to move to the ORF
  + Other ways to reduce reg file area / usage are targeted more towards ASICs than FPGAs
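  A minimal sketch of the RFC idea above (Python, illustrative; the capacity, the write-back hook,
  and the dead-bit handling are my own assumptions, not a fixed spec):

    ```
    from collections import OrderedDict

    class RegisterFileCache:
        def __init__(self, capacity, main_rf):
            self.capacity = capacity
            self.entries = OrderedDict()   # reg id -> (value, dead_bit); insertion order = FIFO order
            self.main_rf = main_rf         # backing register file: dict of reg id -> value

        def write(self, reg, value, dead=False):
            # Destination operands always allocate an RFC entry
            if reg in self.entries:
                self.entries[reg] = (value, dead)
                return
            if len(self.entries) >= self.capacity:
                # FIFO eviction: write back to the main reg file unless the entry was marked dead
                old_reg, (old_val, old_dead) = self.entries.popitem(last=False)
                if not old_dead:
                    self.main_rf[old_reg] = old_val
            self.entries[reg] = (value, dead)

        def read(self, reg):
            # Hit in the RFC, otherwise fall back to the main register file
            if reg in self.entries:
                return self.entries[reg][0]
            return self.main_rf[reg]

        def flush_warp(self):
            # The warp became 'inactive': evict all of its entries
            for reg, (val, dead) in self.entries.items():
                if not dead:
                    self.main_rf[reg] = val
            self.entries.clear()

    # Usage: a 2-entry RFC in front of a tiny register file
    rf = {"r1": 0, "r2": 5, "r3": 7}
    rfc = RegisterFileCache(2, rf)
    rfc.write("r1", 12)               # allocated in the RFC, not yet in rf
    rfc.write("r4", 3, dead=True)     # dead value: never written back
    rfc.write("r5", 8)                # evicts r1 -> rf["r1"] becomes 12
    print(rf["r1"], rfc.read("r2"))   # 12 5
    ```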
===================================

+-----------------------+
|     Memory System     |
+-----------------------+

* Scratchpad/Shared Memory & L1 Data Cache
  + "scratchpad memory" refers to the memory accessible to all threads in the CTA
  + NOTE: since each SIMT Core has its own L1 cache, the L1 cache should only store "local memory" data and read-only global memory data
  + Memory access request:
    1. A mem access req is sent from the load/store unit to the L1 cache
       Mem access reqs consist of mem addrs (one for each thread in the warp) + the operation type (ld/st)
    2. If the reqs cause bank conflicts -> split the reqs into two groups: [Arbiter]
       the threads' reqs that do NOT cause bank conflicts and those that do;
       only execute the ones that do not cause conflicts and replay the instructions of the other group
    3. Memory is direct-mapped => the tag unit then determines the bank each thread will be using
    4. The data is looked up in the bank and then returned to the thread's lane via the data crossbar
    5. If a cache miss is registered, the req goes to the Pending Request Table (PRT)
    6. Reqs from the PRT go to the MMU, which translates the virt. addr to a physical one and then sends it to the L2 cache -> DRAM
       the response is sent back to the Fill Unit, which fills the L1 cache with the results and passes the info to the arbiter,
       which then re-requests the loaded value (a guaranteed hit this time)
    7. All data write reqs go to a "Write Buffer", which forwards the requests via the data crossbar to the local reg file,
       or to the MMU for global memory writes

* L1 Texture Cache
  + For now, keep this memory unified with the L1 Data Cache (since only read-only values are cached)

* On-Chip Interconnection Network
  + For high mem bandwidth, GPUs are connected to multiple DRAMs
  + Mem traffic gets distributed across the mem partitions using addr interleaving (crossbar)
    (a small interleaving sketch is at the end of these notes)

* Memory Partition Unit
  + Contains a portion of the L2 cache
  + Contains a mem access scheduler ("frame buffer")
  + Note: each mem partition is responsible for an exclusive addr range
    => the L2 cache can be written to and read from (not const data only)

===================================

+------------------------------+
| Instruction Set Architecture |
+------------------------------+

===================================

+------------------------------+
| Kernel Code Compiler for ISA |
+------------------------------+

+ Will make the assembler
+ Have never built a compiler before, so will be learning it for this project

===================================

References:
1. Main one as of now: General-Purpose Graphics Processor Architectures (2018 book)
   https://link.springer.com/book/10.1007/978-3-031-01759-9
2. GPGPU-Sim's Manual
   http://gpgpu-sim.org/manual/index.php/Main_Page
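===================================

Appendix: a minimal sketch of the address interleaving mentioned in the interconnect notes
(Python, illustrative; the 256-byte chunk granularity and 6 partitions are made-up parameters):

```
CHUNK_SIZE = 256        # consecutive 256-byte chunks go to consecutive partitions
NUM_PARTITIONS = 6

def partition_of(addr):
    # Which memory partition services this address
    return (addr // CHUNK_SIZE) % NUM_PARTITIONS

def offset_in_partition(addr):
    # Address within the partition's local address space
    chunk = addr // CHUNK_SIZE
    return (chunk // NUM_PARTITIONS) * CHUNK_SIZE + (addr % CHUNK_SIZE)

# Consecutive chunks map to different partitions, spreading traffic across the DRAMs
for addr in (0, 256, 512, 1536, 1600):
    print(hex(addr), "-> partition", partition_of(addr), "offset", offset_in_partition(addr))
```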