add notes for operand collector, warp compaction, reg file

author: Davit Grigoryan <[email protected]> 2024-09-29 14:29:19 -0700
committer: Davit Grigoryan <[email protected]> 2024-09-29 14:29:19 -0700
commit: 4f7f19f1f42871c080b9450394dd828e14953e82 (patch)
tree: 97605a2355dfa81cf50398b1023569e17bb3ea9c /spec
parent: 6daa51bb42c6f26a8159a1b5dfd2cdca820e24de (diff)
1 files changed, 93 insertions, 3 deletions
diff --git a/spec/notes.txt b/spec/notes.txt
index c7e10ca..db518b9 100644
--- a/spec/notes.txt
+++ b/spec/notes.txt
@@ -22,6 +22,14 @@
 		+ 
 
 
+===================================
+
+
++-----------------------+
+|       SIMT Core       |
++-----------------------+
+
+
 * Thread Reconvergence
 	Two Options:
 	1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
@@ -33,6 +41,7 @@
 		+ TOS should be the branch with the most active threads when diverging
 		+ Note: seems like might need to do some post-compilation inference work ??
 		  (for determining return / reconv. PCs, idk yet, or maybe do it during runtime)
+		+ For warp-splits, implement a Multi-Path Stack, so that no threads are left idling
 		+ Example:
 			```
 			Assuming the following kernel:
@@ -186,17 +195,98 @@
 	  with 3 or 4 entries, each entry an id of a register
 	  Ex: containing 3 entries R1, R5, R8
 	+ Instruction get fetched from the Instruction Cache (I-Cache)
-	+ If any of the source of destination registers of the instruction
+	+ If any of the source or destination registers of the instruction
 	  matches with the scoreboard entries, then hold the instr. inside the buf
 	  but do not yet execute
 	+ When the current instruction finishes executing, it will clear some of the
 	  registers in the scoreboard, resulting in some instruction from the buffer
 	  to be eligible to be executed
 	+ When I-Buf is full, stall / drop the fetched instruction
+	+ Note: to avoid structural hazards, implement instruction replay
+
+
+* Operand Collector
+	+ When an instruction is issued, it gets decoded, and its operands get collected
+	+ Ex: add r1 r2 r3 ->
+	      -> 2 regs r2 and r3 to be read, added, and then stored inside r1
+	+ When the warp issues the instruction, it goes to a "collector unit"
+	+ For the example, assume warp w0 issues the add instruction => we get a collector unit
+	  [Note: operand data field - 32 (warp size) x 4 bytes (reg size: 32-bit) = 128 bytes]
+	  +----------------------------------------+
+	  |                   w0                   |
+	  +-----------+-----+-------+--------------+
+	  | valid bit | reg | ready | operand data |
+	  +-----------+-----+-------+--------------+
+	  |     1     | r2  | 1     | 1, 9, ..., 4 |
+	  |     1     | r3  | 0     | 3, -, ..., - |
+	  |     0     | -   | -     | -            |
+	  +-----------+-----+-------+--------------+
+	+ When all operands are ready, the instruction is issued to the SIMD exec unit
+	+ To avoid WAR hazards, allow at most one instruction per warp in the collector units
+	  (or add a separate bloom-filter "bloomboard": less slowdown, but more complex)
+
+
+* Warp Compaction
+	+ Idea: warps often execute in the same manner
+	  Ex: assume 10 warps executing and reaching a branch at the same time
+	   => say each warp has 16 threads going to branch B1 and the other 16 to B2
+	   => can regroup threads from their static warps into new warps s.t.
+	      5 warps' threads execute B1 and the other 5 warps execute B2
+	   [Called Dynamic Warp Formation]
+	   Issue: non-coalesced memory accesses + can starve some threads!
+	+ Other methods mostly focus on thread-block level compaction
+	  => sync-ing warps between each other instead of threads within them
+	+ Look into software compaction (later though)
+	+ Warp Scalarization? If multiple threads operate on the same data, their output is the same =>
+	  just run it on one thread
+
+
+* Register File
+	+ Thousands of in-flight registers, can consume ~10% of total power
+	+ Implement a register file cache (RFC)
+	+ RFC allocates a new entry (FIFO replacements) for the dest operand of instructions
+	+ After an entry is kicked out, write the value back to the reg file
+	+ Mark RFC entries whose value has been read for the last time (dead bit) => no need to write them back to reg file
+	+ Categorize warps into 'active' and 'non-active' warps;
+	  allow access to RFC only to 'active' warps
+	  if warp becomes 'inactive', remove its entries from RFC
+	+ More efficient to do with the compiler (operand register file, ORF),
+	  s.t. compiler decides which values to move to the ORF
+	+ Other ways to reduce reg file area / usage are targeted more towards ASICs than FPGAs
+
+
+===================================
+
+
++-----------------------+
+|     Memory System     |
++-----------------------+
+
+
+===================================
+
+
++------------------------------+
+| Instruction Set Architecture |
++------------------------------+
 
 
 ===================================
 
+
++------------------------------+
+| Kernel Code Compiler for ISA |
++------------------------------+
+
++ Will make the assembler
++ Have never built a compiler before, so will be learning it for this project
+
+===================================
+
+
 References:
-	+ Main one as of now: General-Purpose Graphics Processor Architecture (2018) Book
-	  https://link.springer.com/book/10.1007/978-3-031-01759-9
+	1. Main one as of now: General-Purpose Graphics Processor Architecture (2018) Book
+	   https://link.springer.com/book/10.1007/978-3-031-01759-9
+
+	2. GPGPU-Sim's Manual
+	   http://gpgpu-sim.org/manual/index.php/Main_Page
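
Note on the Register File section above: the RFC bullets (FIFO allocation for
dest operands, write-back on eviction, dead bit skipping the write-back) can be
sketched as a small software model. This is only a rough sketch under assumptions
that are not in the notes: the class name RegisterFileCache, the last_use flag
standing in for the dead-bit annotation, the flush() helper for a warp going
inactive (assumed to write live values back), and the 2-entry capacity are all
hypothetical.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Toy FIFO register file cache (RFC) in front of a main register file."""

    def __init__(self, capacity, main_rf):
        self.capacity = capacity
        self.main_rf = main_rf        # dict: reg name -> value (backing reg file)
        self.entries = OrderedDict()  # reg name -> [value, dead]; insertion order = FIFO order
        self.writebacks = 0           # number of values written back to main_rf

    def write(self, reg, value):
        """Allocate an entry for a destination operand; evict the oldest if full."""
        if reg not in self.entries and len(self.entries) >= self.capacity:
            self._evict_oldest()
        # Assigning to an existing key keeps its FIFO position.
        self.entries[reg] = [value, False]

    def read(self, reg, last_use=False):
        """Read a source operand; last_use=True sets the entry's dead bit."""
        if reg in self.entries:
            entry = self.entries[reg]
            if last_use:
                entry[1] = True
            return entry[0]
        return self.main_rf[reg]      # RFC miss: fall back to the main reg file

    def _evict_oldest(self):
        reg, (value, dead) = self.entries.popitem(last=False)
        if not dead:                  # dead values are never read again: skip write-back
            self.main_rf[reg] = value
            self.writebacks += 1

    def flush(self):
        """Warp became inactive: drop its entries, writing live values back."""
        while self.entries:
            self._evict_oldest()


main_rf = {"r1": 0, "r2": 0, "r3": 0}
rfc = RegisterFileCache(capacity=2, main_rf=main_rf)
rfc.write("r1", 10)
rfc.write("r2", 20)
value_r1 = rfc.read("r1", last_use=True)   # last read of r1 => dead bit set
rfc.write("r3", 30)                        # RFC full: evicts r1 without a write-back
writebacks_after_evict = rfc.writebacks
rfc.flush()                                # r2 and r3 are still live => written back
```

In this model the eviction of r1 costs no main reg file write because its dead
bit was set, while the flush writes back r2 and r3.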