author Davit Grigoryan <[email protected]> 2024-09-29 14:29:19 -0700
committer Davit Grigoryan <[email protected]> 2024-09-29 14:29:19 -0700
commit 4f7f19f1f42871c080b9450394dd828e14953e82
tree 97605a2355dfa81cf50398b1023569e17bb3ea9c /spec
parent 6daa51bb42c6f26a8159a1b5dfd2cdca820e24de
add notes for operand collector, warp compaction, reg file
Diffstat (limited to 'spec')
-rw-r--r-- spec/notes.txt 96
1 files changed, 93 insertions, 3 deletions
diff --git a/spec/notes.txt b/spec/notes.txt
index c7e10ca..db518b9 100644
--- a/spec/notes.txt
+++ b/spec/notes.txt
@@ -22,6 +22,14 @@
+
+===================================
+
+
++-----------------------+
+| SIMT Core |
++-----------------------+
+
+
* Thread Reconvergence
Two Options:
1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
@@ -33,6 +41,7 @@
+ TOS should be the branch with the most active threads when diverging
+ Note: seems like might need to do some post-compilation inference work ??
(for determining return / reconv. PCs, idk yet, or maybe do it during runtime)
+ + For warp-splits, implement a Multi-Path Stack so that no threads are left idling
+ Example:
```
Assuming the following kernel:
@@ -186,17 +195,98 @@
with 3 or 4 entries, each entry an id of a register
Ex: containing 3 entries R1, R5, R8
+ Instructions get fetched from the Instruction Cache (I-Cache)
- + If any of the source of destination registers of the instruction
+ + If any of the source or destination registers of the instruction
matches with the scoreboard entries, then hold the instr. inside the buf
but do not yet execute
+ When the current instruction finishes executing, it will clear some of the
registers in the scoreboard, making some instruction in the buffer
eligible to be executed
+ When I-Buf is full, stall / drop the fetched instruction
+ + Note: to avoid structural hazards, implement instruction replay
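
The scoreboard / I-Buf interaction described above can be sketched roughly as follows. This is only a minimal behavioral model, not RTL and not the design this spec commits to: the class names (`Scoreboard`, `IBuffer`), the `(op, dst, srcs)` tuple layout, the default sizes, and the choice to issue strictly in order are all assumptions made for illustration.

```python
from collections import deque


class Scoreboard:
    """Tracks registers with a pending write for one warp (assumed 3-4 entries)."""

    def __init__(self, max_entries=4):
        self.max_entries = max_entries
        self.pending = set()

    def has_hazard(self, srcs, dst):
        # Hazard if any source or destination register is still pending.
        return any(reg in self.pending for reg in (*srcs, dst))

    def can_reserve(self):
        return len(self.pending) < self.max_entries

    def reserve(self, dst):
        self.pending.add(dst)

    def release(self, dst):
        # Called when the instruction writing `dst` finishes executing.
        self.pending.discard(dst)


class IBuffer:
    """Small per-warp instruction buffer; instructions are (op, dst, srcs)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.slots = deque()

    def push(self, instr):
        if len(self.slots) >= self.capacity:
            return False  # buffer full: the caller stalls or drops the fetch
        self.slots.append(instr)
        return True

    def issue(self, scoreboard):
        # Assumption: in-order issue, so only the oldest entry is considered.
        if not self.slots:
            return None
        _op, dst, srcs = self.slots[0]
        if scoreboard.has_hazard(srcs, dst) or not scoreboard.can_reserve():
            return None  # hold the instruction in the buffer
        scoreboard.reserve(dst)
        return self.slots.popleft()
```

For example, with `add r1 r2 r3` followed by `mul r4 r1 r5`, the `mul` stays in the buffer until `release("r1")` is called for the `add`.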
+
+
+* Operand Collector
+ + When an instruction is issued, it gets decoded, and its operands get collected
+ + Ex: add r1 r2 r3 ->
+ -> 2 regs r2 and r3 to be read, added, and then stored inside r1
+ + When the warp issues the instruction, it goes to a "collector unit"
+ + For the example, assume warp w0 issues the add instruction => we get a collector unit
+ [Note: operand data field - 32 (warp size) x 4 bytes (reg size: 32-bit) = 128 bytes]
+ +----------------------------------------+
+ | w0 |
+ +-----------+-----+-------+--------------+
+ | valid bit | reg | ready | operand data |
+ +-----------+-----+-------+--------------+
+ | 1 | r2 | 1 | 1, 9, ..., 4 |
+ | 1 | r3 | 0 | 3, -, ..., - |
+ | 0 | - | - | - |
+ +-----------+-----+-------+--------------+
+ + When all operands are ready, the instruction is issued to the SIMD exec unit
+ + To avoid WAR hazards, allow at most one collector unit per warp
+ (or add a separate bloomboard filter: less slowdown, but more complex)
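
A rough behavioral sketch of the collector unit described above, assuming the simple "one collector unit per warp" rule for WAR hazards. The names (`CollectorUnit`, `OperandCollector`, `try_allocate`, `deliver`, `dispatch`) and the number of units are hypothetical; register file banking and arbitration are not modeled.

```python
WARP_SIZE = 32
REG_BYTES = 4  # 32-bit registers
OPERAND_DATA_BYTES = WARP_SIZE * REG_BYTES  # 32 x 4 = 128 bytes per operand


class CollectorUnit:
    MAX_OPERANDS = 3  # matches the three rows in the table above

    def __init__(self):
        self.warp = None
        self.entries = []

    def is_free(self):
        return self.warp is None

    def allocate(self, warp, src_regs):
        assert self.is_free() and len(src_regs) <= self.MAX_OPERANDS
        self.warp = warp
        # One entry per source operand: register id, ready bit, per-lane data.
        self.entries = [{"reg": r, "ready": False, "data": None} for r in src_regs]

    def deliver(self, reg, lane_values):
        # Register file read returns: fill every entry waiting on this register.
        assert len(lane_values) == WARP_SIZE
        for entry in self.entries:
            if entry["reg"] == reg and not entry["ready"]:
                entry["data"] = list(lane_values)
                entry["ready"] = True

    def ready(self):
        return self.warp is not None and all(e["ready"] for e in self.entries)

    def dispatch(self):
        # Hand the collected operands to the SIMD exec unit and free the unit.
        assert self.ready()
        operands = [e["data"] for e in self.entries]
        self.warp, self.entries = None, []
        return operands


class OperandCollector:
    def __init__(self, num_units=4):
        self.units = [CollectorUnit() for _ in range(num_units)]

    def try_allocate(self, warp, src_regs):
        # Assumed WAR rule: a warp may occupy at most one collector unit.
        if any(u.warp == warp for u in self.units):
            return None
        for unit in self.units:
            if unit.is_free():
                unit.allocate(warp, src_regs)
                return unit
        return None  # no free unit: the instruction has to wait
```

For `add r1 r2 r3` issued by w0, the unit is allocated with `["r2", "r3"]`, becomes ready once both registers have been delivered, and only then can w0 get another collector unit.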
+
+
+* Warp Compaction
+ + Idea: warps execute "in the same manner"
+ Ex: assume 10 warps executing and reaching a branch at the same time
+ => each warp will have 16 threads going to branch B1 and the other 16 to B2
+ => can rearrange the static threads into different warps s.t.
+ 5 warps' threads execute B1 and other 5 warps execute B2
+ [Called Dynamic Warp Formation]
+ Issue: non-coalesced memory accesses + can starve some threads!
+ + Other methods mostly focus on thread-block level compaction
+ => sync-ing warps between each other instead of threads within them
+ + Look into software compaction (later though)
+ + Warp Scalarization? If multiple threads operate on same data, their output is the same =>
+ just run on one thread
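
The Dynamic Warp Formation example above (10 warps, each split 16/16 at a branch) can be illustrated with a small sketch. It only shows the regrouping idea: `form_warps` is a hypothetical helper, and the sketch ignores which SIMD lane a thread originally occupied, which real hardware may need to respect.

```python
from collections import defaultdict

WARP_SIZE = 32


def form_warps(threads):
    """Regroup threads by next PC.

    threads: iterable of (thread_id, next_pc)
    returns: list of (pc, [thread_ids]) with at most WARP_SIZE threads each
    """
    by_pc = defaultdict(list)
    for tid, pc in threads:
        by_pc[pc].append(tid)

    warps = []
    for pc in sorted(by_pc):
        tids = by_pc[pc]
        # Pack threads that share a PC into full warps where possible.
        for i in range(0, len(tids), WARP_SIZE):
            warps.append((pc, tids[i:i + WARP_SIZE]))
    return warps


# 10 warps x 32 threads; lanes 0-15 take B1, lanes 16-31 take B2.
threads = [(w * WARP_SIZE + lane, "B1" if lane < 16 else "B2")
           for w in range(10) for lane in range(WARP_SIZE)]
compacted = form_warps(threads)
```

Here `compacted` holds 5 full warps for B1 and 5 for B2, instead of 10 half-empty warps per branch. Note that each new warp mixes threads from different original warps, which is where the non-coalesced memory access issue comes from.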
+
+
+* Register File
+ + Thousands of in-flight registers, can consume ~10% of total power
+ + Implement a register file cache (RFC)
+ + RFC allocates a new entry (FIFO replacement) for the dest operand of each instruction
+ + After an entry is kicked out, write the value back to the reg file
+ + Mark RFC entries whose value will not be read again (dead bit) => no need to write back to reg file
+ + Categorize warps into 'active' and 'inactive' warps;
+ allow access to RFC only to 'active' warps
+ if a warp becomes 'inactive', remove its entries from RFC
+ + More efficient to do with the compiler (operand register file, ORF),
+ s.t. the compiler decides which values to move to the ORF
+ + Other ways to reduce reg file area / usage are targeted more towards ASICs than FPGAs
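
A minimal sketch of the register file cache behavior listed above: FIFO allocation on destination writes, write-back on eviction, dead-bit elision, and flushing when a warp goes inactive. The class name, the method names, the `last_use` flag (standing in for a dead bit carried by the instruction), and the capacity are assumptions; writing live values back when a warp is deactivated is also an assumed policy.

```python
from collections import OrderedDict


class RegisterFileCache:
    def __init__(self, capacity=6):
        self.capacity = capacity
        self.entries = OrderedDict()  # (warp, reg) -> [value, dead]
        self.main_rf = {}             # backing register file
        self.writebacks = 0

    def _evict(self, key):
        value, dead = self.entries.pop(key)
        if not dead:
            # Only live values are written back to the main register file.
            self.main_rf[key] = value
            self.writebacks += 1

    def write(self, warp, reg, value):
        key = (warp, reg)
        if key in self.entries:
            self.entries[key] = [value, False]  # keeps its FIFO position
            return
        if len(self.entries) >= self.capacity:
            self._evict(next(iter(self.entries)))  # oldest entry first (FIFO)
        self.entries[key] = [value, False]

    def read(self, warp, reg, last_use=False):
        key = (warp, reg)
        if key in self.entries:
            entry = self.entries[key]
            if last_use:
                entry[1] = True  # dead bit: value will not be read again
            return entry[0]
        return self.main_rf[key]  # RFC miss: read the main register file

    def deactivate(self, warp):
        # Warp became inactive: remove its entries, writing back live values.
        for key in [k for k in self.entries if k[0] == warp]:
            self._evict(key)
```

In this model the write-back counter gives a rough feel for how many main register file writes the dead bit saves for a given instruction trace.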
+
+
+===================================
+
+
++-----------------------+
+| Memory System |
++-----------------------+
+
+
+===================================
+
+
++------------------------------+
+| Instruction Set Architecture |
++------------------------------+
===================================
+
++------------------------------+
+| Kernel Code Compiler for ISA |
++------------------------------+
+
++ Will make the assembler
++ Have never built a compiler before, so will be learning it for this project
+
+===================================
+
+
References:
- + Main one as of now: General-Purpose Graphics Processor Architecture (2018) Book
- https://link.springer.com/book/10.1007/978-3-031-01759-9
+ 1. Main one as of now: General-Purpose Graphics Processor Architecture (2018) Book
+ https://link.springer.com/book/10.1007/978-3-031-01759-9
+
+ 2. GPGPU-Sim's Manual
+ http://gpgpu-sim.org/manual/index.php/Main_Page