Diffstat (limited to 'spec/notes.txt')
-rw-r--r-- | spec/notes.txt | 96 |
1 files changed, 93 insertions, 3 deletions
diff --git a/spec/notes.txt b/spec/notes.txt
index c7e10ca..db518b9 100644
--- a/spec/notes.txt
+++ b/spec/notes.txt
@@ -22,6 +22,14 @@
+
+===================================
+
+
++-----------------------+
+|       SIMT Core       |
++-----------------------+
+
+
 * Thread Reconvergence
     Two Options:
     1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
@@ -33,6 +41,7 @@
     + TOS should be the branch with the most active threads when diverging
     + Note: seems like might need to do some post-compilation inference work ??
       (for determining return / reconv. PCs, idk yet, or maybe do it during runtime)
+    + For warp-splits, implement Multi-Path Stack, so no idleing left
     + Example:
     ```
     Assuming the following kernel:
@@ -186,17 +195,98 @@
       with 3 or 4 entries, each entry an id of a register
       Ex: containing 3 entries R1, R5, R8
     + Instruction get fetched from the Instruction Cache (I-Cache)
-    + If any of the source of destination registers of the instruction
+    + If any of the source or destination registers of the instruction
       matches with the scoreboard entries, then hold the instr.
       inside the buf but do not yet execute
     + When the current instruction finishes executing, it will clear some of
       the registers in the scoreboard, resulting in some instruction from the
       buffer to be eligible to be executed
     + When I-Buf is full, stall / drop the fetched instruction
+    + Note: to avoid structural hazards, implement instruction replay
+
+
+* Operand Collector
+    + When an instruction is issued, it gets decoded, and its operands get collected
+    + Ex: add r1 r2 r3 ->
+      -> 2 regs r2 and r3 to be read, added, and then stored inside r1
+    + When the warp issues the instruction, it goes to a "collector unit"
+    + For the example, assume warp w0 issues the add instruction => we get a collector unit
+      [Note: operand data field - 32 (warp size) x 4 bytes (reg size: 32-bit) = 128 bytes]
+      +----------------------------------------+
+      |                   w0                   |
+      +-----------+-----+-------+--------------+
+      | valid bit | reg | ready | operand data |
+      +-----------+-----+-------+--------------+
+      |     1     | r2  |   1   | 1, 9, ..., 4 |
+      |     1     | r2  |   0   | 3, -, ..., - |
+      |     0     |  -  |   -   |      -       |
+      +-----------+-----+-------+--------------+
+    + When all operands are ready, the instruction is issued to the SIMD exec unit
+    + To avoid WAR hazards, just allow unique warps for collector units
+      (or add a separate bloomboard filter for less slowdown but more complex)
+
+
+* Warp Compaction
+    + Idea: warps execute "in a same manner"
+      Ex: assume 10 warps executing and reaching a branch at the same time
+      => each warp will have 16 threads going to branch B1 and other half to B2
+      => can rearrange the static threads into different warps s.t.
+         5 warps' threads execute B1 and other 5 warps execute B2
+      [Called Dynamic Warp Formation]
+      Issue: non-coalesced memory accesses + can starve some threads!
+    + Other methods mostly focus on thread-block level compaction
+      => sync-ing warps between each other instead of threads within them
+    + Look into software compaction (later though)
+    + Warp Scalarization? If multiple threads operate on same data, their output is the same =>
+      just run on one thread
+
+
+* Register File
+    + Thousands of in-flight registers, can consume ~10% of total power
+    + Implement a register file cache (RFC)
+    + RFC allocates a new entry (FIFO replacements) for the dest operand of instructions
+    + After an entry is kicked out, write the value back to the reg file
+    + Mark registers in the RFC that are only read (dead bit) => no need to write back to reg file
+    + Categorize warps into 'active' and 'non-active' warps;
+      allow access to RFC only to 'active' warps
+      if warp becomes 'inactive', remove its entries from RFC
+    + More efficient to do with the compiler (operand register file),
+      s.t. compiler decides which values to move to OFC
+    + Other ways to reduce reg file area / usage are targeted more torwards ASICs than FPGAs
+
+
+===================================
+
+
++-----------------------+
+|     Memory System     |
++-----------------------+
+
+
+===================================
+
+
++------------------------------+
+| Instruction Set Architecture |
++------------------------------+


 ===================================

+
++------------------------------+
+| Kernel Code Compiler for ISA |
++------------------------------+
+
++ Will make the assembler
++ Have never built a compiler before, so will be learning it for this project
+
+===================================
+
+
 References:
-    + Main one as of now: General-Purpose Graphics Processor Architecture (2018) Book
-      https://link.springer.com/book/10.1007/978-3-031-01759-9
+    1. Main one as of now: General-Purpose Graphics Processor Architecture (2018) Book
+       https://link.springer.com/book/10.1007/978-3-031-01759-9
+
+    2. GPGPU-Sim's Manual
+       http://gpgpu-sim.org/manual/index.php/Main_Page
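
Note on the one modified line in the last hunk (source *or* destination registers): the hold condition it describes can be sketched in a few lines of Python. This is only an illustration under assumptions, not the project's implementation. The names `Instr`, `Scoreboard`, `can_issue`, `reserve` and `release` are hypothetical, the 4-entry capacity is taken from the "3 or 4 entries" remark in the notes, and refusing to issue when no entry is free is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded instruction: one destination, any number of sources."""
    op: str
    dst: str
    srcs: tuple


@dataclass
class Scoreboard:
    """Per-warp set of register ids that still have a write in flight."""
    capacity: int = 4  # assumed, from "3 or 4 entries" in the notes
    pending: set = field(default_factory=set)

    def can_issue(self, instr):
        # Hold the instruction if any source OR destination register matches
        # a scoreboard entry; checking only one of the two would miss hazards.
        regs = set(instr.srcs) | {instr.dst}
        if regs & self.pending:
            return False
        # Assumption of this sketch: a free entry is needed for the new dst.
        return len(self.pending) < self.capacity

    def reserve(self, instr):
        # Called when the instruction is issued for execution.
        self.pending.add(instr.dst)

    def release(self, instr):
        # Called when the instruction finishes and its result is written.
        self.pending.discard(instr.dst)


# Scoreboard state from the notes' example: entries R1, R5, R8.
sb = Scoreboard(pending={"R1", "R5", "R8"})
add = Instr("add", dst="R2", srcs=("R5", "R3"))  # reads pending R5
mul = Instr("mul", dst="R4", srcs=("R6", "R7"))  # touches no pending register
add_held = not sb.can_issue(add)
mul_ready = sb.can_issue(mul)
```

With the example state, `add` stays in the buffer because it reads R5, while `mul` is eligible. Once the instruction writing R5 calls `release`, `add` becomes eligible as well, provided a scoreboard entry is free.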