*** Research Notes for the Project ***


+---------------------------------------------------------------+
| NOTE: INCLUDES ONLY THE ABSOLUTE BASICS OF A GPU ARCHITECTURE |
+---------------------------------------------------------------+


* GPU Parts:
	1. Compute Cores
	2. Memory Controllers/Partitions
	3. Driver

* High-Level Architecture:
	1. Compute Cores
		+ threads grouped into warps (e.g. 32 threads)
		+ warps executed separately
		+ on branch divergence within a warp's threads -> [Thread Reconvergence]
		+ selecting which warp to execute next -> [Warp Scheduling]
		+ each warp can execute multiple instructions before switching -> [Instruction Flow]
	2. Memory Controllers
		+ (basics covered in the Memory System section below)


===================================


+-----------------------+
|       SIMT Core       |
+-----------------------+


* Thread Reconvergence
	Two Options (a Python sketch of each follows after the two examples below):
	1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
		+ Executes the divergent paths serially, one at a time
		+ Has an "Active Mask" -> which threads in the warp should execute the current instruction
		+ Implemented via a stack that stores | Reconv. PC | Next PC | Active Mask | columns
		+ Top of the stack (TOS) -> supplies the active mask to be used
		+ On branch divergence -> change the next PC of the current TOS to the reconvergence PC ; push two new rows, one per branch path
		+ TOS should be the branch with the most active threads when diverging
		+ Note: seems like some post-compilation analysis work may be needed
		  (for determining the reconvergence PCs ; not sure yet, or maybe do it at runtime)
		+ For warp-splits, implement a Multi-Path Stack, so no threads are left idling
		+ Example:
			```
			Assuming the following kernel:

			A;
			if (id % 2 == 0) {
				B;
				if (id == 0) {
					C;
				} else {
					D;
				}
				E;
			} else {
				F;
			}
			G;

			Assume 4 threads in a warp (ids from 0 to 3)

			Initially the stack looks like this:
			reconv. PC | next PC | Active Mask
			null       | A       | 1 1 1 1      <- TOS

			When 1st branch divergence occurs:
			reconv. PC | next PC | Active Mask
			null       | G       | 1 1 1 1      <- change A to G since that's where threads will reconv.
			G          | F       | 0 1 0 1
			G          | B       | 1 0 1 0      <- TOS

			When B finishes executing, another divergence
			reconv. PC | next PC | Active Mask
			null       | G       | 1 1 1 1
			G          | F       | 0 1 0 1
			G          | E       | 1 0 1 0      <- change to E
			E          | D       | 0 0 1 0
			E          | C       | 1 0 0 0      <- TOS

			After C finishes and PC goes to E -> pop the stack
			reconv. PC | next PC | Active Mask
			null       | G       | 1 1 1 1
			G          | F       | 0 1 0 1
			G          | E       | 1 0 1 0 
			E          | D       | 0 0 1 0      <- TOS

			D finishes; pop the stack (PC D gets to E)
			reconv. PC | next PC | Active Mask
			null       | G       | 1 1 1 1
			G          | F       | 0 1 0 1
			G          | E       | 1 0 1 0      <- TOS

			...

			and so on, until the PC reaches G

			```
	2) [SIMT] Stack-less: (harder to implement ; better performance ; no deadlocks w/ yielding)
		+ Add a barrier participation mask
		+ Thread active bits -> track whether each thread is currently being executed
		  => gives the scheduler groups of threads to pick between for execution
		  (group threads by same PC)
		+ Thread States -> ready to execute or blocked (no yielding yet for deadlock avoidance)
		+ When reaching an "add_barrier [barrier]" instruction ->
		  create a new barrier mask that includes all the currently active threads
		+ When reaching "wait [barrier]" ->
		  -> execute NOP until all threads in the barrier get to this point as well
			 (track using a barrier state field)
		+ Example:
			```
			Assume the following kernel:

			A;
			add_barrier b0;
			if (id % 2 == 0) {
				B;
			} else {
				C;
				add_barrier b1;
				if (id == 1) {
					D;
				} else {
					E;
				}
				wait b1;
				F;
			}
			wait b0;
			G;

			Assume 8 threads per warp
			During A:
			All threads are active:

			0 1 2 3 4 5 6 7
			---------------
			1 1 1 1 1 1 1 1 <- active threads

			Adding barrier b0:
			Creating a new barrier participation mask:

			0 1 2 3 4 5 6 7
			---------------
			1 1 1 1 1 1 1 1 <- b0 barrier mask
			0 0 0 0 0 0 0 0 <- b0 barrier state [initial]


			Half of threads go to B and other half to C => two thread groups

			0 1 2 3 4 5 6 7
			---------------
			1 0 1 0 1 0 1 0 <- warp split (two groups)

			Execute one thread group at a time

			Adding the new barrier b1:
			Active threads are the half executing C, with odd thread ids

			0 1 2 3 4 5 6 7
			---------------
			0 1 0 1 0 1 0 1 <- b1 barrier mask
			0 0 0 0 0 0 0 0 <- b1 barrier state [initial]

			One thread jumps to D and the other odd-id threads jump to E
			=> now there are three thread groups to schedule and execute

			Reaching `wait b1`
			Threads that reach this point -> get state blocked (not executable) =>
			=> thus, when executing a thread group, the blocked threads get "ignored"

			When the b1 barrier state == the b1 barrier mask,
			make all of its threads executable again => two thread groups remaining
			(since the threads are at two different PCs)

			Similar when waiting for the b0 barrier
			```
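
Minimal Python sketches of both options, to pin down the data structures.
Everything here (class and method names, the mask encoding with bit i = thread i)
is my own illustration of the notes above, not a settled design.
Option 1, the stack-based scheme:

```
# Sketch of the stack-based SIMT stack. Rows are [reconv_pc, next_pc,
# active_mask]; the last row is the TOS. PCs are just labels here.

class SimtStack:
    def __init__(self, num_threads, entry_pc):
        full_mask = (1 << num_threads) - 1
        self.rows = [[None, entry_pc, full_mask]]

    def active_mask(self):
        # The TOS supplies the active mask for the current instruction.
        return self.rows[-1][2]

    def diverge(self, reconv_pc, paths):
        # paths: (target_pc, mask) pairs, one per branch outcome.
        # Retarget the TOS to the reconvergence PC, then push the paths so
        # the one with the most active threads ends up on top.
        self.rows[-1][1] = reconv_pc
        for pc, mask in sorted(paths, key=lambda p: bin(p[1]).count("1")):
            if mask:
                self.rows.append([reconv_pc, pc, mask])

    def advance(self, next_pc):
        # After executing for the TOS path: pop on reaching the
        # reconvergence PC, otherwise just update the TOS's next PC.
        if next_pc == self.rows[-1][0]:
            self.rows.pop()
        else:
            self.rows[-1][1] = next_pc

# Replaying the 4-thread example: the first divergence after A
# ("1 0 1 0" in the notes = threads 0 and 2 = 0b0101 here).
stack = SimtStack(4, "A")
stack.diverge("G", [("F", 0b1010), ("B", 0b0101)])  # TOS is now the B path
```

Option 2, the stack-less barrier scheme:

```
# Sketch of the stack-less scheme: barrier participation masks plus
# per-thread ready/blocked state; the scheduler picks among groups of
# ready threads that share the same PC.

class BarrierWarp:
    def __init__(self, num_threads):
        self.n = num_threads
        self.pc = ["A"] * num_threads      # per-thread PC (labels here)
        self.ready = [True] * num_threads  # thread state bits
        self.mask = {}                     # barrier -> participation mask
        self.state = {}                    # barrier -> arrival mask

    def add_barrier(self, name, active_mask):
        # add_barrier: snapshot the currently active threads into the mask.
        self.mask[name] = active_mask
        self.state[name] = 0

    def wait(self, name, tid):
        # wait: block the thread unless it is the last participant to
        # arrive, in which case release every thread in the barrier.
        self.state[name] |= 1 << tid
        if self.state[name] == self.mask[name]:
            for t in range(self.n):
                if self.mask[name] >> t & 1:
                    self.ready[t] = True
        else:
            self.ready[tid] = False

    def groups(self):
        # Thread groups the scheduler can pick from: ready threads, by PC.
        out = {}
        for t in range(self.n):
            if self.ready[t]:
                out[self.pc[t]] = out.get(self.pc[t], 0) | 1 << t
        return out
```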


* Warp Scheduling
	+ Round-Robin (RR) scheduling
		> same # of instructions executed per warp => all warps get equal attention
	+ Greedy-then-Oldest (GTO) scheduling
		> run the current warp until it stalls -> then switch to the oldest ready warp
	+ Choose whichever one for now, upgrade later if needed (both sketched below)
	------
	  Ex: execute a warp's instructions until it issues an async mem req (w/ a cache miss)
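
A rough sketch of both policies; the warp object's `stalled` and `age`
fields are assumptions for illustration:

```
# Sketch of the two scheduling policies. `stalled` and `age` are assumed
# fields on a warp object (age = cycles since the warp was launched).

def round_robin(warps, last_issued):
    # RR: rotate through the warps, skipping ones that cannot issue.
    n = len(warps)
    for i in range(1, n + 1):
        w = warps[(last_issued + i) % n]
        if not w.stalled:
            return w
    return None

def greedy_then_oldest(warps, current):
    # GTO: keep issuing from the current warp until it stalls (e.g. it
    # issued an async mem req that missed the cache), then pick the
    # oldest ready warp.
    if current is not None and not current.stalled:
        return current
    ready = [w for w in warps if not w.stalled]
    return max(ready, key=lambda w: w.age) if ready else None
```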


* Instruction Flow
	+ Each warp has access to an "Instruction Buffer" (I-Buf)
	+ Each warp has a "Scoreboard" assigned to it (sketched below)
	  with 3 or 4 entries, each entry holding the id of a register
	  Ex: containing 3 entries R1, R5, R8
	+ Instructions get fetched from the Instruction Cache (I-Cache)
	+ If any of the source or destination registers of the instruction
	  matches a scoreboard entry, then hold the instr. inside the buf
	  but do not execute it yet
	+ When the current instruction finishes executing, it clears its
	  register from the scoreboard, making some instructions in the buffer
	  eligible for execution
	+ When the I-Buf is full, stall / drop the fetched instruction
	+ Note: to avoid structural hazards, implement instruction replay
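
A small sketch of the scoreboard check above; the entry count comes from
the notes, everything else is an illustrative assumption:

```
# Sketch of a per-warp scoreboard: a few entries holding the destination
# register ids of in-flight instructions.

class Scoreboard:
    def __init__(self, num_entries=4):
        self.entries = set()           # register ids with pending writes
        self.num_entries = num_entries

    def can_issue(self, srcs, dst):
        # Hold the instruction in the I-Buf if any source or destination
        # register matches an entry, or if there is no room to reserve
        # the destination.
        if (set(srcs) | {dst}) & self.entries:
            return False
        return len(self.entries) < self.num_entries

    def issue(self, dst):
        self.entries.add(dst)          # reserve the destination register

    def writeback(self, dst):
        # A finishing instruction clears its entry, possibly making
        # buffered instructions eligible for issue.
        self.entries.discard(dst)
```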


* Operand Collector
	+ When an instruction is issued, it gets decoded, and its operands get collected
	+ Ex: add r1 r2 r3 ->
	      -> 2 regs r2 and r3 to be read, added, and the result stored in r1
	+ When the warp issues the instruction, it goes to a "collector unit" (sketched below)
	+ For the example, assume warp w0 issues the add instruction => we get a collector unit
	  [Note: operand data field - 32 (warp size) x 4 bytes (reg size: 32-bit) = 128 bytes]
	  +----------------------------------------+
	  |                   w0                   |
	  +-----------+-----+-------+--------------+
	  | valid bit | reg | ready | operand data |
	  +-----------+-----+-------+--------------+
	  |     1     | r2  | 1     | 1, 9, ..., 4 |
	  |     1     | r3  | 0     | 3, -, ..., - |
	  |     0     | -   | -     | -            |
	  +-----------+-----+-------+--------------+
	+ When all operands are ready, the instruction is issued to the SIMD exec unit
	+ To avoid WAR hazards, just allow at most one instruction per warp in the collector units
	  (or add a separate "bloomboard", a small Bloom filter tracking pending register reads,
	   for less slowdown but more complexity)
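
A sketch of one collector unit, mirroring the w0 table above (field names
from the table; the `fill` interface is an assumption):

```
# Sketch of a collector unit: one slot per source operand with
# valid/reg/ready fields and per-lane operand data.

class CollectorUnit:
    def __init__(self, warp_id, src_regs, warp_size=32):
        self.warp_id = warp_id
        self.slots = [{"valid": True, "reg": r, "ready": False,
                       "data": [None] * warp_size} for r in src_regs]

    def fill(self, reg, lane_values):
        # A register file bank read completed: fill the matching slot.
        for s in self.slots:
            if s["valid"] and s["reg"] == reg and not s["ready"]:
                s["data"] = lane_values
                s["ready"] = True
                return

    def dispatch_ready(self):
        # Issue to the SIMD exec unit once every valid operand is ready.
        return all(s["ready"] for s in self.slots if s["valid"])

# e.g. for `add r1 r2 r3` issued by warp w0:
cu = CollectorUnit(warp_id=0, src_regs=["r2", "r3"])
```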


* Warp Compaction
	+ Idea: warps tend to execute "in the same manner"
	  Ex: assume 10 warps (of 32 threads each) executing and reaching a branch at the same time
	   => each warp has 16 threads going to branch B1 and the other 16 going to B2
	   => can rearrange the threads into different warps s.t.
	      5 warps' worth of threads execute B1 and the other 5 warps execute B2
	   [Called Dynamic Warp Formation; see the sketch below]
	   Issue: non-coalesced memory accesses + can starve some threads!
	+ Other methods mostly focus on thread-block-level compaction
	  => sync-ing warps with each other instead of threads within them
	+ Look into software compaction (later though)
	+ Warp Scalarization? If multiple threads operate on the same data, their output is the same =>
	  just run it on one thread
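
A toy sketch of the regrouping step of Dynamic Warp Formation; real DWF
also has to keep each thread in its home SIMD lane (a register file
constraint this sketch ignores):

```
# Regroup diverged threads by target PC so the new warps are (mostly)
# full. Regrouping breaks the consecutive-thread-id pattern that memory
# coalescing relies on, which is the non-coalescing issue noted above.

from collections import defaultdict, namedtuple

Thread = namedtuple("Thread", "warp_id lane pc")

def form_warps(threads, warp_size=32):
    by_pc = defaultdict(list)
    for t in threads:
        by_pc[t.pc].append(t)
    new_warps = []
    for pc, group in by_pc.items():
        for i in range(0, len(group), warp_size):
            new_warps.append((pc, group[i:i + warp_size]))
    return new_warps
```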


* Register File
	+ Thousands of in-flight registers ; can consume ~10% of total GPU power
	+ Implement a register file cache (RFC), as sketched below
	+ The RFC allocates a new entry (FIFO replacement) for the dest operand of each instruction
	+ When an entry is evicted, write its value back to the reg file
	+ Mark RFC entries whose value will not be read again (dead bit) => no need to write them back to the reg file
	+ Categorize warps into 'active' and 'inactive' warps;
	  allow RFC access only to 'active' warps
	  if a warp becomes 'inactive', remove its entries from the RFC
	+ More efficient to do with the compiler (operand register file, ORF),
	  s.t. the compiler decides which values to move into the ORF
	+ Other ways to reduce reg file area / usage are targeted more towards ASICs than FPGAs
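
A sketch of the RFC with FIFO replacement and a dead bit; the `writeback`
callback into the main register file is an assumption:

```
# Small register file cache: FIFO replacement, dead values are dropped
# instead of written back, and an inactive warp's entries get flushed.

from collections import OrderedDict

class RegisterFileCache:
    def __init__(self, capacity, writeback):
        self.capacity = capacity
        self.writeback = writeback     # callback into the main reg file
        self.entries = OrderedDict()   # reg -> {"value": ..., "dead": ...}

    def write(self, reg, value, dead=False):
        # Allocate an entry for a destination operand; on overflow, evict
        # the oldest entry, writing it back only if it is still live.
        if reg not in self.entries and len(self.entries) >= self.capacity:
            victim, e = self.entries.popitem(last=False)
            if not e["dead"]:
                self.writeback(victim, e["value"])
        self.entries[reg] = {"value": value, "dead": dead}

    def flush(self):
        # The warp became 'inactive': evict all of its live entries.
        for reg, e in self.entries.items():
            if not e["dead"]:
                self.writeback(reg, e["value"])
        self.entries.clear()
```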


===================================


+-----------------------+
|     Memory System     |
+-----------------------+


* Scratchpad/Shared Memory & L1 Data Cache
	+ "scratchpad memory" refers to the memory accessible to all threads in the same CTA
	+ NOTE: since each SIMT Core has its own L1 cache,
	  the L1 cache should only store "local memory" data and global read-only mem data
	+ Memory access request:
		1. A mem access req is sent from the load/store unit to the L1 cache
		   Mem access reqs consist of mem addrs (one per thread in the warp) + the operation type (ld/st)
		2. If the reqs cause bank conflicts -> the [Arbiter] splits the reqs into two groups (sketched below):
		   threads' reqs that do NOT cause bank conflicts, and those that do;
		   only the conflict-free group executes, and the instruction is replayed for the other group
		3. The memory is direct-mapped => the tag unit then determines the bank each thread will be using
		4. The data is looked up in the bank and then returned to the thread's lane via the data crossbar
		5. On a cache miss, the req goes to the Pending Request Table (PRT)
		6. Reqs from the PRT go to the MMU, which translates the virt. addr to a physical one
		   and then sends it to the L2 cache -> DRAM
		   the response is sent back to the Fill Unit, which fills the L1 cache with the results
		   and notifies the arbiter, which then re-requests the loaded value (a guaranteed hit this time)
		7. All data write reqs go to a "Write Buffer", which forwards them via the
		   data crossbar to the local reg file, or to the MMU for global memory writes
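
A sketch of the arbiter's split in step 2; the word-granularity bank
mapping is an assumption:

```
# Accept at most one request per bank this cycle; conflicting requests
# are set aside so the instruction can be replayed for them.

def split_by_bank(requests, num_banks=32):
    accepted, replay = [], []
    claimed = set()
    for tid, addr in requests:           # one (tid, addr) per thread
        bank = (addr // 4) % num_banks   # 4-byte words striped over banks
        if bank in claimed:
            replay.append((tid, addr))   # bank conflict -> replay later
        else:
            claimed.add(bank)
            accepted.append((tid, addr))
    return accepted, replay
```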


* L1 Texture Cache
	+ For now, unify it with the L1 Data Cache (since only read-only values are cached)


* On-Chip Interconnection Network
	+ For high mem bandwidth, GPUs are connected to multiple DRAM chips
	+ Mem traffic gets distributed across the mem partitions using addr interleaving (via a crossbar), as sketched below
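
The interleaving itself is tiny; the 256-byte chunk size below is an
assumption:

```
# Consecutive chunks of the address space map to successive partitions,
# e.g. addrs 0..255 -> partition 0, 256..511 -> partition 1, ...

def partition_of(addr, num_partitions=8, chunk=256):
    return (addr // chunk) % num_partitions
```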


* Memory Partition Unit
	+ Contains a portion of the L2 cache
	+ Contains a mem access scheduler ("frame buffer")
	+ Note: each mem partition is responsible for an exclusive addr range
	  => the L2 cache can be both written to and read from (no need to restrict it to const data only)


===================================


+------------------------------+
| Instruction Set Architecture |
+------------------------------+


===================================


+------------------------------+
| Kernel Code Compiler for ISA |
+------------------------------+

+ Will make the assembler
+ Have never built a compiler before, so will be learning it for this project


===================================


References:
	1. Main one as of now: the General-Purpose Graphics Processor Architectures (2018) book
	   https://link.springer.com/book/10.1007/978-3-031-01759-9

	2. GPGPU-Sim's Manual
	   http://gpgpu-sim.org/manual/index.php/Main_Page