*** Research Notes for the Project ***
+---------------------------------------------------------------+
| NOTE: INCLUDES ONLY THE ABSOLUTE BASICS OF A GPU ARCHITECTURE |
+---------------------------------------------------------------+
* GPU Parts:
1. Compute Cores
2. Memory Controllers/Partitions
3. Driver
* High-Level Architecture:
1. Compute Cores
+ threads grouped into warps (e.g. 32 threads)
+ warps executed separately
+ when a warp's threads diverge at a branch -> [Thread Reconvergence]
+ for selecting which warp to execute -> [Warp Scheduling]
+ each warp can execute multiple instructions before switching -> [Instruction Flow]
2. Memory Controllers
+
* Thread Reconvergence
Two Options:
1) [SIMT] Stack-based: (easier to implement ; potential deadlocks)
+ Executes the divergent paths serially, one path at a time
+ Has an "Active Mask" -> which threads in the warp should execute the instruction
+ Implemented via a stack that stores | Reconv. PC | Next PC | Active Mask | columns
+ Top of the stack -> the active mask to be used
+ When branch divergence -> change next PC of current TOS ; add two new rows for the new branches
+ TOS should be the branch with the most active threads when diverging
+ Note: the reconv. PC is the immediate post-dominator of the branch, so it
likely needs some post-compilation analysis to determine, or could be
resolved at runtime (a Python sketch of the whole mechanism follows the example)
+ Example:
```
Assuming the following kernel:
A;
if (id % 2 == 0) {
B;
if (id == 0) {
C;
} else {
D;
}
E;
} else {
F;
}
G;
Assume 4 threads in a warp (ids from 0 to 3)
Initially the stack looks like this:
reconv. PC | next PC | Active Mask
null | A | 1 1 1 1 <- TOS
When 1st branch divergence occurs:
reconv. PC | next PC | Active Mask
null | G | 1 1 1 1 <- change A to G since that's where threads will reconv.
G | F | 0 1 0 1
G | B | 1 0 1 0 <- TOS
When B finishes executing, another divergence
reconv. PC | next PC | Active Mask
null | G | 1 1 1 1
G | F | 0 1 0 1
G | E | 1 0 1 0 <- change to E
E | D | 0 0 1 0
E | C | 1 0 0 0 <- TOS
After C finishes and PC goes to E -> pop the stack
reconv. PC | next PC | Active Mask
null | G | 1 1 1 1
G | F | 0 1 0 1
G | E | 1 0 1 0
E | D | 0 0 1 0 <- TOS
D finishes; PC reaches E, matching the reconv. PC -> pop the stack
reconv. PC | next PC | Active Mask
null | G | 1 1 1 1
G | F | 0 1 0 1
G | E | 1 0 1 0 <- TOS
...
and so on, until the PC reaches G
```
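+ A minimal Python sketch of the stack mechanism (note: the SimtStack/Entry
names and the bit-list masks are my own, not from the book; real hardware
would use bitmask registers):
```
from dataclasses import dataclass

@dataclass
class Entry:
    reconv_pc: str | None  # PC where this entry's diverged paths reconverge
    next_pc: str           # next instruction for this entry's threads
    mask: list[int]        # active mask: which lanes execute

class SimtStack:
    def __init__(self, num_threads: int, entry_pc: str):
        self.stack = [Entry(None, entry_pc, [1] * num_threads)]

    def top(self) -> Entry:
        return self.stack[-1]

    def diverge(self, taken_pc, not_taken_pc, taken_mask, reconv_pc):
        """On branch divergence: retarget the TOS to the reconvergence PC,
        then push one entry per path; the path with more active threads
        is pushed last so it is executed first."""
        tos = self.stack.pop()
        full = tos.mask
        not_taken = [a & (1 - t) for a, t in zip(full, taken_mask)]
        self.stack.append(Entry(tos.reconv_pc, reconv_pc, full))
        paths = [Entry(reconv_pc, not_taken_pc, not_taken),
                 Entry(reconv_pc, taken_pc, taken_mask)]
        paths.sort(key=lambda e: sum(e.mask))  # most-active path ends up on top
        self.stack.extend(p for p in paths if any(p.mask))

    def maybe_pop(self, current_pc: str):
        """When the executing path reaches its reconv. PC, pop the entry."""
        if self.top().reconv_pc == current_pc:
            self.stack.pop()

# Replaying the first divergence of the example kernel (4 threads):
stack = SimtStack(4, "A")
# if (id % 2 == 0): threads 0 and 2 go to B, 1 and 3 go to F, reconv. at G
stack.diverge(taken_pc="B", not_taken_pc="F",
              taken_mask=[1, 0, 1, 0], reconv_pc="G")
for e in reversed(stack.stack):  # TOS printed first
    print(e.reconv_pc, e.next_pc, e.mask)
# G B [1, 0, 1, 0]  /  G F [0, 1, 0, 1]  /  None G [1, 1, 1, 1]
```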
2) [SIMT] Stack-less: (harder to implement ; better performance ; no deadlocks w/ yielding)
+ Add a barrier participation mask
+ Thread active bits -> track whether each thread is currently executing
=> gives groups of threads (grouped by common PC) that the scheduler
can choose between to execute
+ Thread states -> ready to execute or blocked (yielding, which avoids
deadlocks, is not handled yet)
+ When reaching an "add_barrier [barrier]" instruction ->
-> create a new barrier participation mask containing all currently active threads
+ When reaching "wait [barrier]" ->
-> the thread blocks (effectively executes NOPs) until all threads in the
barrier mask have reached this point as well
(arrivals are tracked using a barrier state field; see the sketch after the example)
+ Example:
```
Assume the following kernel:
A;
add_barrier b0;
if (id % 2 == 0) {
B;
} else {
C;
add_barrier b1;
if (id == 1) {
D;
} else {
E;
}
wait b1;
F;
}
wait b0;
G;
Imagine 8 threads per warp
During A:
All threads are active:
0 1 2 3 4 5 6 7
---------------
1 1 1 1 1 1 1 1 <- active threads
Adding barrier b0:
Creating new barrier participation mask:
0 1 2 3 4 5 6 7
---------------
1 1 1 1 1 1 1 1 <- b0 barrier mask
0 0 0 0 0 0 0 0 <- b0 barrier state [initial]
Even-id threads go to B and odd-id threads go to C => two thread groups
0 1 2 3 4 5 6 7
---------------
1 0 1 0 1 0 1 0 <- group at B (complement = group at C)
Execute one thread group at a time
Adding new barrier b1:
Active threads are the half executing C (the odd thread ids)
0 1 2 3 4 5 6 7
---------------
0 1 0 1 0 1 0 1 <- b1 barrier mask
0 0 0 0 0 0 0 0 <- b1 barrier state [initial]
Thread 1 jumps to D and the other odd-id threads jump to E
=> now there are three thread groups to schedule and execute
Reaching `wait b1`:
Threads that reach it -> become blocked (not executable) =>
=> when executing a thread group, blocked threads get "ignored"
When the b1 barrier state == the b1 barrier mask,
make all of its threads executable again => two thread groups remain
(the threads now sit at two different PCs)
Similarly when waiting at `wait b0` before G
```
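+ A minimal Python sketch of the stack-less scheme (note: the class/field
names are my own; real hardware keeps per-warp bitmask registers, not
Python sets):
```
class StacklessWarp:
    def __init__(self, num_threads: int):
        self.pc = {t: "A" for t in range(num_threads)}  # per-thread PC
        self.blocked = set()   # threads currently waiting at a barrier
        self.bar_mask = {}     # barrier name -> participating threads
        self.bar_state = {}    # barrier name -> threads that have arrived

    def groups(self):
        """Schedulable thread groups: ready threads bucketed by common PC."""
        by_pc = {}
        for t, pc in self.pc.items():
            if t not in self.blocked:
                by_pc.setdefault(pc, set()).add(t)
        return by_pc

    def add_barrier(self, name, active_threads):
        # The participation mask snapshots the threads active at this point.
        self.bar_mask[name] = set(active_threads)
        self.bar_state[name] = set()

    def wait(self, name, thread):
        # A thread arriving at the barrier is recorded and blocked ...
        self.bar_state[name].add(thread)
        self.blocked.add(thread)
        # ... until every participant has arrived; then all are released.
        if self.bar_state[name] == self.bar_mask[name]:
            self.blocked -= self.bar_mask[name]

# Replaying the start of the example: 8 threads, b0 over all of them.
w = StacklessWarp(8)
w.add_barrier("b0", range(8))
for t in range(8):  # even ids branch to B, odd ids to C
    w.pc[t] = "B" if t % 2 == 0 else "C"
print(w.groups())   # {'B': {0, 2, 4, 6}, 'C': {1, 3, 5, 7}}
```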
* Warp Scheduling
+ Round-Robin (RR) scheduling
> same # of instructions executed per warp => all warps get equal attention
+ Greedy-then-Oldest (GTO) scheduling
> keep issuing from the same warp until it stalls, then switch to the oldest ready warp
+ Choose whichever one for now, upgrade later if needed
------
Ex: execute a warp's instructions until it issues an async memory request
that misses in the cache, then switch warps (see the sketch below)
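+ A minimal Python sketch contrasting the two policies (note: the Warp
fields and the one-bit stall model are my own simplification):
```
from dataclasses import dataclass, field
from itertools import count

_age = count()

@dataclass
class Warp:
    wid: int
    stalled: bool = False  # e.g. waiting on a cache-miss memory request
    age: int = field(default_factory=lambda: next(_age))  # lower = older

def rr_pick(warps, last):
    """Round-robin: try warps in order, starting after the last issued one."""
    n = len(warps)
    for i in range(1, n + 1):
        w = warps[(last + i) % n]
        if not w.stalled:
            return w
    return None

def gto_pick(warps, greedy):
    """Greedy-then-oldest: keep the same warp until it stalls, then
    fall back to the oldest ready warp."""
    if greedy is not None and not greedy.stalled:
        return greedy
    ready = [w for w in warps if not w.stalled]
    return min(ready, key=lambda w: w.age, default=None)

warps = [Warp(i) for i in range(4)]
warps[0].stalled = True                      # warp 0 missed in the cache
print(rr_pick(warps, last=0).wid)            # RR: next ready warp -> 1
print(gto_pick(warps, greedy=warps[0]).wid)  # GTO: oldest ready warp -> 1
```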
* Instruction Flow
+ Each warp has access to an "Instruction Buffer" (I-Buf)
+ Each warp has a "Scoreboard" assigned to it
with 3 or 4 entries, each entry an id of a register
Ex: containing 3 entries R1, R5, R8
+ Instruction get fetched from the Instruction Cache (I-Cache)
+ If any of the source or destination registers of the instruction
matches a scoreboard entry, hold the instr. inside the buf
but do not execute it yet
+ When an in-flight instruction finishes executing, it clears its destination
register(s) from the scoreboard, which can make some instructions in the
buffer eligible to execute (see the sketch below)
+ When I-Buf is full, stall / drop the fetched instruction
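+ A minimal Python sketch of the scoreboard check (note: assuming a small
fixed number of entries holding destination-register ids, like the
R1, R5, R8 example above):
```
SCOREBOARD_SIZE = 3

class Scoreboard:
    def __init__(self):
        self.pending = set()  # dest registers of in-flight instructions

    def can_issue(self, srcs, dst):
        # Hold the instruction in the I-Buf if any of its source or
        # destination registers matches a scoreboard entry.
        return self.pending.isdisjoint(set(srcs) | {dst})

    def issue(self, dst):
        assert len(self.pending) < SCOREBOARD_SIZE, "scoreboard full"
        self.pending.add(dst)

    def retire(self, dst):
        # Writeback clears the entry, possibly unblocking buffered instrs.
        self.pending.discard(dst)

sb = Scoreboard()
sb.issue("R1")                              # in-flight: some load -> R1
print(sb.can_issue(srcs=["R1"], dst="R2"))  # False: RAW hazard on R1
sb.retire("R1")
print(sb.can_issue(srcs=["R1"], dst="R2"))  # True: hazard cleared
```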
===================================
References:
+ Main one as of now: General-Purpose Graphics Processor Architectures (Aamodt, Fung & Rogers, 2018)
https://link.springer.com/book/10.1007/978-3-031-01759-9