Skip to content
Navigation Menu
{{ message }}
forked from taskflow/taskflow
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathGPUTaskingcudaFlow.html
More file actions
522 lines (513 loc) · 66.8 KB
/
Copy pathGPUTaskingcudaFlow.html
File metadata and controls
522 lines (513 loc) · 66.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>Cookbook » GPU Tasking (cudaFlow) | Taskflow QuickStart</title>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,400i,600,600i%7CSource+Code+Pro:400,400i,600" />
<link rel="stylesheet" href="m-dark+documentation.compiled.css" />
<link rel="icon" href="favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="theme-color" content="#22272e" />
</head>
<body>
<header><nav id="navigation">
<div class="m-container">
<div class="m-row">
<span id="m-navbar-brand" class="m-col-t-8 m-col-m-none m-left-m">
<a href="https://taskflow.github.io"><img src="taskflow_logo.png" alt="" />Taskflow</a> <span class="m-breadcrumb">|</span> <a href="index.html" class="m-thin">QuickStart</a>
</span>
<div class="m-col-t-4 m-hide-m m-text-right m-nopadr">
<a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
<path id="m-doc-search-icon-path" d="m6 0c-3.31 0-6 2.69-6 6 0 3.31 2.69 6 6 6 1.49 0 2.85-0.541 3.89-1.44-0.0164 0.338 0.147 0.759 0.5 1.15l3.22 3.79c0.552 0.614 1.45 0.665 2 0.115 0.55-0.55 0.499-1.45-0.115-2l-3.79-3.22c-0.392-0.353-0.812-0.515-1.15-0.5 0.895-1.05 1.44-2.41 1.44-3.89 0-3.31-2.69-6-6-6zm0 1.56a4.44 4.44 0 0 1 4.44 4.44 4.44 4.44 0 0 1-4.44 4.44 4.44 4.44 0 0 1-4.44-4.44 4.44 4.44 0 0 1 4.44-4.44z"/>
</svg></a>
<a id="m-navbar-show" href="#navigation" title="Show navigation"></a>
<a id="m-navbar-hide" href="#" title="Hide navigation"></a>
</div>
<div id="m-navbar-collapse" class="m-col-t-12 m-show-m m-col-m-none m-right-m">
<div class="m-row">
<ol class="m-col-t-6 m-col-m-none">
<li><a href="pages.html">Handbook</a></li>
<li><a href="namespaces.html">Namespaces</a></li>
</ol>
<ol class="m-col-t-6 m-col-m-none" start="3">
<li><a href="annotated.html">Classes</a></li>
<li><a href="files.html">Files</a></li>
<li class="m-show-m"><a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
<use href="#m-doc-search-icon-path" />
</svg></a></li>
</ol>
</div>
</div>
</div>
</div>
</nav></header>
<main><article>
<div class="m-container m-container-inflatable">
<div class="m-row">
<div class="m-col-l-10 m-push-l-1">
<h1>
<span class="m-breadcrumb"><a href="Cookbook.html">Cookbook</a> »</span>
GPU Tasking (cudaFlow)
</h1>
<div class="m-block m-default">
<h3>Contents</h3>
<ul>
<li><a href="#GPUTaskingcudaFlowIncludeTheHeader">Include the Header</a></li>
<li><a href="#Create_a_cudaFlow">Create a cudaFlow</a></li>
<li><a href="#Compile_a_cudaFlow_program">Compile a cudaFlow Program</a></li>
<li><a href="#run_a_cudaflow_on_a_specific_gpu">Run a cudaFlow on Specific GPU</a></li>
<li><a href="#GPUMemoryOperations">Create Memory Operation Tasks</a></li>
<li><a href="#StudyThecudaFlowGranularity">Study the Granularity</a></li>
<li><a href="#OffloadAcudaFlow">Offload a cudaFlow</a></li>
<li><a href="#UpdateAcudaFlow">Update a cudaFlow</a></li>
<li><a href="#UsecudaFlowInAStandaloneEnvironment">Use cudaFlow in a Standalone Environment</a></li>
</ul>
</div>
<p>Modern scientific computing typically leverages GPU-powered parallel processing cores to speed up large-scale applications. This chapter discusses how to implement CPU-GPU heterogeneous tasking algorithms with <a href="https://developer.nvidia.com/cuda-zone">Nvidia CUDA</a>.</p><section id="GPUTaskingcudaFlowIncludeTheHeader"><h2><a href="#GPUTaskingcudaFlowIncludeTheHeader">Include the Header</a></h2><p>You need to include the header file, <code>taskflow/cuda/cudaflow.hpp</code>, for creating a <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a> task.</p></section><section id="Create_a_cudaFlow"><h2><a href="#Create_a_cudaFlow">Create a cudaFlow</a></h2><p>Taskflow leverages <a href="https://developer.nvidia.com/blog/cuda-graphs/">CUDA Graph</a> to enable concurrent CPU-GPU tasking using a task graph model, <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a>. A cudaFlow is a task in a taskflow and is associated with a CUDA graph to execute multiple dependent GPU operations in a single CPU call. To create a cudaFlow task, emplace a callable with an argument of type <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a>. The following example implements the canonical saxpy (A·X Plus Y) task graph using <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a>.</p><pre class="m-code"> <span class="mi">1</span><span class="o">:</span> <span class="err">#</span><span class="n">include</span> <span class="o"><</span><span class="n">taskflow</span><span class="o">/</span><span class="n">cuda</span><span class="o">/</span><span class="n">cudaflow</span><span class="p">.</span><span class="n">hpp</span><span class="o">></span>
<span class="mi">2</span><span class="o">:</span>
<span class="mi">3</span><span class="o">:</span> <span class="c1">// saxpy (single-precision A·X Plus Y) kernel</span>
<span class="mi">4</span><span class="o">:</span> <span class="n">__global__</span> <span class="kt">void</span> <span class="n">saxpy</span><span class="p">(</span><span class="kt">int</span> <span class="n">n</span><span class="p">,</span> <span class="kt">float</span> <span class="n">a</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="o">*</span><span class="n">y</span><span class="p">)</span> <span class="p">{</span>
<span class="mi">5</span><span class="o">:</span> <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="o">*</span><span class="n">blockDim</span><span class="p">.</span><span class="n">x</span> <span class="o">+</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
<span class="mi">6</span><span class="o">:</span> <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o"><</span> <span class="n">n</span><span class="p">)</span> <span class="p">{</span>
<span class="mi">7</span><span class="o">:</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="o">*</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="mi">8</span><span class="o">:</span> <span class="p">}</span>
<span class="mi">9</span><span class="o">:</span> <span class="p">}</span>
<span class="mi">10</span><span class="o">:</span>
<span class="mi">11</span><span class="o">:</span> <span class="c1">// main function begins</span>
<span class="mi">12</span><span class="o">:</span> <span class="kt">int</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="mi">13</span><span class="o">:</span>
<span class="mi">14</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">Taskflow</span> <span class="n">taskflow</span><span class="p">;</span>
<span class="mi">15</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">Executor</span> <span class="n">executor</span><span class="p">;</span>
<span class="mi">16</span><span class="o">:</span>
<span class="mi">17</span><span class="o">:</span> <span class="k">const</span> <span class="kt">unsigned</span> <span class="n">N</span> <span class="o">=</span> <span class="mi">1</span><span class="o"><<</span><span class="mi">20</span><span class="p">;</span> <span class="c1">// size of the vector</span>
<span class="mi">18</span><span class="o">:</span>
<span class="mi">19</span><span class="o">:</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="kt">float</span><span class="o">></span> <span class="n">hx</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="mf">1.0f</span><span class="p">);</span> <span class="c1">// x vector at host</span>
<span class="mi">20</span><span class="o">:</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="kt">float</span><span class="o">></span> <span class="n">hy</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="mf">2.0f</span><span class="p">);</span> <span class="c1">// y vector at host</span>
<span class="mi">21</span><span class="o">:</span>
<span class="mi">22</span><span class="o">:</span> <span class="kt">float</span> <span class="o">*</span><span class="n">dx</span><span class="p">{</span><span class="k">nullptr</span><span class="p">};</span> <span class="c1">// x vector at device</span>
<span class="mi">23</span><span class="o">:</span> <span class="kt">float</span> <span class="o">*</span><span class="n">dy</span><span class="p">{</span><span class="k">nullptr</span><span class="p">};</span> <span class="c1">// y vector at device</span>
<span class="mi">24</span><span class="o">:</span>
<span class="mi">25</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">Task</span> <span class="n">allocate_x</span> <span class="o">=</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">(</span>
<span class="mi">26</span><span class="o">:</span> <span class="p">[</span><span class="o">&</span><span class="p">](){</span> <span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&</span><span class="n">dx</span><span class="p">,</span> <span class="n">N</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));}</span>
<span class="mi">27</span><span class="o">:</span> <span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"allocate_x"</span><span class="p">);</span>
<span class="mi">28</span><span class="o">:</span>
<span class="mi">29</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">Task</span> <span class="n">allocate_y</span> <span class="o">=</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">(</span>
<span class="mi">30</span><span class="o">:</span> <span class="p">[</span><span class="o">&</span><span class="p">](){</span> <span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&</span><span class="n">dy</span><span class="p">,</span> <span class="n">N</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));}</span>
<span class="mi">31</span><span class="o">:</span> <span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"allocate_y"</span><span class="p">);</span>
<span class="mi">32</span><span class="o">:</span>
<span class="mi">33</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">Task</span> <span class="n">cudaflow</span> <span class="o">=</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="mi">34</span><span class="o">:</span> <span class="c1">// create data transfer tasks</span>
<span class="mi">35</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">h2d_x</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dx</span><span class="p">,</span> <span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_x"</span><span class="p">);</span>
<span class="mi">36</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">h2d_y</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dy</span><span class="p">,</span> <span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_y"</span><span class="p">);</span>
<span class="mi">37</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">d2h_x</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">dx</span><span class="p">,</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_x"</span><span class="p">);</span>
<span class="mi">38</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">d2h_y</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">dy</span><span class="p">,</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_y"</span><span class="p">);</span>
<span class="mi">39</span><span class="o">:</span>
<span class="mi">40</span><span class="o">:</span> <span class="c1">// launch saxpy<<<(N+255)/256, 256, 0>>>(N, 2.0f, dx, dy)</span>
<span class="mi">41</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">kernel</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">kernel</span><span class="p">(</span>
<span class="mi">42</span><span class="o">:</span> <span class="p">(</span><span class="n">N</span><span class="o">+</span><span class="mi">255</span><span class="p">)</span><span class="o">/</span><span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">saxpy</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="mf">2.0f</span><span class="p">,</span> <span class="n">dx</span><span class="p">,</span> <span class="n">dy</span>
<span class="mi">43</span><span class="o">:</span> <span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"saxpy"</span><span class="p">);</span>
<span class="mi">44</span><span class="o">:</span>
<span class="mi">45</span><span class="o">:</span> <span class="n">kernel</span><span class="p">.</span><span class="n">succeed</span><span class="p">(</span><span class="n">h2d_x</span><span class="p">,</span> <span class="n">h2d_y</span><span class="p">)</span>
<span class="mi">46</span><span class="o">:</span> <span class="p">.</span><span class="n">precede</span><span class="p">(</span><span class="n">d2h_x</span><span class="p">,</span> <span class="n">d2h_y</span><span class="p">);</span>
<span class="mi">48</span><span class="o">:</span> <span class="p">}).</span><span class="n">name</span><span class="p">(</span><span class="s">"saxpy"</span><span class="p">);</span>
<span class="mi">49</span><span class="o">:</span> <span class="n">cudaflow</span><span class="p">.</span><span class="n">succeed</span><span class="p">(</span><span class="n">allocate_x</span><span class="p">,</span> <span class="n">allocate_y</span><span class="p">);</span> <span class="c1">// overlap memory alloc</span>
<span class="mi">50</span><span class="o">:</span>
<span class="mi">51</span><span class="o">:</span> <span class="n">executor</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">taskflow</span><span class="p">).</span><span class="n">wait</span><span class="p">();</span>
<span class="mi">52</span><span class="o">:</span>
<span class="mi">53</span><span class="o">:</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">cout</span><span class="p">);</span> <span class="c1">// dump the taskflow</span>
<span class="mi">54</span><span class="o">:</span> <span class="p">}</span></pre><div class="m-graph"><svg style="width: 31.125rem; height: 14.562rem;" viewBox="0.00 0.00 498.27 232.77">
<g transform="scale(1 1) rotate(0) translate(4 228.7696)">
<title>Taskflow</title>
<g class="m-cluster">
<title>cluster_p0x55b2191178a8</title>
<polygon points="8,-47.3848 8,-180.3848 482.2742,-180.3848 482.2742,-47.3848 8,-47.3848"/>
<text text-anchor="middle" x="245.1371" y="-163.5848">cudaFlow: saxpy</text>
</g>
<g class="m-node m-flat">
<title>p0x55b219117698</title>
<ellipse cx="307.6848" cy="-206.3848" rx="68.6788" ry="18.2703"/>
<text text-anchor="middle" x="307.6848" y="-202.5848">allocate_x</text>
</g>
<g class="m-node">
<title>p0x55b2191178a8</title>
<polygon points="474.2742,-118.3848 471.2742,-122.3848 450.2742,-122.3848 447.2742,-118.3848 412.2742,-118.3848 412.2742,-82.3848 474.2742,-82.3848 474.2742,-118.3848"/>
<text text-anchor="middle" x="443.2742" y="-96.5848">saxpy</text>
</g>
<g class="m-edge">
<title>p0x55b219117698->p0x55b2191178a8</title>
<path d="M358.3808,-193.9235C364.6824,-191.2971 370.8188,-188.1487 376.2742,-184.3848 397.914,-169.4545 416.0339,-145.5039 427.9507,-126.9903"/>
<polygon points="430.9827,-128.7423 433.2907,-118.4019 425.0381,-125.0461 430.9827,-128.7423"/>
</g>
<g class="m-node m-flat">
<title>p0x55b2191177a0</title>
<ellipse cx="307.6848" cy="-18.3848" rx="68.6788" ry="18.2703"/>
<text text-anchor="middle" x="307.6848" y="-14.5848">allocate_y</text>
</g>
<g class="m-edge">
<title>p0x55b2191177a0->p0x55b2191178a8</title>
<path d="M353.0018,-32.3015C360.9846,-35.4758 369.0451,-39.176 376.2742,-43.3848 391.323,-52.1462 406.258,-64.4108 418.1951,-75.3198"/>
<polygon points="415.8889,-77.9553 425.5776,-82.2423 420.677,-72.849 415.8889,-77.9553"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870401a50</title>
<ellipse cx="60.5477" cy="-128.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="60.5477" y="-124.5848">h2d_x</text>
</g>
<g class="m-node">
<title>p0x7f2870402bc0</title>
<polygon points="203.0955,-118.3848 145.0955,-118.3848 141.0955,-114.3848 141.0955,-82.3848 199.0955,-82.3848 203.0955,-86.3848 203.0955,-118.3848"/>
<polyline points="199.0955,-114.3848 141.0955,-114.3848 "/>
<polyline points="199.0955,-114.3848 199.0955,-82.3848 "/>
<polyline points="199.0955,-114.3848 203.0955,-118.3848 "/>
<text text-anchor="middle" x="172.0955" y="-96.5848">saxpy</text>
</g>
<g class="m-edge">
<title>p0x7f2870401a50->p0x7f2870402bc0</title>
<path d="M98.6241,-118.8271C109.0508,-116.2098 120.3981,-113.3615 130.9863,-110.7037"/>
<polygon points="132.0035,-114.0571 140.8505,-108.2277 130.2993,-107.2677 132.0035,-114.0571"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870402310</title>
<ellipse cx="307.6848" cy="-73.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="307.6848" y="-69.5848">d2h_x</text>
</g>
<g class="m-edge">
<title>p0x7f2870402bc0->p0x7f2870402310</title>
<path d="M203.2024,-94.1904C219.2987,-90.9852 239.3435,-86.9936 257.5589,-83.3664"/>
<polygon points="258.3298,-86.7817 267.4537,-81.396 256.9627,-79.9165 258.3298,-86.7817"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870402780</title>
<ellipse cx="307.6848" cy="-128.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="307.6848" y="-124.5848">d2h_y</text>
</g>
<g class="m-edge">
<title>p0x7f2870402bc0->p0x7f2870402780</title>
<path d="M203.2024,-106.8085C219.3647,-110.1461 239.5079,-114.3058 257.7828,-118.0797"/>
<polygon points="257.2049,-121.5342 267.7062,-120.129 258.6207,-114.6788 257.2049,-121.5342"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870401eb0</title>
<ellipse cx="60.5477" cy="-73.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="60.5477" y="-69.5848">h2d_y</text>
</g>
<g class="m-edge">
<title>p0x7f2870401eb0->p0x7f2870402bc0</title>
<path d="M99.2487,-82.7523C109.4425,-85.2197 120.4785,-87.8909 130.8031,-90.39"/>
<polygon points="130.2904,-93.8669 140.8331,-92.8178 131.9372,-87.0634 130.2904,-93.8669"/>
</g>
<g class="m-edge">
<title>p0x7f2870402310->p0x55b2191178a8</title>
<path d="M347.999,-81.4126C364.9361,-84.7853 384.6278,-88.7065 401.698,-92.1057"/>
<polygon points="401.4487,-95.6247 411.9397,-94.1451 402.8158,-88.7595 401.4487,-95.6247"/>
</g>
<g class="m-edge">
<title>p0x7f2870402780->p0x55b2191178a8</title>
<path d="M347.6327,-120.1353C364.872,-116.5753 385.0189,-112.4148 402.3641,-108.8329"/>
<polygon points="403.1242,-112.2499 412.2097,-106.7998 401.7085,-105.3946 403.1242,-112.2499"/>
</g>
</g>
</svg>
</div><p>Debrief:</p><ul><li>Lines 3-9 define a saxpy kernel using CUDA</li><li>Lines 19-20 declare two host vectors, <code>hx</code> and <code>hy</code></li><li>Lines 22-23 declare two device vector pointers, <code>dx</code> and <code>dy</code></li><li>Lines 25-31 declare two tasks to allocate memory for <code>dx</code> and <code>dy</code> on device, each of <code>N*sizeof(float)</code> bytes</li><li>Lines 33-48 create a cudaFlow to define a GPU task graph that contains:<ul><li>two host-to-device data transfer tasks</li><li>one saxpy kernel task</li><li>two device-to-host data transfer tasks</li></ul></li><li>Lines 49-53 define the task dependency between host tasks and the cudaFlow tasks and execute the taskflow</li></ul>
<p><a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a> is a lightweight abstraction over CUDA <a href="classtf_1_1Graph.html" class="m-doc">Graph</a>. We do not expend yet another effort on simplifying kernel programming but focus on tasking CUDA operations and their dependencies. This organization lets users fully take advantage of CUDA featuress that are commensurate with their domain knowledge, while leaving difficult task parallelism details to Taskflow.</p></section><section id="Compile_a_cudaFlow_program"><h2><a href="#Compile_a_cudaFlow_program">Compile a cudaFlow Program</a></h2><p>Use <a href="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html">nvcc</a> to compile a cudaFlow program:</p><pre class="m-console"><span class="go">~$ nvcc -std=c++17 my_cudaflow.cu -I path/to/include/taskflow -O2 -o my_cudaflow</span>
<span class="go">~$ ./my_cudaflow</span></pre><p>Please visit the page <a href="CompileTaskflowWithCUDA.html" class="m-doc">Compile Taskflow with CUDA</a> for more details.</p></section><section id="run_a_cudaflow_on_a_specific_gpu"><h2><a href="#run_a_cudaflow_on_a_specific_gpu">Run a cudaFlow on Specific GPU</a></h2><p>By default, a cudaFlow runs on the current CUDA GPU associated with the caller, which is typically GPU <code>0</code>. Each CUDA GPU has an integer identifier in the range of <code>[0, N)</code>, where <code>N</code> is the number of CUDA GPUs in a system. You can run a <a href="classtf_1_1cudaFlow.html" class="m-doc">cudaFlow</a> on a specific GPU using <a href="classtf_1_1FlowBuilder.html#afdf47fd1a358fb64f8c1b89e2a393169" class="m-doc">tf::<wbr />Taskflow::<wbr />emplace_on</a>. The code below creates a <a href="classtf_1_1cudaFlow.html" class="m-doc">cudaFlow</a> that runs on GPU <code>2</code>.</p><pre class="m-code"><span class="n">taskflow</span><span class="p">.</span><span class="n">emplace_on</span><span class="p">([]</span> <span class="p">(</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cudaflow</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// here, cudaflow is under GPU 2</span>
<span class="c1">// ...</span>
<span class="p">},</span> <span class="mi">2</span><span class="p">);</span> <span class="c1">// place the cudaFlow on GPU 2</span></pre><aside class="m-note m-warning"><h4>Attention</h4><p><a href="classtf_1_1FlowBuilder.html#afdf47fd1a358fb64f8c1b89e2a393169" class="m-doc">tf::<wbr />Taskflow::<wbr />emplace_on</a> allows you to place a cudaFlow on a particular GPU device, but it is your responsibility to ensure correct memory access. For example, you may not allocate a memory block on GPU <code>2</code> while accessing it from a kernel on GPU <code>0</code>.</p></aside><p>An easy practice is to allocate <em>unified shared memory</em> using <code>cudaMallocManaged</code> and let the CUDA runtime perform automatic memory migration between GPUs.</p></section><section id="GPUMemoryOperations"><h2><a href="#GPUMemoryOperations">Create Memory Operation Tasks</a></h2><p><a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a> provides a set of methods for users to manipulate device memory. There are two categories, <em>raw</em> data and <em>typed</em> data. Raw data operations are methods with prefix <code>mem</code>, such as <code>memcpy</code> and <code>memset</code>, that operate in <em>bytes</em>. Typed data operations such as <code>copy</code>, <code>fill</code>, and <code>zero</code>, take <em>logical count</em> of elements. For instance, the following three methods have the same result of zeroing <code>sizeof(int)*count</code> bytes of the device memory area pointed to by <code>target</code>.</p><pre class="m-code"><span class="kt">int</span><span class="o">*</span> <span class="n">target</span><span class="p">;</span>
<span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&</span><span class="n">target</span><span class="p">,</span> <span class="n">count</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">));</span>
<span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">){</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">memset_target</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">memset</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="o">*</span> <span class="n">count</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">same_as_above</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">same_as_above_again</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">zero</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="p">});</span></pre><p>The method <a href="classtf_1_1cudaFlow.html#a21d4447bc834f4d3e1bb4772c850d090" class="m-doc">cudaFlow::<wbr />fill</a> is a more powerful version of <a href="classtf_1_1cudaFlow.html#a079ca65da35301e5aafd45878a19e9d2" class="m-doc">cudaFlow::<wbr />memset</a>. It can fill a memory area with any value of type <code>T</code>, given that <code>sizeof(T)</code> is 1, 2, or 4 bytes. For example, the following code sets each element in the array <code>target</code> to 1234.</p><pre class="m-code"><span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">){</span> <span class="n">cf</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="mi">1234</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span> <span class="p">});</span></pre><p>Similar concept applies to <a href="classtf_1_1cudaFlow.html#ad37637606f0643f360e9eda1f9a6e559" class="m-doc">cudaFlow::<wbr />memcpy</a> and <a href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f" class="m-doc">cudaFlow::<wbr />copy</a> as well.</p><pre class="m-code"><span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">){</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">memcpy_target</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">memcpy</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="n">source</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="o">*</span> <span class="n">count</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">same_as_above</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">target</span><span class="p">,</span> <span class="n">source</span><span class="p">,</span> <span class="n">count</span><span class="p">);</span>
<span class="p">});</span></pre></section><section id="StudyThecudaFlowGranularity"><h2><a href="#StudyThecudaFlowGranularity">Study the Granularity</a></h2><p>Creating a cudaFlow has certain overhead, which means <em>fine-grained</em> tasking such as one GPU operation per cudaFlow may not give you any performance gain. You should aggregate as many GPU operations as possible in a cudaFlow to launch the entire graph once instead of separated graphs. For example, the following code creates a fine-grained saxpy task graph using one cudaFlow per GPU operation.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">Task</span> <span class="n">h2d_x</span> <span class="o">=</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dx</span><span class="p">,</span> <span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_x"</span><span class="p">);</span>
<span class="p">}).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_x"</span><span class="p">);</span> <span class="c1">// creates the 1st cudaFlow</span>
<span class="n">tf</span><span class="o">::</span><span class="n">Task</span> <span class="n">h2d_y</span> <span class="o">=</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dy</span><span class="p">,</span> <span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_y"</span><span class="p">);</span>
<span class="p">}).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_y"</span><span class="p">);</span> <span class="c1">// creates the 2nd cudaFlow </span>
<span class="n">tf</span><span class="o">::</span><span class="n">Task</span> <span class="n">d2h_x</span> <span class="o">=</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">dx</span><span class="p">,</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_x"</span><span class="p">);</span>
<span class="p">}).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_x"</span><span class="p">);</span> <span class="c1">// creates the 3rd cudaFlow</span>
<span class="n">tf</span><span class="o">::</span><span class="n">Task</span> <span class="n">d2h_y</span> <span class="o">=</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">dy</span><span class="p">,</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_y"</span><span class="p">);</span>
<span class="p">}).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_y"</span><span class="p">);</span> <span class="c1">// creates the 4th cudaFlow</span>
<span class="n">tf</span><span class="o">::</span><span class="n">Task</span> <span class="n">kernel</span> <span class="o">=</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cf</span><span class="p">.</span><span class="n">kernel</span><span class="p">((</span><span class="n">N</span><span class="o">+</span><span class="mi">255</span><span class="p">)</span><span class="o">/</span><span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">saxpy</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="mf">2.0f</span><span class="p">,</span> <span class="n">dx</span><span class="p">,</span> <span class="n">dy</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"saxpy"</span><span class="p">);</span>
<span class="p">}).</span><span class="n">name</span><span class="p">(</span><span class="s">"kernel"</span><span class="p">);</span> <span class="c1">// creates the 5th cudaFlow</span>
<span class="n">kernel</span><span class="p">.</span><span class="n">succeed</span><span class="p">(</span><span class="n">h2d_x</span><span class="p">,</span> <span class="n">h2d_y</span><span class="p">)</span>
<span class="p">.</span><span class="n">precede</span><span class="p">(</span><span class="n">d2h_x</span><span class="p">,</span> <span class="n">d2h_y</span><span class="p">);</span></pre><div class="m-graph"><svg style="width: 39.875rem; height: 21.250rem;" viewBox="0.00 0.00 638.00 339.54">
<g transform="scale(1 1) rotate(0) translate(4 335.5391)">
<title>Taskflow</title>
<g class="m-cluster">
<title>cluster_p0x21987b0</title>
<polygon points="475,-166.7696 475,-323.5391 622,-323.5391 622,-166.7696 475,-166.7696"/>
<text text-anchor="middle" x="548.5" y="-306.7391">cudaFlow: h2d_x</text>
</g>
<g class="m-cluster">
<title>cluster_p0x2198870</title>
<polygon points="320,-166.7696 320,-323.5391 467,-323.5391 467,-166.7696 320,-166.7696"/>
<text text-anchor="middle" x="393.5" y="-306.7391">cudaFlow: h2d_y</text>
</g>
<g class="m-cluster">
<title>cluster_p0x2198930</title>
<polygon points="8,-8 8,-158.7696 155,-158.7696 155,-8 8,-8"/>
<text text-anchor="middle" x="81.5" y="-141.9696">cudaFlow: d2h_x</text>
</g>
<g class="m-cluster">
<title>cluster_p0x21989f0</title>
<polygon points="320,-8 320,-158.7696 467,-158.7696 467,-8 320,-8"/>
<text text-anchor="middle" x="393.5" y="-141.9696">cudaFlow: d2h_y</text>
</g>
<g class="m-cluster">
<title>cluster_p0x2198ab0</title>
<polygon points="163,-80.3848 163,-244.7696 312,-244.7696 312,-80.3848 163,-80.3848"/>
<text text-anchor="middle" x="237.5" y="-227.9696">cudaFlow: kernel</text>
</g>
<g class="m-node">
<title>p0x21987b0</title>
<polygon points="552.5,-210.7696 549.5,-214.7696 528.5,-214.7696 525.5,-210.7696 489.5,-210.7696 489.5,-174.7696 552.5,-174.7696 552.5,-210.7696"/>
<text text-anchor="middle" x="521" y="-188.9696">h2d_x</text>
</g>
<g class="m-node">
<title>p0x2198ab0</title>
<polygon points="303.5,-124.3848 300.5,-128.3848 279.5,-128.3848 276.5,-124.3848 238.5,-124.3848 238.5,-88.3848 303.5,-88.3848 303.5,-124.3848"/>
<text text-anchor="middle" x="271" y="-102.5848">kernel</text>
</g>
<g class="m-edge">
<title>p0x21987b0->p0x2198ab0</title>
<path d="M491.1754,-174.6478C484.7266,-171.4837 477.7976,-168.6183 471,-166.7696 437.7186,-157.7181 346.9612,-173.9678 316,-158.7696 304.4293,-153.0897 294.6407,-142.9418 287.178,-133.0803"/>
<polygon points="289.8414,-130.7826 281.2323,-124.6074 284.1114,-134.8035 289.8414,-130.7826"/>
</g>
<g class="m-node">
<title>p0x2198930</title>
<polygon points="139.5,-52 136.5,-56 115.5,-56 112.5,-52 76.5,-52 76.5,-16 139.5,-16 139.5,-52"/>
<text text-anchor="middle" x="108" y="-30.2">d2h_x</text>
</g>
<g class="m-edge">
<title>p0x2198ab0->p0x2198930</title>
<path d="M238.4063,-91.9106C212.7263,-80.5067 176.7664,-64.5377 149.1106,-52.2563"/>
<polygon points="150.2651,-48.9395 139.7052,-48.0796 147.4241,-55.337 150.2651,-48.9395"/>
</g>
<g class="m-node">
<title>p0x21989f0</title>
<polygon points="397.5,-52 394.5,-56 373.5,-56 370.5,-52 334.5,-52 334.5,-16 397.5,-16 397.5,-52"/>
<text text-anchor="middle" x="366" y="-30.2">d2h_y</text>
</g>
<g class="m-edge">
<title>p0x2198ab0->p0x21989f0</title>
<path d="M294.9725,-88.119C306.8762,-79.0491 321.4173,-67.9695 334.2351,-58.2031"/>
<polygon points="336.5313,-60.8538 342.3642,-52.0091 332.2888,-55.2859 336.5313,-60.8538"/>
</g>
<g class="m-node m-flat">
<title>p0x7fe390000e60</title>
<ellipse cx="528" cy="-271.1543" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="528" y="-267.3543">h2d_x</text>
</g>
<g class="m-edge">
<title>p0x7fe390000e60->p0x21987b0</title>
<path d="M526.3411,-252.5782C525.4981,-243.1389 524.456,-231.4692 523.5167,-220.951"/>
<polygon points="526.9986,-220.5909 522.6228,-210.9419 520.0263,-221.2136 526.9986,-220.5909"/>
</g>
<g class="m-node">
<title>p0x2198870</title>
<polygon points="397.5,-210.7696 394.5,-214.7696 373.5,-214.7696 370.5,-210.7696 334.5,-210.7696 334.5,-174.7696 397.5,-174.7696 397.5,-210.7696"/>
<text text-anchor="middle" x="366" y="-188.9696">h2d_y</text>
</g>
<g class="m-edge">
<title>p0x2198870->p0x2198ab0</title>
<path d="M336.7388,-174.7686C329.6751,-169.9182 322.3411,-164.4343 316,-158.7696 307.1007,-150.8194 298.3192,-141.1326 290.9091,-132.2614"/>
<polygon points="293.552,-129.964 284.5254,-124.4169 288.1226,-134.3824 293.552,-129.964"/>
</g>
<g class="m-node m-flat">
<title>p0x7fe390001890</title>
<ellipse cx="373" cy="-271.1543" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="373" y="-267.3543">h2d_y</text>
</g>
<g class="m-edge">
<title>p0x7fe390001890->p0x2198870</title>
<path d="M371.3411,-252.5782C370.4981,-243.1389 369.456,-231.4692 368.5167,-220.951"/>
<polygon points="371.9986,-220.5909 367.6228,-210.9419 365.0263,-221.2136 371.9986,-220.5909"/>
</g>
<g class="m-node m-flat">
<title>p0x7fe39000b790</title>
<ellipse cx="102" cy="-106.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="102" y="-102.5848">d2h_x</text>
</g>
<g class="m-edge">
<title>p0x7fe39000b790->p0x2198930</title>
<path d="M103.5451,-87.7439C104.187,-80.001 104.946,-70.844 105.6539,-62.3031"/>
<polygon points="109.1427,-62.5821 106.4809,-52.3271 102.1667,-62.0038 109.1427,-62.5821"/>
</g>
<g class="m-node m-flat">
<title>p0x7fe3900017e0</title>
<ellipse cx="373" cy="-106.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="373" y="-102.5848">d2h_y</text>
</g>
<g class="m-edge">
<title>p0x7fe3900017e0->p0x21989f0</title>
<path d="M371.1973,-87.7439C370.4485,-80.001 369.563,-70.844 368.7371,-62.3031"/>
<polygon points="372.2187,-61.9438 367.7723,-52.3271 365.2512,-62.6176 372.2187,-61.9438"/>
</g>
<g class="m-node">
<title>p0x7fe390002000</title>
<polygon points="302,-210.7696 244,-210.7696 240,-206.7696 240,-174.7696 298,-174.7696 302,-178.7696 302,-210.7696"/>
<polyline points="298,-206.7696 240,-206.7696 "/>
<polyline points="298,-206.7696 298,-174.7696 "/>
<polyline points="298,-206.7696 302,-210.7696 "/>
<text text-anchor="middle" x="271" y="-188.9696">saxpy</text>
</g>
<g class="m-edge">
<title>p0x7fe390002000->p0x2198ab0</title>
<path d="M271,-174.4496C271,-162.9966 271,-148.0043 271,-135.0062"/>
<polygon points="274.5001,-134.6783 271,-124.6784 267.5001,-134.6784 274.5001,-134.6783"/>
</g>
</g>
</svg>
</div><p>The following code aggregates the five GPU operations using one cudaFlow to achieve better performance.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">Task</span> <span class="n">cudaflow</span> <span class="o">=</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">h2d_x</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dx</span><span class="p">,</span> <span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_x"</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">h2d_y</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dy</span><span class="p">,</span> <span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_y"</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">d2h_x</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">dx</span><span class="p">,</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_x"</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">d2h_y</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">dy</span><span class="p">,</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_y"</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">saxpy</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">kernel</span><span class="p">((</span><span class="n">N</span><span class="o">+</span><span class="mi">255</span><span class="p">)</span><span class="o">/</span><span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">saxpy</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="mf">2.0f</span><span class="p">,</span> <span class="n">dx</span><span class="p">,</span> <span class="n">dy</span><span class="p">)</span>
<span class="p">.</span><span class="n">name</span><span class="p">(</span><span class="s">"saxpy"</span><span class="p">);</span>
<span class="n">saxpy</span><span class="p">.</span><span class="n">succeed</span><span class="p">(</span><span class="n">h2d_x</span><span class="p">,</span> <span class="n">h2d_y</span><span class="p">)</span>
<span class="p">.</span><span class="n">precede</span><span class="p">(</span><span class="n">d2h_x</span><span class="p">,</span> <span class="n">d2h_y</span><span class="p">);</span>
<span class="p">}).</span><span class="n">name</span><span class="p">(</span><span class="s">"saxpy"</span><span class="p">);</span> <span class="c1">// creates one cudaFlow</span></pre><div class="m-graph"><svg style="width: 20.000rem; height: 6.250rem;" viewBox="0.00 0.00 320.19 99.77">
<g transform="scale(1 1) rotate(0) translate(4 95.7696)">
<title>Taskflow</title>
<g class="m-node m-flat">
<title>p0x7f2870401a50</title>
<ellipse cx="44.5477" cy="-73.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="44.5477" y="-69.5848">h2d_x</text>
</g>
<g class="m-node">
<title>p0x7f2870402bc0</title>
<polygon points="187.0955,-63.3848 129.0955,-63.3848 125.0955,-59.3848 125.0955,-27.3848 183.0955,-27.3848 187.0955,-31.3848 187.0955,-63.3848"/>
<polyline points="183.0955,-59.3848 125.0955,-59.3848 "/>
<polyline points="183.0955,-59.3848 183.0955,-27.3848 "/>
<polyline points="183.0955,-59.3848 187.0955,-63.3848 "/>
<text text-anchor="middle" x="156.0955" y="-41.5848">saxpy</text>
</g>
<g class="m-edge">
<title>p0x7f2870401a50->p0x7f2870402bc0</title>
<path d="M82.6241,-63.8271C93.0508,-61.2098 104.3981,-58.3615 114.9863,-55.7037"/>
<polygon points="116.0035,-59.0571 124.8505,-53.2277 114.2993,-52.2677 116.0035,-59.0571"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870402310</title>
<ellipse cx="267.6432" cy="-73.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="267.6432" y="-69.5848">d2h_x</text>
</g>
<g class="m-edge">
<title>p0x7f2870402bc0->p0x7f2870402310</title>
<path d="M187.1677,-53.1843C197.1376,-55.6869 208.4272,-58.5208 219.3295,-61.2574"/>
<polygon points="218.6328,-64.691 229.184,-63.731 220.3371,-57.9016 218.6328,-64.691"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870402780</title>
<ellipse cx="267.6432" cy="-18.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="267.6432" y="-14.5848">d2h_y</text>
</g>
<g class="m-edge">
<title>p0x7f2870402bc0->p0x7f2870402780</title>
<path d="M187.1677,-37.8638C197.1376,-35.4506 208.4272,-32.7179 219.3295,-30.079"/>
<polygon points="220.2881,-33.4482 229.184,-27.6938 218.6413,-26.6446 220.2881,-33.4482"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870401eb0</title>
<ellipse cx="44.5477" cy="-18.3848" rx="44.5955" ry="18.2703"/>
<text text-anchor="middle" x="44.5477" y="-14.5848">h2d_y</text>
</g>
<g class="m-edge">
<title>p0x7f2870401eb0->p0x7f2870402bc0</title>
<path d="M83.2487,-27.7523C93.4425,-30.2197 104.4785,-32.8909 114.8031,-35.39"/>
<polygon points="114.2904,-38.8669 124.8331,-37.8178 115.9372,-32.0634 114.2904,-38.8669"/>
</g>
</g>
</svg>
</div><aside class="m-note m-info"><h4>Note</h4><p>We encourage users to understand the parallel structure of their applications to come up with the best granularity of task decomposition. A refined task graph can have significant performance difference from the raw counterpart.</p></aside></section><section id="OffloadAcudaFlow"><h2><a href="#OffloadAcudaFlow">Offload a cudaFlow</a></h2><p>By default, the executor offloads and executes the cudaFlow <em>once</em>, if the cudaFlow is never offloaded from its callable. During the execution, the executor first materializes the cudaFlow by mapping it to a native CUDA graph, creates an executable graph from the native CUDA graph, and then submit the executable graph to the CUDA runtime. Similar to <a href="classtf_1_1Executor.html" class="m-doc">tf::<wbr />Executor</a>, <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a> provides several offload methods to run the GPU task graph:</p><pre class="m-code"><span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// ... create CUDA tasks</span>
<span class="n">cf</span><span class="p">.</span><span class="n">offload</span><span class="p">();</span> <span class="c1">// offload the cudaFlow and run it once</span>
<span class="n">cf</span><span class="p">.</span><span class="n">offload_n</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span> <span class="c1">// offload the cudaFlow and run it 10 times</span>
<span class="n">cf</span><span class="p">.</span><span class="n">offload_until</span><span class="p">([</span><span class="n">repeat</span><span class="o">=</span><span class="mi">5</span><span class="p">]</span> <span class="p">()</span> <span class="k">mutable</span> <span class="p">{</span> <span class="k">return</span> <span class="n">repeat</span><span class="o">--</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span> <span class="p">})</span> <span class="c1">// five times</span>
<span class="p">});</span></pre><p>After you offload a cudaFlow, it is considered executed, and the executor will <em>not</em> run an offloaded cudaFlow after leaving the cudaFlow task callable. On the other hand, if a cudaFlow is not offloaded, the executor runs it once. For example, the following two versions represent the same execution logic.</p><pre class="m-code"><span class="c1">// version 1: explicitly offload a cudaFlow once</span>
<span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cf</span><span class="p">.</span><span class="n">single_task</span><span class="p">([]</span> <span class="n">__device__</span> <span class="p">(){});</span>
<span class="n">cf</span><span class="p">.</span><span class="n">offload</span><span class="p">();</span>
<span class="p">});</span>
<span class="c1">// version 2 (same as version 1): executor offloads the cudaFlow once</span>
<span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([](</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">sf</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cf</span><span class="p">.</span><span class="n">single_task</span><span class="p">([]</span> <span class="n">__device__</span> <span class="p">(){});</span>
<span class="p">});</span></pre></section><section id="UpdateAcudaFlow"><h2><a href="#UpdateAcudaFlow">Update a cudaFlow</a></h2><p>Many GPU applications require you to launch a cudaFlow multiple times and update node parameters (e.g., kernel parameters and memory addresses) between iterations. <a href="classtf_1_1cudaFlow.html#a85789ed8a1f47704cf1f1a2b98969444" class="m-doc">tf::<wbr />cudaFlow::<wbr />offload</a> allows you to execute the graph immediately and then update the parameters for the next execution. When you offload a cudaFlow, an executable graph will be created, and you must NOT change the topology but the node parameters between successive executions.</p><pre class="m-code"><span class="mi">1</span><span class="o">:</span> <span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([</span><span class="o">&</span><span class="p">]</span> <span class="p">(</span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="o">&</span> <span class="n">cf</span><span class="p">)</span> <span class="p">{</span>
<span class="mi">2</span><span class="o">:</span> <span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">task</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">kernel</span><span class="p">(</span><span class="n">grid1</span><span class="p">,</span> <span class="n">block1</span><span class="p">,</span> <span class="n">shm1</span><span class="p">,</span> <span class="n">my_kernel</span><span class="p">,</span> <span class="n">args1</span><span class="p">...);</span>
<span class="mi">3</span><span class="o">:</span> <span class="n">cf</span><span class="p">.</span><span class="n">offload</span><span class="p">();</span> <span class="c1">// immediately run the cudaFlow once</span>
<span class="mi">4</span><span class="o">:</span>
<span class="mi">5</span><span class="o">:</span> <span class="n">cf</span><span class="p">.</span><span class="n">kernel</span><span class="p">(</span><span class="n">task</span><span class="p">,</span> <span class="n">grid2</span><span class="p">,</span> <span class="n">block2</span><span class="p">,</span> <span class="n">shm2</span><span class="p">,</span> <span class="n">my_kernel</span><span class="p">,</span> <span class="n">args2</span><span class="p">...);</span>
<span class="mi">6</span><span class="o">:</span> <span class="n">cf</span><span class="p">.</span><span class="n">offload</span><span class="p">();</span> <span class="c1">// run the cudaFlow again with the same graph topology</span>
<span class="mi">7</span><span class="o">:</span> <span class="c1">// but with different kernel parameters</span>
<span class="mi">8</span><span class="o">:</span> <span class="p">});</span></pre><p>Debrief:</p><ul><li>Line 2 creates a kernel task to run <code>my_kernel</code> with the given parameters.</li><li>Line 3 offloads the cudaFlow and performs an immediate execution.</li><li>Line 5 updates the parameters of <code>my_kernel</code> through its task.</li><li>Line 6 executes the cudaFlow again with updated kernel parameters.</li></ul><p>Between successive offloads (i.e., executions of a cudaFlow), you can update the task parameters, such as changing the kernel execution parameters and memory operation parameters. However, you must <em>NOT</em> change the topology of an offloaded cudaFlow. Each method of task creation in <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a> has an overload that updates the parameters of the task created from the same creation method.</p><aside class="m-note m-warning"><h4>Attention</h4><p>There are a few restrictions on updating task parameters in a cudaFlow. Notably, you must <em>NOT</em> change the topology of an offloaded graph. In addition, update methods have the following limitations:</p><ul><li>kernel task<ul><li>The kernel function is not allowed to change. This restriction applies to all algorithm tasks that are created using lambda.</li></ul></li><li>memset and memcpy tasks:<ul><li>The CUDA device(s) to which the operand(s) was allocated/mapped cannot change</li><li>The source/destination memory must be allocated from the same contexts as the original source/destination memory.</li></ul></li></ul></aside></section><section id="UsecudaFlowInAStandaloneEnvironment"><h2><a href="#UsecudaFlowInAStandaloneEnvironment">Use cudaFlow in a Standalone Environment</a></h2><p>You can use <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a> in a standalone environment without going through <a href="classtf_1_1Taskflow.html" class="m-doc">tf::<wbr />Taskflow</a> and offloads it to a GPU from the caller thread. All the features we have discussed so far apply to the standalone use. The following code gives an example of using a standalone cudaFlow to create a saxpy task graph that runs on a GPU.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span> <span class="n">cf</span><span class="p">;</span> <span class="c1">// create a standalone cudaFlow</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">h2d_x</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dx</span><span class="p">,</span> <span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_x"</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">h2d_y</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dy</span><span class="p">,</span> <span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"h2d_y"</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">d2h_x</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">dx</span><span class="p">,</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_x"</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">d2h_y</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="n">dy</span><span class="p">,</span> <span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">"d2h_y"</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span> <span class="n">saxpy</span> <span class="o">=</span> <span class="n">cf</span><span class="p">.</span><span class="n">kernel</span><span class="p">((</span><span class="n">N</span><span class="o">+</span><span class="mi">255</span><span class="p">)</span><span class="o">/</span><span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">saxpy</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="mf">2.0f</span><span class="p">,</span> <span class="n">dx</span><span class="p">,</span> <span class="n">dy</span><span class="p">)</span>
<span class="p">.</span><span class="n">name</span><span class="p">(</span><span class="s">"saxpy"</span><span class="p">);</span>
<span class="n">saxpy</span><span class="p">.</span><span class="n">succeed</span><span class="p">(</span><span class="n">h2d_x</span><span class="p">,</span> <span class="n">h2d_y</span><span class="p">)</span> <span class="c1">// kernel runs after host-to-device copy</span>
<span class="p">.</span><span class="n">precede</span><span class="p">(</span><span class="n">d2h_x</span><span class="p">,</span> <span class="n">d2h_y</span><span class="p">);</span> <span class="c1">// kernel runs before device-to-host copy</span>
<span class="n">cf</span><span class="p">.</span><span class="n">offload</span><span class="p">();</span> <span class="c1">// offload and run the standalone cudaFlow once</span></pre><p>When using cudaFlow in a standalone environment, it is your choice to decide its GPU context. The following example creates a cudaFlow and executes it on GPU 0.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">cudaScopedDevice</span> <span class="n">gpu</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span> <span class="n">cf</span><span class="p">;</span> <span class="c1">// create a standalone cudaFlow on GPU 0</span>
<span class="n">cf</span><span class="p">.</span><span class="n">offload</span><span class="p">();</span> <span class="c1">// run the capturer once on GPU 0</span></pre><aside class="m-note m-info"><h4>Note</h4><p>In the standalone mode, a written cudaFlow will not be executed untile you explicitly call an offload method, as there is neither a taskflow nor an executor.</p></aside></section>
</div>
</div>
</div>
</article></main>
<div class="m-doc-search" id="search">
<a href="#!" onclick="return hideSearch()"></a>
<div class="m-container">
<div class="m-row">
<div class="m-col-m-8 m-push-m-2">
<div class="m-doc-search-header m-text m-small">
<div><span class="m-label m-default">Tab</span> / <span class="m-label m-default">T</span> to search, <span class="m-label m-default">Esc</span> to close</div>
<div id="search-symbolcount">…</div>
</div>
<div class="m-doc-search-content">
<form>
<input type="search" name="q" id="search-input" placeholder="Loading …" disabled="disabled" autofocus="autofocus" autocomplete="off" spellcheck="false" />
</form>
<noscript class="m-text m-danger m-text-center">Unlike everything else in the docs, the search functionality <em>requires</em> JavaScript.</noscript>
<div id="search-help" class="m-text m-dim m-text-center">
<p class="m-noindent">Search for symbols, directories, files, pages or
modules. You can omit any prefix from the symbol or file path; adding a
<code>:</code> or <code>/</code> suffix lists all members of given symbol or
directory.</p>
<p class="m-noindent">Use <span class="m-label m-dim">↓</span>
/ <span class="m-label m-dim">↑</span> to navigate through the list,
<span class="m-label m-dim">Enter</span> to go.
<span class="m-label m-dim">Tab</span> autocompletes common prefix, you can
copy a link to the result using <span class="m-label m-dim">⌘</span>
<span class="m-label m-dim">L</span> while <span class="m-label m-dim">⌘</span>
<span class="m-label m-dim">M</span> produces a Markdown link.</p>
</div>
<div id="search-notfound" class="m-text m-warning m-text-center">Sorry, nothing was found.</div>
<ul id="search-results"></ul>
</div>
</div>
</div>
</div>
</div>
<script src="search-v1.js"></script>
<script src="searchdata-v1.js" async="async"></script>
<footer><nav>
<div class="m-container">
<div class="m-row">
<div class="m-col-l-10 m-push-l-1">
<p>Taskflow handbook is part of the <a href="https://taskflow.github.io">Taskflow project</a>, copyright © <a href="https://tsung-wei-huang.github.io/">Dr. Tsung-Wei Huang</a>, 2018–2022.<br />Generated by <a href="https://doxygen.org/">Doxygen</a> 1.8.14 and <a href="https://mcss.mosra.cz/">m.css</a>.</p>
</div>
</div>
</div>
</nav></footer>
</body>
</html>
You can’t perform that action at this time.
