Jun/10

# Easy CUDA Fermi 10 times speed increase with Kappa

## Kappa can show an easy ten times speed increase on GF100 GPUs (compared to CUDA driver or runtime 3.1 API programs)

Here are two Kappa scheduling scripts that differ only in the assignment of a stream to a kernel: one assigns the same stream to all kernel executions so that the kernels execute sequentially, while the other allows Kappa's default behavior of assigning different streams so that the kernels execute concurrently. The results shown are for a GTX 470 but should not be significantly different for a GTX 480, C2050, or C2070. These results use CUDA Toolkit 3.1 with Kappa 1.2.0 (and do not forget to use the 256.35 driver; you only get Toolkit 3.0 performance with the 195.xx drivers).
Here is the kernel to be executed (extracted from the NVIDIA SDK, with `extern "C"` added so that the kernel is externally visible):
```cuda
/*
 * Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and
 * proprietary rights in and to this software and related documentation.
 * Any use, reproduction, disclosure, or distribution of this software
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA)
 * associated with this source code for terms and conditions that govern
 * your use of this NVIDIA software.
 */

//
// This sample demonstrates the use of streams for concurrent execution
//
// Devices of compute capability 1.x will run the kernels one after another
// Devices of compute capability 2.0 or higher can overlap the kernels
//
extern "C" __global__ void mykernel( int *a, int n )
{
    int idx = threadIdx.x;
    int value = 1;
    for(int i=0; i<n; i++)
        value *= sin( (float)i ) + tan( (float)i );

    a[idx] = value;
}
```

(Save this into a file named: cuda/concurrentKernels.cu.)

Here is the version of the scheduling script, nvnonconcurrent.k, that is not concurrent (it is sequential because every kernel runs on the same stream):
 

```
!CUDA/Kernel STREAM='astream' BLOCKSHAPE=[ 4, 64 ] -> mykernel@concurrent(A,#n) [ A = #n ];
//!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_SPIN} -> context;
!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_AUTO} -> context;
!Value -> nblocks = 4; // number of blocks
!Value -> nthreads = 64; // number of threads
!Value -> n = 50000;
!Value -> nkernels = 8; // number of kernels
!Value -> nelements = (8 * 64); // variable A size
!CUDA/Module MODULE_TYPE=%KAPPA{CU_MODULE} -> concurrent = 'concurrentKernels.cu';
!Variable VARIABLE_TYPE=%KAPPA{DeviceOnly} -> A(#nelements,%sizeof{uint32_t});
!Timer -> mult;
!Subroutine -> kernelsub;
!Timer -> mult;
!Subroutine EXPAND=true LOOP=100 -> kernelsub;
!Timer -> mult;
!CUDA/Kernel/Attributes MODULE=concurrent -> mykernel;
!Print ( 'MaxThreadsPerBlock', /kappa/CUDA/concurrent/mykernel#MaxThreadsPerBlock, 'RegistersPerThread', /kappa/CUDA/concurrent/mykernel#RegistersPerThread );
!Print ( 'StaticSharedMemory', /kappa/CUDA/concurrent/mykernel#StaticSharedMemory, 'ConstantMemory', /kappa/CUDA/concurrent/mykernel#ConstantMemory, 'ThreadLocalMemory', /kappa/CUDA/concurrent/mykernel#ThreadLocalMemory );
!Print ( 'PTXVersion', /kappa/CUDA/concurrent/mykernel#PTXVersion, 'BinaryVersion', /kappa/CUDA/concurrent/mykernel#BinaryVersion );
!Free -> A;
!CUDA/ModuleUnload -> concurrent;
!ContextReset -> Context_reset;
//!Context -> context;
!Stop;
!Finish;
```

and here is the version, nvconcurrent.k, that executes concurrently (it omits the STREAM='astream' assignment that forced all kernels onto the same stream):
 

```
!CUDA/Kernel BLOCKSHAPE=[ 4, 64 ] -> mykernel@concurrent(A,#n) [ A = #n ];
//!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_SPIN} -> context;
!Context CONTEXT_FLAGS=%CUDA{CU_CTX_SCHED_AUTO} -> context;
!Value -> nblocks = 4; // number of blocks
!Value -> nthreads = 64; // number of threads
!Value -> n = 50000;
!Value -> nkernels = 8; // number of kernels
!Value -> nelements = (8 * 64); // variable A size
!CUDA/Module MODULE_TYPE=%KAPPA{CU_MODULE} -> concurrent = 'concurrentKernels.cu';
!Variable VARIABLE_TYPE=%KAPPA{DeviceOnly} -> A(#nelements,%sizeof{uint32_t});
!Timer -> mult;
!Subroutine -> kernelsub;
!Timer -> mult;
!Subroutine EXPAND=true LOOP=100 -> kernelsub;
!Timer -> mult;
!CUDA/Kernel/Attributes MODULE=concurrent -> mykernel;
!Print ( 'MaxThreadsPerBlock', /kappa/CUDA/concurrent/mykernel#MaxThreadsPerBlock, 'RegistersPerThread', /kappa/CUDA/concurrent/mykernel#RegistersPerThread );
!Print ( 'StaticSharedMemory', /kappa/CUDA/concurrent/mykernel#StaticSharedMemory, 'ConstantMemory', /kappa/CUDA/concurrent/mykernel#ConstantMemory, 'ThreadLocalMemory', /kappa/CUDA/concurrent/mykernel#ThreadLocalMemory );
!Print ( 'PTXVersion', /kappa/CUDA/concurrent/mykernel#PTXVersion, 'BinaryVersion', /kappa/CUDA/concurrent/mykernel#BinaryVersion );
!Free -> A;
!CUDA/ModuleUnload -> concurrent;
!ContextReset -> Context_reset;
//!Context -> context;
!Stop;
!Finish;
```

Here are the results.
1> nonconcurrent:
```
/usr/bin/time ikappa k/nvnonconcurrent.k
Processing time: 0.141728 (ms)
MaxThreadsPerBlock 1024 RegistersPerThread 18
StaticSharedMemory 0 ConstantMemory 24 ThreadLocalMemory 4
PTXVersion 10 BinaryVersion 20
Processing time: 4486.72 (ms)
1.40user 0.17system 0:04.72elapsed 33%CPU (0avgtext+0avgdata 81376maxresident)k
0inputs+0outputs (0major+1867minor)pagefaults 0swaps
```

2> concurrent:
```
/usr/bin/time ikappa k/nvconcurrent.k
Processing time: 0.013824 (ms)
MaxThreadsPerBlock 1024 RegistersPerThread 18
StaticSharedMemory 0 ConstantMemory 24 ThreadLocalMemory 4
PTXVersion 10 BinaryVersion 20
Processing time: 391.836 (ms)
0.24user 0.08system 0:00.61elapsed 52%CPU (0avgtext+0avgdata 81392maxresident)k
0inputs+0outputs (0major+1867minor)pagefaults 0swaps
```

The first 'Processing time' for each result is for executing the kernel once--the second 'Processing time' is for executing it 100 times (the LOOP parameter to the subroutine expansion).
(With CUDA ToolKit 3.1/Driver 195.xx, the concurrent times were around 1200 (ms)).

So this shows a speedup of:

4486.72 / 391.836 ≈ 11.45
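The speedup arithmetic can be checked directly from the two 100-iteration timings above:

```python
# Speedup of the concurrent run over the nonconcurrent run,
# using the second "Processing time" from each result (100 kernel launches).
nonconcurrent_ms = 4486.72
concurrent_ms = 391.836

speedup = nonconcurrent_ms / concurrent_ms
print(round(speedup, 2))   # -> 11.45
```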

Please note that this is not some esoteric speedup that is hard to achieve consistently; it is the default behavior of the Kappa Library. As has been discussed on the NVIDIA forums, if you do not have a scheduler, then synchronizations occur at algorithm steps or component boundaries, if not much more often than that. This limits the amount of concurrent kernel execution that can occur. With a scheduler, kernels can be automatically assigned to streams and asynchronously launched in batches so that they execute concurrently. This also goes a long way toward solving the problem of keeping the GPU occupied. Even if you hand code precise algorithms, tuned to particular GPUs to achieve full occupancy and concurrent kernel execution, you will still introduce arbitrary synchronization points and not achieve the concurrent kernel execution and GPU occupancy that Kappa's scheduler can achieve.
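The stream-assignment idea can be sketched abstractly. This is an illustrative model only, not Kappa's actual implementation: it just shows how handing each launch a different stream id from a pool (versus pinning everything to one stream) is what separates the concurrent and nonconcurrent scripts above.

```python
# Illustrative model (NOT Kappa's implementation): a scheduler that hands each
# kernel launch a stream id from a fixed-size pool, round-robin. Launches on
# distinct streams may overlap on a Fermi GPU; launches on one stream serialize.
from itertools import cycle

def assign_streams(launches, pool_size):
    """Map each launch to a stream id drawn round-robin from the pool."""
    pool = cycle(range(pool_size))
    return [(kernel, next(pool)) for kernel in launches]

launches = ['mykernel'] * 8

# Pool of one stream: every launch serializes (the nvnonconcurrent case).
print(assign_streams(launches, 1))

# Pool of eight streams: each launch gets its own stream (the nvconcurrent case).
print(assign_streams(launches, 8))
```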

The final benefit of the Kappa scheduler is that it schedules both CUDA and C/C++ (OpenMP) kernels. There are plenty of academic papers arguing that one or the other is better for certain algorithms; the point of the Kappa library is that you can choose which (or both) to use for any given algorithm step, and the Kappa scheduler will make them flow together.

You can also do the calculation that NVIDIA does in their example:

391.836 / (100 * 0.013824) = 283

which is the formula:

(N-iteration time) / (number of iterations × single-iteration time)

but I do not believe that result: the single-iteration time in the denominator is too variable, among other problems.
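The skepticism about the denominator is easy to demonstrate numerically: the formula reproduces the 283 figure, but small, plausible variations in the single-iteration measurement swing the result wildly.

```python
# NVIDIA's concurrency-factor formula applied to the timings above,
# plus a look at its sensitivity to the single-iteration measurement.
loop_ms = 391.836       # 100-iteration "Processing time" from the concurrent run
iterations = 100
single_ms = 0.013824    # single-iteration "Processing time"

print(round(loop_ms / (iterations * single_ms)))   # -> 283

# Hypothetical nearby values for the (noisy) single-kernel time:
for single in (0.010, 0.013824, 0.020):
    print(single, round(loop_ms / (iterations * single)))
```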

Just for fun, you can try adding:

```
CACHE=%CUDA{CU_FUNC_CACHE_PREFER_L1}
```

or:

```
CACHE=%CUDA{CU_FUNC_CACHE_PREFER_SHARED}
```

similar to:

```
!CUDA/Kernel BLOCKSHAPE=[ 4, 64 ] CACHE=%CUDA{CU_FUNC_CACHE_PREFER_SHARED} -> mykernel@concurrent(A,#n) [ A = #n ];
```

to try the two different cache/shared memory options.

Also try changing (or adding) the STREAM_POOL_SIZE configuration setting for Kappa in the kappa.conf file:

```
[/Kappa]
PROCESSES_INCREMENT=8
STREAM_POOL_SIZE=128
```

It is safe to set STREAM_POOL_SIZE large (2048 is fine, for example), since real CUDA streams are only created if your program can make use of them.
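Why a large pool costs nothing can be sketched with a lazy-creation model. This is an illustrative model, not Kappa's actual code: the pool size is just a ceiling, and a real stream (think `cuStreamCreate`) would only be made the first time a slot is actually used.

```python
# Illustrative model (NOT Kappa's implementation) of a lazily-populated stream
# pool: slots are only ids until first use, so a large STREAM_POOL_SIZE does
# not allocate anything up front.
class LazyStreamPool:
    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.created = {}   # slot id -> stream handle (stand-in for cuStreamCreate)

    def stream(self, slot):
        """Return the stream for a slot, creating it on first use."""
        assert 0 <= slot < self.pool_size
        if slot not in self.created:
            self.created[slot] = object()   # stand-in for a real CUDA stream
        return self.created[slot]

pool = LazyStreamPool(2048)   # large pool: zero streams exist so far
for slot in range(8):         # this program only ever touches 8 slots
    pool.stream(slot)
print(len(pool.created))      # -> 8
```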

You can try this using the free license of Kappa, and you can try your own kernels (assuming you have a GF100-class card). These scheduling scripts show you the JIT-compiled attributes for the kernel as actually used by the GPU; this can be useful if you are trying to tweak your kernel to execute more concurrently.

You can, of course, write your own library with a scheduler, and it will probably end up looking a lot like the Kappa Library. But the Kappa Library is not expensive, you can start using it today, and you can save yourself the time, delay, and expense of creating your own.