Psi Lambda LLC | Parallel computing made practical.



Kappa Quick Start Guide for Linux

This quick start guide should make it apparent that Kappa keeps things simple while providing a superset of the CUDA runtime. Kappa gives access to the CUDA driver functionality–see the Kappa User Guide and the Kappa Reference Manual for details. Kappa does not need the NVIDIA SDK, and does not need the NVIDIA Toolkit if PTX files are used.

Download and installation

Please use the appropriate guide for download and installation:

Quick Start Example Files

The examples below refer to the following files:

Kappa installation verification

The check.k Kappa script may be used to verify that Kappa is installed properly and is compatible with the system it is installed on by checking some basic functionality. Save the check.k script to a file named check.k (make sure that the enclosing <kappa> and </kappa> tags are present–Kappa only processes content enclosed by these tags). The script can then be executed by running the following command within the terminal window:

ikappa check.k

Details on the ikappa program and the Kappa language are given in the Kappa_User_Guide.pdf which can be read online. Source code for ikappa is installed (by default in /usr/share/kappa/extras/ikappa) by installing kappa-doc.

    This script:

  • uses the !CUDAConfig; command to load the CUDA attributes for the device as configuration values,
  • prints these attributes of the CUDA device,
  • prints a value from the cuda_translation.conf file (the CUDA_VERSION value),
  • creates a CUDA context,
  • stops the background execution engine,
  • and finishes (exits)

  • <kappa>
    !CUDAConfig;
    !Print ( 'Name', /Kappa/CUDA/GPU/Current#Name );
    !Print ( 'Major', /Kappa/CUDA/GPU/Current#Major );
    !Print ( 'Minor', /Kappa/CUDA/GPU/Current#Minor );

    !Print ( 'Recommended CUDA Version:', %CUDA{CUDA_VERSION} );
    !Context -> context;
    !Stop;
    !Finish;
    </kappa>

The output from running it should look similar to the following:

Kappa demonstration mode
Name GeForce GTX 470
Major 2
Minor 0
GlobalMemory 1341849600
ConstantMemory 65536
SharedMemoryPerBlock 49152
RegistersPerBlock 32768
WarpSize 32
MaxThreadsPerBlock 1024
MaxThreads 1024 1024 64
MaxGridSize 65535 65535 1
MemoryPitch 2147483647
TextureAlignment 512
ClockRate 1215000
MultiProcessorCount 14
ConcurrentCopyExecute 1
KernelTimeout 0
Integrated 0
CanMapHostMemory 1
ComputeMode 0
MaximumTexture1DWidth 8192
MaximumTexture2DWidth 65536
MaximumTexture2DHeight 65536
MaximumTexture3DWidth 2048
MaximumTexture3DHeight 2048
MaximumTexture3DDepth 2048
MaximumArrayWidth 16384
MaximumArrayHeight 16384
MaximumArraySlices 2048
SurfaceAlignment 512
ConcurrentKernels 1
ECCSupported 0
Recommended CUDA Version: 3000
Number of variables: 0
Number of CommandQueue items: 35

The line:

Kappa demonstration mode

and the lines:

Number of variables: 0
Number of CommandQueue items: 35

will not appear if a Software License Key file is properly installed. These messages appear when the Kappa Library is running in the free mode.

C (C++) and CUDA Modules

In the following example, you will see that Kappa Values are dynamic, allowing dynamic resizing even of the kernel launch grid, blockshape, etc. The example shows Values being retrieved from configuration values and calculations, but note that CUDA and C (C++) modules can also change them (please read the Kappa_User_Guide.pdf or the Kappa Library User Guide online for more details). You will also see that you may dynamically use compiled kernel attributes such as RegistersPerThread. CUDA modules are JIT compiled so that the binary version is optimized for the actual GPU in use.

Kappa automatically schedules kernel execution based on dependencies. This allows for concurrent or out-of-order kernel execution without explicit stream management. A C (C++) or CUDA kernel may be dependent on Kappa Values or Variables. A Kappa ‘map’ is used to specify the input and output dependencies of kernels. A map is given as part of a kernel statement and looks similar to the following:

[ C = A B #WA #WB ]

where, within the square brackets, is a comma delimited list of output and input specifications. The above example shows the map for a kernel that has the Variable ‘C’ as an output, and the Variables ‘A’ and ‘B’ and the Values ‘WA’ and ‘WB’ as inputs. In other words, the kernel depends on ‘A’, ‘B’, ‘WA’, and ‘WB’, and produces or changes ‘C’, which other kernels or statements may depend on. Variable and Value statements (as well as other statements) have implicit map dependencies–for example, creating a Variable or Value makes it available for other statements to depend on.

The example also demonstrates OpenMP functionality. Kappa does not rely on OpenMP for its internal functioning but does enable it for C (C++) module usage.

Module example setup

To try the following example you will need to have the g++ compile environment installed properly. Usually installing a gcc-c++ system package and its dependencies is sufficient–check your Linux distribution documentation or forums for details.

Download the quickstart.tar.gz file and then run the following commands in a terminal window:

tar zxvf quickstart.tar.gz
cd quickstart
pwd

Edit the user.conf file (in the newly created quickstart directory) and change </My/Path/To/QuickStart> to the output path that was printed by the pwd command. Then, to put this configuration file where Kappa will use it, execute the following commands:

mkdir ~/.kappa.d
cp user.conf ~/.kappa.d/.

To compile the libTestModule shared library module, execute the following commands:

cd TestModule
cd ..

The commands in the file assume that you have g++ properly installed, that the ‘-fPIC’ and ‘-fopenmp’ options to g++ work correctly on your system and that the ‘-shared’ and ‘-Wl,-soname’ g++ linker options also work correctly on your system.
If the preceding commands have executed correctly, then running the command:

ikappa modules.k

will produce output similar to the following:

Kappa demonstration mode
MaxThreadsPerBlock 1024 RegistersPerThread 21
StaticSharedMemory 2048 ConstantMemory 0 ThreadLocalMemory 0
PTXVersion 10 BinaryVersion 20
Hello from thread 0, nthreads 2 and arg: 128
Hello from thread 1, nthreads 2 and arg: 128
Device: Starting Free Memory: 1272446976
Ending Free Memory: 1272446976
Difference Memory: 0
Total: 1341849600
Used: 106496
Number of variables: 3
Number of CommandQueue items: 29

(for Fermi hardware) or similar to the following:

Kappa demonstration mode
MaxThreadsPerBlock 512 RegistersPerThread 13
StaticSharedMemory 2084 ConstantMemory 0 ThreadLocalMemory 0
PTXVersion 10 BinaryVersion 11
Hello from thread 0, nthreads 2 and arg: 128
Hello from thread 1, nthreads 2 and arg: 128
Device: Starting Free Memory: 87945216
Ending Free Memory: 87945216
Difference Memory: 0
Total: 266010624
Used: 86016
Number of variables: 3
Number of CommandQueue items: 29

The above example used the PTX file in preference to the CUDA ‘.cu’ file since it is newer. If you have the NVIDIA Toolkit 3.1 installed, you may edit /etc/kappa.d/kappa.conf and change the CUDA_PATH and NVCC_PTX to have appropriate values–change every occurrence of ‘/usr/local/cuda’ to be the correct path to your installation of the NVIDIA Toolkit. Then if you remove the cuda/matrixMul_kernel.ptx file and run the command:

ikappa modules.k

Kappa should compile the CUDA ‘.cu’ file using ‘nvcc’ to (re)produce the cuda/matrixMul_kernel.ptx file. Subsequent runs of the ikappa command will use the existing cuda/matrixMul_kernel.ptx file.

Here is a brief tour of the statements in the modules.k script–full details are in the Kappa User Guide:

  • A Context statement to create a CUDA context.

  • !Context -> context;

  • Value statements to configure the dimensions of A, B, and C matrices.

  • !Value -> WA = (3 * {BLOCK_SIZE}); // Matrix A width
    !Value -> HA = (5 * {BLOCK_SIZE}); // Matrix A height
    !Value -> WB = (8 * {BLOCK_SIZE}); // Matrix B width
    !Value -> HB = #WA; // Matrix B height
    !Value -> WC = #WB; // Matrix C width
    !Value -> HC = #HA; // Matrix C height

  • Statements to load the libTestModule C++ shared library module and the CUDA matrixMul kernel module. These statements use configuration values defined in the user.conf file to find the correct paths and files.

  • !C/Module -> testmodule={CMODULE};

  • Statements to create Variables for the A and B matrices.

  • !Variable -> A(#WA,#HA,%sizeof{float});
    !Variable -> B(#WB,#HB,%sizeof{float});

  • Statements to call the C++ randomInit functions to initialize the A and B matrices.

  • !C/Kernel MODULE='testmodule' -> randomInit (A,{A_SIZE}) [A];
    !C/Kernel -> randomInit@testmodule (B,{B_SIZE}) [B];

  • A statement to create a Variable for the C matrix and initialize it.

  • !Variable -> C(#WC,#HC,%sizeof{float});

  • A statement to do a matrix multiplication by calling the matrixMul kernel (from the NVIDIA SDK example).

  • !CUDA/Kernel
    GRID=[ 8, 5 ]
    SHAREDMEMORY=( 2 * {BLOCK_SIZE} * {BLOCK_SIZE} * %sizeof{float} )
    -> matrixMul@matrixMul(C,A,B,#WA,#WB) [ C = A B #WA #WB ];

  • A statement to check the result of the matrixMul kernel by calling CheckGold C++ function (based on the function from the NVIDIA SDK example)

  • !C/Kernel -> CheckGold@testmodule(A,B,C,#HA,#WA,#WB,#HC,#WC)
    [ = A B C #HA #WA #WB #HC #WC ];

  • Statements to free the Variables for the matrices

  • !Free -> A;
    !Free -> B;
    !Free -> C;

  • Statements to load the matrixMul kernel attributes into Values and print them

  • !CUDA/Kernel/Attributes MODULE=matrixMul -> matrixMul;
    !Print ( 'MaxThreadsPerBlock',
    /kappa/CUDA/matrixMul/matrixMul#MaxThreadsPerBlock,
    'RegistersPerThread',
    /kappa/CUDA/matrixMul/matrixMul#RegistersPerThread );
    !Print ( 'StaticSharedMemory',
    /kappa/CUDA/matrixMul/matrixMul#StaticSharedMemory,
    'ConstantMemory',
    /kappa/CUDA/matrixMul/matrixMul#ConstantMemory,
    'ThreadLocalMemory',
    /kappa/CUDA/matrixMul/matrixMul#ThreadLocalMemory );
    !Print ( 'PTXVersion', /kappa/CUDA/matrixMul/matrixMul#PTXVersion,
    'BinaryVersion', /kappa/CUDA/matrixMul/matrixMul#BinaryVersion );

  • A statement to call an OpenMP function

  • !C/Kernel -> OpenMP@testmodule(#WB) [ = #WB ];

  • Statements to unload the CUDA and C++ modules

  • !CUDA/ModuleUnload -> matrixMul;
    !C/ModuleUnload -> testmodule;

  • A statement to reset the context.

  • !ContextReset -> Context_reset;

  • The Context statement again to report the device memory usage.

  • !Context -> context;

  • The statements to Stop and Finish.

  • !Stop;
    !Finish;

Kappa Configuration Files

Feel free to edit the configuration files in /etc/kappa.d. Please refer to the Kappa_User_Guide.pdf or the Kappa Library User Guide online documentation for details on configuration settings. A copy of the original configuration files can be installed (by default in /usr/share/kappa/kappa.d/) by installing kappa-doc.

Software License Key installation

Buying a Software License Key file (and properly installing it) removes the extra messages about running in demonstration mode and about the number of operations and variables accessed. It also removes the free-mode limitations of: 10 variables, 125 puts, no OutputRoutines, and one Kappa instance.

Software License Key files and Kappa Library licensing are cross platform–system platform is not a consideration for licensing and the same Software License Key file works on all Kappa Library platforms.

To obtain a Software License Key file, follow the instructions on the License Key(s) page. Once a button for your Software License Key file appears on that page, click the button to download the file and save it. Copy the Software License Key to one of the following recommended configuration directories: /etc/kappa.d for system wide configuration files or ~/.kappa.d for user specific files. For more information, please read the Kappa_User_Guide.pdf or the Kappa Library User Guide online.

ASYNC attribute and SQL and Expand keywords.

Output from the following example (with a few print statements added) on a dual core, midrange system processing four million rows in parallel is:

/usr/bin/time BUILD/m64/opengl/ikappa/ikappa sqltest/read.k
number of categories: 4 categories: 1 2 3 4
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
Processing time: 2730.76 (ms)
8.69user 2.00system 0:22.84elapsed 46%CPU (0avgtext+0avgdata 3821920maxresident)k
0inputs+0outputs (0major+233182minor)pagefaults 0swaps

The example output shown here is for a table in star format containing four million rows and, besides the primary key and the category dimension, six other dimension fields and three measure fields consisting of an integer field, a float field and a double field. This shows less than three seconds for the data transfer to the GPU and less than 23 seconds total program execution time for a seven dimensional hypercube with three measures with four million data points. The numbers shown correspond to a data transfer bandwidth utilization greater than 45% of maximum bandwidth. These numbers are shown to help you with your sizing–using systems with quad core or more processor cores and with higher speed memory transfer components will give correspondingly higher throughput. This is OLAP that you can afford.

The following example shows the combined usage of the SQL and the Expand keywords to dynamically size and run in parallel a task to retrieve data from a SQL data source for processing by a GPU. This example assumes a database table in standard star format named STAR_TABLE that has a field, cat_pk_sid, that is usable for splitting the processing into parallel operations. This field would generally have a foreign key relationship to a master table that defines the permissible values for this field.
This example consists of three Kappa subroutines: sqlio, sqlprocess, and sqlread. The subroutine sqlio is unrolled within the sqlprocess subroutine using the Subroutine statement. The subroutine sqlprocess is expanded in the sqlread subroutine which is invoked in the main Kappa scheduling script—it also expands the labels in the sqlio subroutine.
The SQL keyword read commands in the sqlio subroutine (and their corresponding select commands in the sqlprocess subroutine) are executed asynchronously. The CUDA/Kernel launches in the sqlio subroutine use the same stream identifier as the corresponding Variable creation statements in the sqlprocess subroutine, and so they execute on the same CUDA streams as the Variables use for data transfer. Since the streams are expanded, the data transfers overlap with other data transfers and kernel launches and, if a suitable (GF100) GPU is being used, the kernel launches execute concurrently.
The SQL operations on the dbhandle_$a are expanded and so, if they have an ASYNC=true attribute, run asynchronously in parallel.
This example is able to execute the SQL operations in parallel and the CUDA kernels concurrently at very high speed on commodity multi-core CPU and GF100 hardware.

Example setup

To set up this example, you need to have installed an Apache Portable Runtime database driver for your database as well as the client driver for the database. In the example below, PostgreSQL is used with the driver packaged as apr-util-pgsql.

The example schema is available as sqlstar_example.architect, a file for SQL Power Architect (which is available both as open source and as commercial software). A diagram for the schema is available in the Kappa_User_Guide.pdf or the Kappa Library User Guide online. The ddl_load.k file may be used to create the schema and load the data–this script is meant to stress the Kappa library, not to be the most efficient way to create a schema and load data! The CUDA kernel used with this example (sqltest) does nothing.

For this example, the PGPARAMS value is stored in a configuration file, in the [/Kappa/Settings/USER] section, and looks approximately like:

PGPARAMS=host=cosmos port=5432 dbname=kappat user=pgquery password=mypassword

The sqlread.k example:

<kappa subroutine=sqlio labels='$a' labelset='sql'>
// The main IO loop
!SQL ASYNC=true FAST=true -> read@dbhandle_$a(OUT_$a, #chunk_size, #rows_read_$a);
!CUDA/Kernel STREAM=str_$a OUTPUT_WARNING=false -> sqltest@sqltest(OUT_$a, #rows_read_$a) [ = OUT_$a #rows_read_$a];
</kappa>

<kappa subroutine=sqlprocess labels='$a' labelset='sql'>
!SQL -> connect@dbhandle_$a('pgsql',{PGPARAMS});
!SQL ASYNC=true STAR=true -> select@dbhandle_$a('select pk_sid, dima, dimb, dimc, dimd, dime, dimf, measurea, measureb, measurec from star_table where cat_pk_sid= %u order by dima;', $a, Categories, '=%lu %u %u %u %u %u %u +%f %u %lf', #num_rows_$a, #num_cols_$a, #row_size_$a);

// Get the number of rows to process at once using an if evaluation.
!Value -> rows_allocate_$a = if ( ( #chunk_size < #num_rows_$a ) , #chunk_size , #num_rows_$a );
!Variable STREAM=str_$a VARIABLE_TYPE=%KAPPA{LocalToDevice} -> OUT_$a(#rows_allocate_$a, #row_size_$a);

// Calculate how many iterations based on the number of rows and
// how many rows to process at once.
!Value -> numloops_$a = ( #num_rows_$a / #chunk_size );

// Perform a synchronization so the #numloops_$a Value is ready
!Synchronize (#numloops_$a);
!Print ('number of loops: ', #numloops_$a, ' = ' , #num_rows_$a, ' / ' , #chunk_size );

!Subroutine LABELSET='sql' UNROLL=true LOOP=#numloops_$a -> sqlio;
!SQL -> disconnect@dbhandle_$a(); // disconnect dbhandle
</kappa>

<kappa subroutine=sqlread>

!CUDA/Module -> sqltest = 'sqltest/';

//Set the size of the data to process at once
!Value -> chunk_size = 65536;

// Connect to the database and get the categories to use for splitting into parallel processes
!SQL -> connect@dbmain('pgsql',{PGPARAMS});
!SQL -> select@dbmain('select distinct cat_pk_sid from star_table;', '%u', #num_rows_cat, #num_cols_cat, #row_size_cat);
!Variable -> Categories(#num_rows_cat,#row_size_cat);
!SQL -> read@dbmain(Categories,#num_rows_cat,#rows_read_cat);
!SQL -> disconnect@dbmain();

!Value -> cat_indice = Categories;
!Print ( 'number of categories: ', #rows_read_cat, 'categories: ', #cat_indice);
// Synchronize the Value of how many categories so that Expand can use it as an argument
!Synchronize (#rows_read_cat);

// Expand and run the processing in parallel across the categories
!Expand LABELSET=sql -> sqlprocess(#rows_read_cat);

// Unload, cleanup, stop
!CUDA/ModuleUnload -> sqltest;
</kappa>

<kappa>

// Setup the CUDA context and load the CUDA module
!Context -> context;

!Subroutine -> sqlread;

!ContextReset -> Context_reset;
</kappa>

Scheduler Shared Library

The previous example can be output as a C++ CMake project. The Process::OutputRoutines method, which may be invoked using the ikappa ‘-o’ option, creates a C++ file for each subroutine: sqlio, sqlprocess, and sqlread. This produces, at a minimum, the files CMakeLists.txt, sqlio.cpp, sqlprocess.cpp, and sqlread.cpp. Usually the CMakeLists.txt file should be edited to change the project name and output shared library name. Running CMake and make on the project creates a shared library. Assuming that the project was not changed, then running:

cmake .
make

will create:

Using a file named subread.k which contains:

<kappa>
!Context -> context;

!Subroutine -> sqlread;

!ContextReset -> Context_reset;
//!Context -> context;
</kappa>

and putting the file in a sqltest subdirectory lets the subroutines be loaded from the shared library (via the Process::LoadRoutine methods) and executed with the following command:

ikappa -m ./ -f sqlio -f sqlprocess -f sqlread subread.k

This loads the sqlio, sqlprocess, and sqlread subroutines and then, using the subread.k file, executes the sqlread subroutine (which calls the other two).

The C++ files could be manually written, but it is usually easier to use a scheduling script to create them.

Kappa Subroutines and Functions

Quick Start coming soon–meanwhile, please read the Kappa_User_Guide.pdf or the Kappa Library User Guide online.

ikappa/example program overview

Quick Start coming soon–meanwhile, please read the Kappa_User_Guide.pdf or the Kappa Library User Guide online.

