
Kappa Quick Start Guide for Macintosh

In this quick start guide, it should be apparent that Kappa makes things simple while being a superset of the CUDA runtime. Kappa gives access to the CUDA driver functionality–see the Kappa User Guide and the Kappa Reference Manual for details. Kappa does not need the NVIDIA SDK, and does not need the NVIDIA Toolkit if PTX files are used.

Download and installation

The necessary software for the Kappa Library (including links for the prerequisites) is available here

To be able to compile the examples, please install Xcode from the Apple developer web site. You will also need CMake installed.

Install Kappa by running the kappa-1.3.2-Darwin.dmg installer. This Quick Start Guide assumes that Kappa is installed in the default location: /usr/local. Please make sure to select the Development/Kappa Library Extras for installation (select Customize in the Installation Type step) in order to follow those sections of this Quick Start Guide.

Kappa Configuration Files

You must run the

kappa_config.pl

or

sudo kappa_config.pl

command to install the Kappa configuration files. The first form of the command installs the configuration files in the .kappa.d directory of the home directory of the user running the command. The second form installs the configuration files in /etc/kappa.d for use by all users on the system.
Feel free to edit the configuration files in /etc/kappa.d or ~/.kappa.d. Please refer to the Kappa_User_Guide.pdf or the Kappa Library User Guide online documentation for details on configuration settings. A copy of the original configuration files can be found in the /usr/local/share/kappa/conf.d folder.

Quick Start Example Files

The following examples refer to several files: check.k, quickstart.tar.gz, sqlstar_example.architect, ddl_load.k, and sqltest.cu. Each is introduced in the section that uses it.

Kappa installation verification

The check.k Kappa script may be used to verify that Kappa is installed properly and is compatible with the system it is installed on by checking some basic functionality. Save the check.k script to a file named check.k (make sure that the enclosing <kappa> and </kappa> tags are present–Kappa only processes content enclosed by these tags). The script can then be executed by running the following command within the terminal window:

ikappa check.k

Details on the ikappa program and the Kappa language are given in the Kappa_User_Guide.pdf or the Kappa Library User Guide online. Source code for ikappa is installed (by default in /usr/local/share/kappa/extras/ikappa) by selecting extras during installation.

This script:

  • uses the !CUDAConfig; command to load the CUDA attributes for the device as configuration values,
  • prints these attributes of the CUDA device,
  • prints a value from the cuda_translation.conf file (the CUDA_VERSION value),
  • creates a CUDA context,
  • stops the background execution engine,
  • and finishes (exits).

!CUDAConfig;
!Print ( 'Name', /Kappa/CUDA/GPU/Current#Name );
!Print ( 'Major', /Kappa/CUDA/GPU/Current#Major );
!Print ( 'Minor', /Kappa/CUDA/GPU/Current#Minor );

!Print ( 'Recommended CUDA Version:', %CUDA{CUDA_VERSION} );
!Context -> context;
!Stop;
!Finish;

The output from running it should look similar to the following:


Kappa demonstration mode
Name GeForce GTX 470
Major 2
Minor 0
GlobalMemory 1341849600
ConstantMemory 65536
SharedMemoryPerBlock 49152
RegistersPerBlock 32768
WarpSize 32
MaxThreadsPerBlock 1024
MaxThreads 1024 1024 64
MaxGridSize 65535 65535 1
MemoryPitch 2147483647
TextureAlignment 512
ClockRate 1215000
MultiProcessorCount 14
ConcurrentCopyExecute 1
KernelTimeout 0
Integrated 0
CanMapHostMemory 1
ComputeMode 0
MaximumTexture1DWidth 8192
MaximumTexture2DWidth 65536
MaximumTexture2DHeight 65536
MaximumTexture3DWidth 2048
MaximumTexture3DHeight 2048
MaximumTexture3DDepth 2048
MaximumArrayWidth 16384
MaximumArrayHeight 16384
MaximumArraySlices 2048
SurfaceAlignment 512
ConcurrentKernels 1
ECCSupported 0
Recommended CUDA Version: 3000
Number of variables: 0
Number of CommandQueue items: 35

The line:

Kappa demonstration mode

and the lines:

Number of variables: 0
Number of CommandQueue items: 35

will not appear if a Software License Key file is properly installed. These messages appear when the Kappa Library is running in the free mode.

C (C++) and CUDA Modules

In the following example, you will see that Kappa Values are dynamic and allow resizing at run time, even for the kernel launch grid, block shape, and similar parameters. The example shows Values being retrieved from configuration values and calculations, but note that CUDA and C (C++) modules can also change them (please read the Kappa_User_Guide.pdf or the Kappa Library User Guide online for more details). You will also see that you may dynamically use compiled kernel attributes such as RegistersPerThread. CUDA modules are JIT compiled so that the binary version is optimized for the actual GPU in use.

Kappa automatically schedules kernel execution based on dependencies. This allows for concurrent or out-of-order kernel execution without explicit stream management. A C (C++) or CUDA kernel may be dependent on Kappa Values or Variables. A Kappa ‘map’ is used to specify the input and output dependencies of kernels. A map is given as part of a kernel statement and looks similar to the following:

[ C = A B #WA #WB ]

where, within the square brackets, is a delimited list of output and input specifications. The above example shows the map for a kernel that has the Variable ‘C’ as an output, and the Variables ‘A’ and ‘B’ and the Values ‘WA’ and ‘WB’ as inputs. In other words, the kernel depends on ‘A’, ‘B’, ‘WA’, and ‘WB’, and produces or changes ‘C’, which other kernels or statements may depend on. Variable and Value statements (as well as other statements) have implicit map dependencies–for example, creating a Variable or Value makes it available for other statements to depend on.
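For example, in the following schematic fragment (the kernels ‘scale’ and ‘sum’ and the module ‘mymodule’ are hypothetical), the maps tell Kappa that the second kernel reads the Variable A that the first kernel writes, so the second launch is automatically scheduled after the first completes:

!CUDA/Kernel -> scale@mymodule(A,#WA) [ A = #WA ];
!CUDA/Kernel -> sum@mymodule(C,A,B) [ C = A B ];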

The example also demonstrates OpenMP functionality. Kappa does not rely on OpenMP for its internal functioning but does enable it for C (C++) module usage.
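As a rough sketch of what such an OpenMP function in a C (C++) module might look like, consider the following. This is only illustrative: the entry-point convention required by Kappa C modules is documented in the Kappa User Guide, and a plain int argument is assumed here to match the ‘Hello from thread … and arg: 128’ output shown later.

#include <omp.h>
#include <cstdio>

// Illustrative sketch: one line is printed per OpenMP thread.
// Compile into the dynamic library module with -fopenmp.
extern "C" void OpenMP(int arg)
{
    #pragma omp parallel
    {
        std::printf("Hello from thread %d, nthreads %d and arg: %d\n",
                    omp_get_thread_num(), omp_get_num_threads(), arg);
    }
}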

Module example setup

To try the following example, you will need the g++ compile environment properly installed. A proper installation of Xcode provides it.

Download the quickstart.tar.gz file and then run the following commands in a terminal window:

tar zxvf quickstart.tar.gz
cd quickstart
pwd

Edit the user.conf file (in the newly created quickstart directory) and change </My/Path/To/QuickStart> to the output path that was printed by the pwd command. Then, to put this configuration file where Kappa will use it, execute the following commands:

mkdir ~/.kappa.d
cp user.conf ~/.kappa.d/.
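For reference, user.conf uses the same section-and-key format as the other Kappa configuration files. The following is only an illustrative sketch (the actual key names, such as the CMODULE and CUDAMODULE values referenced by the modules.k script below, come from the downloaded file):

[/Kappa/Settings/USER]
CMODULE=/Users/me/quickstart/TestModule/libTestModule.dylib
CUDAMODULE=/Users/me/quickstart/cuda/matrixMul_kernel.ptx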

To compile the libTestModule dynamic library module, execute the following commands:

cd TestModule
sh compile.sh
cd ..

The commands in the compile.sh file assume that you have g++ properly installed, that the ‘-fPIC’ and ‘-fopenmp’ options to g++ work correctly on your system, and that the ‘-shared’ and ‘-dylinker_install_name’ g++ linker options also work correctly on your system.
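Should you need to adapt compile.sh, its compile step is along the following lines (file names are illustrative; consult the actual compile.sh for the exact invocation):

g++ -fPIC -fopenmp -shared -dylinker_install_name libTestModule.dylib \
    -o libTestModule.dylib TestModule.cpp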
If the preceding commands have executed correctly, then running the command:

ikappa modules.k

will produce output similar to the following (the interleaved ‘Hello from thread’ lines are the two OpenMP threads printing concurrently):

Kappa demonstration mode
PTXVersion BinaryVersion
Hello from thread Hello from thread 01, nthreads , nthreads 22 and arg: and arg: 128128

MaxThreadsPerBlock 512 RegistersPerThread 13
StaticSharedMemory 2084 ConstantMemory 0 ThreadLocalMemory 0
Test PASSED
Device: Starting Free Memory: 201195520
Ending Free Memory: 201195520
Difference Memory: 0
Total: 266010624
Used: 86016
Number of variables: 3
Number of CommandQueue items: 29

The above example used the PTX file in preference to the CUDA ‘.cu’ file since it is newer. If you have the NVIDIA Toolkit 3.1 installed, you may edit /etc/kappa.d/kappa.conf and change the CUDA_PATH and NVCC_PTX to have appropriate values–change every occurrence of ‘/usr/local/cuda’ to be the correct path to your installation of the NVIDIA Toolkit. Then if you remove the cuda/matrixMul_kernel.ptx file and run the command:

ikappa modules.k

Kappa should compile the CUDA ‘.cu’ file using ‘nvcc’ to (re)produce the cuda/matrixMul_kernel.ptx file. Subsequent runs of the ikappa command will use the existing cuda/matrixMul_kernel.ptx file.

Here is a brief tour of the statements in the modules.k script–full details are in the Kappa User Guide:

  • A Context statement to create a CUDA context.

    !Context -> context;

  • Value statements to configure the dimensions of the A, B, and C matrices.

    !Value -> WA = (3 * {BLOCK_SIZE}); // Matrix A width
    !Value -> HA = (5 * {BLOCK_SIZE}); // Matrix A height
    !Value -> WB = (8 * {BLOCK_SIZE}); // Matrix B width
    !Value -> HB = #WA; // Matrix B height
    !Value -> WC = #WB; // Matrix C width
    !Value -> HC = #HA; // Matrix C height

  • Statements to load the libTestModule C++ dynamic library module and the CUDA matrixMul kernel module. These statements use configuration values defined in the user.conf file to find the correct paths and files.

    !C/Module -> testmodule={CMODULE};
    !CUDA/Module MODULE_TYPE=%KAPPA{CU_MODULE} -> matrixMul = {CUDAMODULE};

  • Statements to create Variables for the A and B matrices.

    !Variable -> A(#WA,#HA,%sizeof{float});
    !Variable -> B(#WB,#HB,%sizeof{float});

  • Statements to call the C++ randomInit functions to initialize the A and B matrices.

    !C/Kernel MODULE='testmodule' -> randomInit (A,{A_SIZE}) [A];
    !C/Kernel -> randomInit@testmodule (B,{B_SIZE}) [B];

  • A statement to create a Variable for the C matrix and initialize it.

    !Variable VARIABLE_TYPE=%KAPPA{Device} DEVICEMEMSET=true
    -> C(#WC,#HC,%sizeof{float});

  • A statement to do a matrix multiplication by calling the matrixMul kernel (from the NVIDIA SDK example).

    !CUDA/Kernel
    GRID=[ 8, 5 ]
    BLOCKSHAPE=[ {BLOCK_SIZE} , {BLOCK_SIZE} ]
    SHAREDMEMORY=( 2 * {BLOCK_SIZE} * {BLOCK_SIZE} * %sizeof{float} )
    -> matrixMul@matrixMul(C,A,B,#WA,#WB) [ C = A B #WA #WB ];

  • A statement to check the result of the matrixMul kernel by calling the CheckGold C++ function (based on the function from the NVIDIA SDK example).

    !C/Kernel -> CheckGold@testmodule(A,B,C,#HA,#WA,#WB,#HC,#WC)
    [ = A B C #HA #WA #WB #HC #WC ];

  • Statements to free the Variables for the matrices.

    !Free -> A;
    !Free -> B;
    !Free -> C;

  • Statements to load the matrixMul kernel attributes into Values and print them.

    !CUDA/Kernel/Attributes MODULE=matrixMul -> matrixMul;
    !Print ( 'MaxThreadsPerBlock',
    /kappa/CUDA/matrixMul/matrixMul#MaxThreadsPerBlock,
    'RegistersPerThread',
    /kappa/CUDA/matrixMul/matrixMul#RegistersPerThread );
    !Print ( 'StaticSharedMemory',
    /kappa/CUDA/matrixMul/matrixMul#StaticSharedMemory,
    'ConstantMemory',
    /kappa/CUDA/matrixMul/matrixMul#ConstantMemory,
    'ThreadLocalMemory',
    /kappa/CUDA/matrixMul/matrixMul#ThreadLocalMemory );
    !Print ( 'PTXVersion', /kappa/CUDA/matrixMul/matrixMul#PTXVersion,
    'BinaryVersion', /kappa/CUDA/matrixMul/matrixMul#BinaryVersion );

  • A statement to call an OpenMP function.

    !C/Kernel -> OpenMP@testmodule(#WB) [ = #WB ];

  • Statements to unload the CUDA and C++ modules. NOTE: Due to an NVIDIA bug, CUDA/ModuleUnload may cause memory corruption and should be avoided, for now, on the Macintosh.

    !CUDA/ModuleUnload -> matrixMul;
    !C/ModuleUnload -> testmodule;

  • A statement to reset the context.

    !ContextReset -> Context_reset;

  • The Context statement again to report the device memory usage.

    !Context -> context;

  • The statements to Stop and Finish.

    !Stop;
    !Finish;

Software License Key installation

Buying a Software License Key file (and properly installing it) removes the extra messages about running in demonstration mode and about the number of operations or variables accessed. It also removes the free mode limitations: 10 variables, 125 puts, no OutputRoutines, and a single Kappa instance.

Software License Key files and Kappa Library licensing are cross platform–system platform is not a consideration for licensing and the same Software License Key file works on all Kappa Library platforms.

To obtain a Software License Key file, follow the instructions on the License Key(s) page. Once a button for your Software License Key file appears on that page, click the button to download and save the file. Copy the Software License Key to one of the recommended configuration directories: “/etc/kappa.d” for system wide configuration files, or “~/.kappa.d” for user specific files. For more information, please read the Kappa_User_Guide.pdf or the Kappa Library User Guide online.
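For example, assuming the downloaded key file is named kappa.license (the actual file name may differ):

mkdir -p ~/.kappa.d
cp kappa.license ~/.kappa.d/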

ASYNC attribute and the SQL and Expand keywords

Output from the following example (with a few print statements added) on a dual core, midrange system processing four million rows in parallel is:

/usr/bin/time BUILD/m64/opengl/ikappa/ikappa sqltest/read.k
number of categories: 4 categories: 1 2 3 4
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
Processing time: 2730.76 (ms)
8.69user 2.00system 0:22.84elapsed 46%CPU (0avgtext+0avgdata 3821920maxresident)k
0inputs+0outputs (0major+233182minor)pagefaults 0swaps

The example output shown here is for a table in star format containing four million rows with, besides the primary key and the category dimension, six other dimension fields and three measure fields (an integer field, a float field, and a double field). This shows less than three seconds for the data transfer to the GPU and less than 23 seconds of total program execution time for a seven dimensional hypercube with three measures and four million data points. The numbers shown correspond to a data transfer bandwidth utilization greater than 45% of maximum bandwidth. These numbers are shown to help you with your sizing–systems with quad core or more processor cores and with higher speed memory transfer components will give correspondingly higher throughput. This is OLAP that you can afford.

The following example shows the combined usage of the SQL and the Expand keywords to dynamically size and run in parallel a task that retrieves data from a SQL data source for processing by a GPU. This example assumes a database table in standard star format named STAR_TABLE that has a field, cat_pk_sid, that is usable for splitting the processing into parallel operations. This field would generally have a foreign key relationship to a master table that defines the permissible values for this field.
This example consists of three Kappa subroutines: sqlio, sqlprocess, and sqlread. The subroutine sqlio is unrolled within the sqlprocess subroutine using the Subroutine statement. The subroutine sqlprocess is expanded in the sqlread subroutine, which is invoked in the main Kappa scheduling script–the expansion also expands the labels in the sqlio subroutine.

The SQL keyword read commands in the sqlio subroutine (and their corresponding select commands in the sqlprocess subroutine) are executed asynchronously. The CUDA/Kernel launches in the sqlio subroutine use the same stream identifier as the corresponding Variable creation statements in the sqlprocess subroutine, and so they execute on the same CUDA streams as the Variables use for data transfer. Since the streams are expanded, the data transfers overlap with other data transfers and kernel launches and, if a suitable (GF100) GPU is being used, the kernel launches execute concurrently.

The SQL operations on the dbhandle_$a handles are expanded and so, if they have an ASYNC=true attribute, run asynchronously in parallel.

This example is able to execute the SQL operations in parallel and the CUDA kernels concurrently at very high speed on commodity multi-core CPU and GF100 hardware.

Example setup

To set up this example, you need to have installed an Apache Portable Runtime database driver for your database as well as the driver client for the database. The Apache Portable Runtime database drivers for FreeTDS, Oracle, PostgreSQL, and unixODBC are installed by the Kappa installer, but you also need the client database driver for the particular database to be used. For this example, you will need the PostgreSQL database client driver in addition to the Apache Portable Runtime database driver.

The example schema is available as sqlstar_example.architect, a file for SQL Power Architect (available as open source and as commercial software). A diagram of the schema is available in the Kappa_User_Guide.pdf or the Kappa Library User Guide online. The ddl_load.k file may be used to create the schema and load the data–this script is meant to stress the Kappa library, not to be the most efficient way to create a schema and load data! The CUDA kernel used with this example (which does nothing) is sqltest.cu.
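As with the other Kappa scripts in this guide, the schema creation and load script can be run with ikappa (assuming the database connection settings described below are in place):

ikappa ddl_load.k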

For this example, the PGPARAMS setting is stored in a configuration file, in the [/Kappa/Settings/USER] section, and looks approximately like:

[/Kappa/Settings/USER]
PGPARAMS=host=cosmos port=5432 dbname=kappat user=pgquery password=mypassword

The sqlread.k example:



<kappa subroutine=sqlio labels='$a' labelset='sql'>
// The main IO loop
!SQL ASYNC=true FAST=true -> read@dbhandle_$a(OUT_$a, #chunk_size, #rows_read_$a);
!CUDA/Kernel STREAM=str_$a OUTPUT_WARNING=false -> sqltest@sqltest(OUT_$a, #rows_read_$a) [ = OUT_$a #rows_read_$a];
</kappa>

<kappa subroutine=sqlprocess labels='$a' labelset='sql'>
!SQL -> connect@dbhandle_$a('pgsql',{PGPARAMS});
!SQL ASYNC=true STAR=true -> select@dbhandle_$a(
    'select pk_sid, dima, dimb, dimc, dimd, dime, dimf, measurea, measureb, measurec from star_table where cat_pk_sid= %u order by dima;',
    $a, Categories, '=%lu %u %u %u %u %u %u +%f %u %lf',
    #num_rows_$a, #num_cols_$a, #row_size_$a);

// Get the number of rows to process at once using an if evaluation.
!Value -> rows_allocate_$a = if ( ( #chunk_size < #num_rows_$a ) , #chunk_size , #num_rows_$a );
!Variable STREAM=str_$a VARIABLE_TYPE=%KAPPA{LocalToDevice} -> OUT_$a(#rows_allocate_$a, #row_size_$a);

// Calculate how many iterations based on the number of rows and
// how many rows to process at once.
!Value -> numloops_$a = ( #num_rows_$a / #chunk_size );

// Perform a synchronization so the #numloops_$a Value is ready
!Synchronize (#numloops_$a);
!Print ('number of loops: ', #numloops_$a, ' = ' , #num_rows_$a, ' / ' , #chunk_size );

!Subroutine LABELSET='sql' UNROLL=true LOOP=#numloops_$a -> sqlio;
!SQL -> disconnect@dbhandle_$a(); // disconnect dbhandle
</kappa>

<kappa subroutine=sqlread>

!CUDA/Module -> sqltest = 'sqltest/sqltest.cu';

//Set the size of the data to process at once
!Value -> chunk_size = 65536;

// Connect to the database and get the categories to use for splitting into parallel processes
!SQL -> connect@dbmain('pgsql',{PGPARAMS});
!SQL -> select@dbmain('select distinct cat_pk_sid from star_table;', '%u', #num_rows_cat, #num_cols_cat, #row_size_cat);
!Variable -> Categories(#num_rows_cat,#row_size_cat);
!SQL -> read@dbmain(Categories,#num_rows_cat,#rows_read_cat);
!SQL -> disconnect@dbmain();

!Value -> cat_indice = Categories;
!Print ( 'number of categories: ', #rows_read_cat, 'categories: ', #cat_indice);
// Synchronize the Value of how many categories so that Expand can use it as an argument
!Synchronize (#rows_read_cat);

// Expand and run the processing in parallel across the categories
!Expand LABELSET=sql -> sqlprocess(#rows_read_cat);

// Unload, cleanup, stop
!CUDA/ModuleUnload -> sqltest;
</kappa>

<kappa>
// Setup the CUDA context and load the CUDA module
!Context -> context;

!Subroutine -> sqlread;

!ContextReset -> Context_reset;
!Stop;
!Finish;
</kappa>

Scheduler Shared Library

The previous example can be output to a C++ CMake project. Using the Process::OutputRoutines method, which may be invoked with the ikappa ‘-o’ option, this creates a C++ file for each of the subroutines sqlio, sqlprocess, and sqlread. At a minimum, it creates the files CMakeLists.txt, sqlio.cpp, sqlprocess.cpp, and sqlread.cpp. Usually the CMakeLists.txt file should be edited to change the project name and the output shared library name. Running CMake and make on the project creates a shared library. Assuming that the project was not changed, running:

cmake .
make

will create:

libKappaRoutines.dylib
libKappaRoutines.0.dylib
libKappaRoutines.0.0.0.dylib
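The generation step itself is a single ikappa invocation using the ‘-o’ option mentioned above. The exact argument form is documented in the User Guide; assuming it takes an output directory, the invocation looks something like:

ikappa -o . sqltest/read.k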

Using a file named subread.k which contains:

<kappa>
!Context -> context;

!Subroutine -> sqlread;

!ContextReset -> Context_reset;
//!Context -> context;
!Stop;
!Finish;
</kappa>

and putting the sqltest.cu file in a sqltest subdirectory lets the subroutines be loaded and executed from the shared library (via the Process::LoadRoutine methods) with the following command:

ikappa -m ./libKappaRoutines.dylib -f sqlio -f sqlprocess -f sqlread subread.k

This loads the sqlio, sqlprocess, and sqlread subroutines and then, using the subread.k file, executes the sqlread subroutine (which calls the other two).

The C++ files could be manually written, but it is usually easier to use a scheduling script to create them.

Kappa Subroutines and Functions

Quick Start coming soon–meanwhile, please read the Kappa_User_Guide.pdf or the Kappa Library User Guide online.

ikappa/example program overview

Quick Start coming soon–meanwhile, please read the Kappa_User_Guide.pdf or the Kappa Library User Guide online.
