Psi Lambda LLC | Parallel computing made practical.



Changes for Kappa Version 1.5.0

Change log

Version 1.5.0 Changes

Fixed Windows distribution to work better on 64 bit Windows.
Update for compatibility with Field Forge™.
Added NullMasks: !GetNullMask and !SetNullMask keywords.
Added RECOMPILE=true and NVCC_OPTIONS=string to !Cuda/Module.
Added !Kappa/Routine -> Load('routine','shared_library'); immediate keyword.
Fixed two argument form of Variable command.
Fixed the Synchronize command to execute IfCancel and IfFail.
Fixed downstream IF_CANCEL and IF_FAIL processing.

Version 1.4.0 Changes

Updated for compatibility with CUDA 3.2.
Device address space forced to 64 bit on 64 bit hosts.
Texture references as arguments marked not operational and deprecated.
Fixed dependency processing of items that are in map but not argument list:
implicit arguments such as textures, surfaces, module constant variables, etc.
Added Surface keyword for binding arrays to surfaces.
Added Context/Attributes keyword that sets Values for:
MemoryFree, MemoryTotal, MemoryUsed, APIVersion, DriverVersion, CacheConfig,
DeviceID, ThreadStackSize, PrintfFIFOSize, HeapSize

Version 1.3.2 Changes

Added IOCallback abstract class that allows IO callbacks by class instantiation.
Added new Process::RegisterIOCallback for using these new classes.
Added examples of using the IOCallback class and the ExceptionHandler class to ikappa and wkappa examples.
Made Kappa exception handling more robust.

Version 1.3.1 Changes

Added .NET KappaCUDAnet and C# and Visual Basic examples.
Fixed base_path for Kappa Instance to be more robust under garbage collection.
Fixed OutputRoutine to output code for original name and labels.
Changed KappaParser to ignore carriage returns.
Fixed Windows version of Lock::Wait timeout handling.
Fixed DoNotExecute to still execute Stop and Finish keywords.
Java JNI is now alpha status (should work, but there is no example and it is not yet tested).
Fixed Resource waiting on command set deallocation error.

Version 1.3.0 changes

SQL keyword added

This currently supports the APR database driver.
The APR database driver supports:
FreeTDS (MSSQL and Sybase)
The SQL keyword allows using SQL data sources to write to and read from Kappa Variables. Format strings specify the (binary) Variable layout mapping to/from database record fields. Full parallel transaction control is implemented. It is possible to subclass the kappa/DBD class to have the SQL keyword use your own driver instead of the APR database driver.
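Kappa's format strings are specific to the library, but the underlying idea, mapping database record fields to and from a fixed binary row layout, can be sketched with Python's struct module. The record layout below (a 64-bit primary key, two unsigned dimensions, one double measure) is hypothetical, chosen only to illustrate the packing round trip:

```python
import struct

# Hypothetical record layout: unsigned 64-bit primary key, two 32-bit
# unsigned dimensions, one double measure -- packed into fixed-width rows.
ROW_FORMAT = "=QIId"            # '=' selects standard sizes, no padding
ROW_SIZE = struct.calcsize(ROW_FORMAT)

def pack_rows(records):
    """Pack (pk, dim_a, dim_b, measure) tuples into one contiguous buffer."""
    buf = bytearray(ROW_SIZE * len(records))
    for i, rec in enumerate(records):
        struct.pack_into(ROW_FORMAT, buf, i * ROW_SIZE, *rec)
    return bytes(buf)

def unpack_rows(buf):
    """Recover the records from a packed buffer."""
    return [struct.unpack_from(ROW_FORMAT, buf, off)
            for off in range(0, len(buf), ROW_SIZE)]

rows = [(1, 10, 20, 1.5), (2, 11, 21, 2.5)]
assert unpack_rows(pack_rows(rows)) == rows
```

A contiguous fixed-width buffer like this is what makes a single bulk transfer of many records to the GPU possible.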

The SQL keyword supports star database layout for OLAP. It has built-in support for separate dimension and measure handling. Dimensions may be any data type–it has automatic support for assigning unsigned integer numbers as dimension labels for easier CUDA OLAP processing. The SQL keyword also supports using a numeric primary key for easy and compact row tracking and modification.
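The automatic assignment of unsigned integer labels to dimension values is internal to the SQL keyword; as a rough illustration of the idea (not Kappa's implementation), it amounts to interning each distinct dimension value as a compact integer so GPU kernels can index on it:

```python
def label_dimension(values):
    """Assign consecutive unsigned integer labels to distinct dimension
    values, in first-seen order, so kernels can work with compact ints."""
    labels = {}
    for v in values:
        if v not in labels:
            labels[v] = len(labels)
    return labels

cities = ["Oslo", "Lima", "Oslo", "Kyoto", "Lima"]
labels = label_dimension(cities)
encoded = [labels[c] for c in cities]
# labels == {"Oslo": 0, "Lima": 1, "Kyoto": 2}; encoded == [0, 1, 0, 2, 1]
```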

ASYNC attribute added for SQL and C/Kernel keywords

The ASYNC=true attribute allows the C/Kernel and the SQL Select, Read, Write, and Execute keyword commands to execute asynchronously. Note that the CUDA APIs do not allow access to most functionality from a C/Kernel if the ASYNC=true attribute is used.
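The payoff of asynchronous SQL reads is that the read of the next chunk overlaps the processing of the current one. The sketch below is not Kappa's scheduler; it only illustrates the overlap pattern with a thread pool and stand-in read/process functions:

```python
from concurrent.futures import ThreadPoolExecutor

def read_chunk(handle, n):
    # Stand-in for an asynchronous SQL read; returns n fake rows.
    return [(handle, i) for i in range(n)]

def process(rows):
    # Stand-in for a kernel launch consuming the previous chunk.
    return len(rows)

# Overlap: while chunk k is being processed, chunk k+1 is already reading.
with ThreadPoolExecutor(max_workers=2) as pool:
    pending = pool.submit(read_chunk, 0, 4)
    processed = 0
    for k in range(1, 4):
        nxt = pool.submit(read_chunk, k, 4)     # start the next read early
        processed += process(pending.result())  # consume the current chunk
        pending = nxt
    processed += process(pending.result())
# All 4 chunks of 4 rows are processed: processed == 16
```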

Added Expand keyword and subroutine labels and labelsets

This allows creating subroutines as (tensor) indexed components that can be expanded dynamically at runtime. This gives a real, practical implementation of true concurrent kernel execution and algorithm step sizing. Properly used, this allows maximum occupancy and use of the GPU and CPU.

Labels can be placed in attribute values (such as stream ids), module names, kernel names, Value names, kernel arguments, and Variable names–among other places. These labels are then expanded (at runtime) with numeric ranges or Indices. Numeric ranges and Value Indices can be sliced from Variables using prior Kappa Library features. This allows for automatic parallelism and sizing using runtime data in a natural (tensor) index component manner. These labels can be used to create parallel execution dependency streams, vary across GPU or CPU, select/split/slice datasets for parallelism, select kernels, perform data parallel combinatoric expansions, etc.
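Kappa expands labels within its own script syntax, but as a rough illustration (not Kappa's implementation), expanding a label such as $a over an index range resembles template substitution, turning one templated command into one concrete command per index:

```python
import string

def expand(template, label, values):
    """Expand a $-label template over index values, yielding one concrete
    command per value (mirroring how 'str_$a' becomes 'str_0', 'str_1', ...)."""
    tmpl = string.Template(template)
    return [tmpl.substitute({label: v}) for v in values]

cmds = expand("!CUDA/Kernel STREAM=str_$a -> step@mod(OUT_$a);", "a", range(3))
# cmds[0] == "!CUDA/Kernel STREAM=str_0 -> step@mod(OUT_0);"
```

Because each expanded instance names its own stream, variables, and values, the instances carry independent dependency chains and can be scheduled in parallel.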


Added OUTPUT_WARNING=false attribute for CUDA/Kernel.

Added NVIDIA 3.1 API STACK_SIZE attribute to CUDA/Kernel to set the stack size limit for GPU threads (this gets reset when the context is reset).

Added NVIDIA 3.1 API PRINTF_FIFO_SIZE attribute to CUDA/Module to set the printf FIFO size limit.

Added underscore and dollar sign to the set of allowed characters in names and arguments (and dollar sign is no longer an ‘expression’ character).

Various fixes.

Example using the new features in Kappa version 1.3:

Output from the following example (with a few print statements added) on a dual core, midrange system processing four million rows in parallel is:

/usr/bin/time BUILD/m64/opengl/ikappa/ikappa sqltest/read.k
number of categories: 4 categories: 1 2 3 4
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
Processing time: 2730.76 (ms)
8.69user 2.00system 0:22.84elapsed 46%CPU (0avgtext+0avgdata 3821920maxresident)k
0inputs+0outputs (0major+233182minor)pagefaults 0swaps

The example output shown here is for a table in star format containing four million rows and, besides the primary key and the category dimension, six other dimension fields and three measure fields consisting of an integer field, a float field and a double field. This shows less than three seconds for the data transfer to the GPU and less than 23 seconds total program execution time for a seven dimensional hypercube with three measures with four million data points. The numbers shown correspond to a data transfer bandwidth utilization greater than 45% of maximum bandwidth. These numbers are shown to help you with your sizing–using systems with quad core or more processor cores and with higher speed memory transfer components will give correspondingly higher throughput. This is OLAP that you can afford.

The following example shows the combined usage of the SQL and the Expand keywords to dynamically size and run in parallel a task to retrieve data from a SQL data source for processing by a GPU. This example assumes a database table in standard star format named STAR_TABLE that has a field, cat_pk_sid, that is usable for splitting the processing into parallel operations. This field would generally have a foreign key relationship to a master table that defines the permissible values for this field.
This example consists of two Kappa subroutines: sqlio and sqlprocess. The subroutine sqlio is unrolled within the sqlprocess subroutine using the Subroutine statement. The subroutine sqlprocess is expanded in the main Kappa scheduling script—it also expands the labels in the sqlio subroutine.
The SQL keyword read commands in the sqlio subroutine (and their corresponding select commands in the sqlprocess subroutine) are executed asynchronously. The CUDA/Kernel launches in the sqlio subroutine use the same stream identifier as the corresponding Variable creation statements in the sqlprocess subroutine and so they execute on the same CUDA streams as the Variables use for data transfer. Since the streams are expanded, the data transfers are overlapping with other data transfers and kernel launches and, if a suitable (GF100) GPU is being used, the kernel launches give concurrent kernel execution.
The SQL operations on the dbhandle_$a are expanded and so, if they have an ASYNC=true attribute, run asynchronously in parallel.
This example is able to execute the SQL operations in parallel and the CUDA kernels concurrently at very high speed on commodity multi-core CPU and GF100 hardware.

<kappa subroutine=sqlio labels='$a' labelset='sql'>
// The main IO loop
!SQL ASYNC=true FAST=true -> read@dbhandle_$a(OUT_$a, #chunk_size, #rows_read_$a);
!CUDA/Kernel STREAM=str_$a OUTPUT_WARNING=false -> sqltest@sqltest(OUT_$a, #rows_read_$a) [ = OUT_$a #rows_read_$a];
</kappa>

<kappa subroutine=sqlprocess labels='$a' labelset='sql'>
!SQL -> connect@dbhandle_$a('pgsql',{PGPARAMS});
!SQL ASYNC=true STAR=true -> select@dbhandle_$a('select pk_sid, dima, dimb, dimc, dimd, dime, dimf, measurea, measureb, measurec from star_table where cat_pk_sid= %u order by dima;', $a, Categories, '=%lu %u %u %u %u %u %u +%f %u %lf', #num_rows_$a, #num_cols_$a, #row_size_$a);

// Get the number of rows to process at once using an if evaluation.
!Value -> rows_allocate_$a = if ( ( #chunk_size < #num_rows_$a ) , #chunk_size , #num_rows_$a );
!Variable STREAM=str_$a VARIABLE_TYPE=%KAPPA{LocalToDevice} -> OUT_$a(#rows_allocate_$a, #row_size_$a);

// Calculate how many iterations based on the number of rows and
// how many rows to process at once.
!Value -> numloops_$a = ( #num_rows_$a / #chunk_size );

// Perform a synchronization so the #numloops_$a Value is ready
!Synchronize (#numloops_$a);
!Print ('number of loops: ', #numloops_$a, ' = ' , #num_rows_$a, ' / ' , #chunk_size );

!Subroutine LABELSET='sql' UNROLL=true LOOP=#numloops_$a -> sqlio;
!SQL -> disconnect@dbhandle_$a(); // disconnect dbhandle
</kappa>

// Setup the CUDA context and load the CUDA module
!Context -> context;
!CUDA/Module -> sqltest = 'sqltest/';

//Set the size of the data to process at once
!Value -> chunk_size = 65536;

// Connect to the database and get the categories to use for splitting into parallel processes
!SQL -> connect@dbmain('pgsql',{PGPARAMS});
!SQL -> select@dbmain('select distinct cat_pk_sid from star_table;', '%u', #num_rows_cat, #num_cols_cat, #row_size_cat);
!Variable -> Categories(#num_rows_cat,#row_size_cat);
!SQL -> read@dbmain(Categories,#num_rows_cat,#rows_read_cat);
!SQL -> disconnect@dbmain();

!Value -> cat_indice = Categories;
!Print ( 'number of categories: ', #rows_read_cat, 'categories: ', #cat_indice);
// Synchronize the Value of how many categories so that Expand can use it as an argument
!Synchronize (#rows_read_cat);

// Expand and run the processing in parallel across the categories
!Expand LABELSET=sql -> sqlprocess(#rows_read_cat);

// Unload, cleanup, stop
!CUDA/ModuleUnload -> sqltest;
!ContextReset -> Context_reset;

Version 1.2.0 of the Kappa Library has the following changes from prior versions:

This version is compiled with the CUDA Toolkit 3.1 but does not yet support (writable) surfaces or changing the limits for kernel printf (cuCtxSetLimit).

Changes made in this version include:

  • Configurable stream pool size.
  • Added ExpandRoutine to Process and an EXPAND attribute to the Subroutine keyword, allowing subroutines to be treated as expandable macros.
  • Performance tweaks for concurrent kernel execution.
  • Process now always assigns and appends a unique number to unique names.
  • GF100 (FERMI) regression test fixes:
    * More robust lock handling on exceptions.
    * Kappa checked by Valgrind.
  • KappaParser changed to be more friendly:
    * Allow functions with empty parenthesis.
    * Skip less input on a syntax error.
    * Tokenize strings prior to parsing and allow escaped quotes.
  • Added IF_CANCEL, IF_FAIL, and IF_FINISH general attributes: the scheduler now allows commands to launch on cancel or fail. Commands must be rewritten to use this functionality–the Command base class defaults to disabling it.
  • Commands now have a PriorStatus (set by the scheduler) that reports the previous resource producers’ status.
  • Directions for arguments may now be Direction_None, Direction_Cancel, or Direction_Fail for commands executing after a cancel or fail.