Psi Lambda LLC ψ(λκ) Kappa Library User Guide

Overview

Most people are familiar with procedural programming, which is also known as imperative programming. This consists of specifying the steps, in order, that a program must take to reach the desired goal. This form of programming does not work well with parallel computing hardware such as multiple CPUs or GPUs, which execute calculations simultaneously (in parallel). Instead of a “function” that specifies the steps, in order, that a program executes, parallel computing has kernels that take a set of data and simultaneously, in parallel, execute a series of calculations across the set of data. These calculations can either change the set of data where it was originally stored (in memory) and/or write the results to some other (memory) storage. (One other difference is that kernels do not return a single value but instead return data by modifying one or more of the passed arguments.)

Kappa needs to keep track of the relationships for how data flows through a process so that it can be sure to schedule steps in the correct order. The arguments for C++ and CUDA functions and kernels do not specify whether a given argument is used for input, for output, or for both input and output. If the argument is a literal value such as a number or a string, then it has to be an input, but in other cases, specifically if the argument is a pointer or reference, there is no information about the argument's use and therefore no way to track the relationships between arguments of kernels and the possible flow of data.

This method for tracking data for parallel computation is usually called Producer/Consumer resource tracking. The data objects are considered resources and the processing steps are considered as producers and/or consumers of these resources. The idea is that each step in a process is either a producer of a particular resource, a consumer of a particular resource, or both a consumer and a producer if it takes a resource and modifies it.

So that Kappa can track the relationships of arguments and ensure the proper flow of data in scheduled processing steps, Kappa introduces a new notation. To help understand this notation, examples will be discussed that refer to some hypothetical kernel, K, and data sets A, B, and C. These data sets should be thought of as storage locations such as a filename or a memory storage address—the data at those locations may change but the name of the location (the filename) does not.

Here are the examples:
A is the argument to K. The data set A is transformed, in place, by the kernel, K.
A, B, and C are the arguments to K. The data set C is written to by the kernel, K, using A and B as inputs.
A, B, and C are the arguments to K. The data set C is written to by the kernel K, using A, B, and C as inputs.
A, B, and C are the arguments to K. The data sets A, B, and C are written to by the kernel K.
Kappa will use the following notation to show these relationships:
K(A) [ A = A ] or K(A) [A]
K(A,B,C) [C = A B ]
K(A,B,C) [ C = A B C ]
K(A,B,C) [ A B C = ]
where, within the parentheses, other arguments that are not relevant to the proper execution order may also be given:
K(A,4,5) [A]
K(A,B,C, 42, “a string of text”) [ C = A B ]
K(A,B,C, (void *)cpp_class, 50) [ C = A B C ]
and so on.

This notation provides the information needed to schedule calculations in the proper order—to define a process to accomplish (one or more) tasks. For example, statements like:
Variable A;
kernel1 (A) [A];
kernel2 (A,B) [ B = A ];
give a flow of the data of:
A -> A -> B
where the data set A is declared (produced), kernel1 transforms data set A in place (takes the data in A, changes it, and writes it back to data set A, consuming A and producing a new A), and then kernel2 creates data set B from data set A (consumes A and produces B).

In the preceding discussion, it did not matter what the data in the data sets were or what the kernels were doing to the data—it could have been rocket science, biology, or grocery lists. In the following example, it will be two matrices being multiplied—but think of combining grocery lists if that helps.
Don't worry about understanding the details of the syntax of this example yet. It is given so that you can start comparing Kappa syntax with the discussion and so that it starts to become familiar to you. This manual has plenty of pages describing the details of the syntax. Use the following pseudo code instead if you want:

Start a CUDA context.
Load the C and CUDA modules.
Create variables A and B and initialize them.
Create the variable C.
Multiply A and B and get C using the matrixMul kernel from the CUDA module matrixMul.

The real Kappa process example is the following:
<kappa>
// Start a CUDA context
!Context -> context;
// Load the C and CUDA modules
// (The file paths for {CMODULE} and {CUDAMODULE} must be setup
// in the configuration files.)
!C/Module -> testmodule={CMODULE};
!CUDA/Module -> matrixMul = {CUDAMODULE};
// Create variables A and B and initialize them
// (The value for %sizeof{float} must be in the sizeof.conf configuration file)
!Variable -> A(48,80,%sizeof{float});
!C/Kernel MODULE='testmodule' -> randomInit (A,3840) [A];
!Variable -> B(128,48,%sizeof{float});
!C/Kernel MODULE='testmodule' -> randomInit (B,6144) [B];
!Variable -> C(128,80,%sizeof{float});
// Multiply A and B and get C
!CUDA/Kernel GRID=[ 8, 5 ] BLOCKSHAPE=[ 16, 16 ]
SHAREDMEMORY=( 2 * 16 * 16 * %sizeof{float} )
-> matrixMul@matrixMul(C,A,B,48,128) [ C = A B ];
</kappa>

This example shows not only that C is dependent on A and B, but also that the matrixMul kernel is dependent on the matrixMul module and that the randomInit kernel is dependent on the testmodule module. Kappa will schedule execution in the dependency order and will cancel execution of all dependent statements if there is an execution failure such as an inability to load a module file. Note that there is an implicit dependency of modules, variables, and kernels on the context.

The developer must give the statements to Kappa in the correct dependency order so that Kappa can infer the correct dependency relationships when scheduling execution. If there is no dependency between statements, then Kappa will still schedule them in the order given, but they may execute in some other order. For example, the allocation and initialization of the A and B variables could happen in any order or simultaneously, as long as each variable's allocation precedes its initialization. Kappa Variable, Array, and Texture object types are not distinguished when it comes to resource dependency names. In other words, make sure that the names are different regardless of whether the object is a Variable, an Array, or a Texture.
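For instance, these statements from the first example form two independent allocation-and-initialization pairs. A and B share no resource names, so Kappa may overlap the two pairs, while each randomInit call still waits for its own Variable's allocation:
<kappa>
// The A pair and the B pair have no dependency on each other:
!Variable -> A(48,80,%sizeof{float});
!C/Kernel MODULE='testmodule' -> randomInit (A,3840) [A];
!Variable -> B(128,48,%sizeof{float});
!C/Kernel MODULE='testmodule' -> randomInit (B,6144) [B];
</kappa>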

Some other features that made the above example possible are worth pointing out: Kappa allocates the requested host and/or device memory when required and executes the appropriate CUDA data copy routines, as needed, to make sure that the correct data is in the correct host or device memory location prior to executing either C or CUDA kernels.

This is a good time to explain the general syntax of the Kappa process language given in the above and subsequent examples. See the Kappa Process Language section for complete details. Kappa Process language statements must be enclosed in paired tags:

<kappa>

...Kappa language statements

</kappa>
Almost all Kappa process statements start with an exclamation mark, '!', followed by a keyword, and end with a semicolon. The one exception is the decision statement: it starts with a question mark, '?', has no keyword, but still ends with a semicolon. It is technically possible to put spaces between the exclamation mark and the keyword—Kappa will be forgiving and allow this—but it is best form to have the exclamation mark immediately before the keyword so that the keywords are easy to see. For the Kappa process statements that start with an exclamation mark and a keyword, there may be attributes, followed by an arrow, '->'. If there are no attributes for a statement, then the arrow is sometimes optional. The remainder of a Kappa process statement takes different forms depending on what it is for. All of the allowed forms are listed in the Kappa Process Language syntax section.
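The statements from the first example already show both of these shapes: a keyword statement with no attributes, and a keyword statement with an attribute before the arrow:
<kappa>
// No attributes; the arrow immediately follows the keyword:
!Context -> context;
// An attribute (MODULE) between the keyword and the arrow:
!C/Kernel MODULE='testmodule' -> randomInit (A,3840) [A];
</kappa>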

The above example is incomplete for cases where subsequent processing would occur. While Kappa automatically cleans up all allocated resources when it exits, it is good practice to free these resources as soon as possible. This means that statements to free the variables, unload the modules, and/or reset the context should also be given:
<kappa>
!Free -> A;
!Free -> B;
!Free -> C;
!CUDA/ModuleUnload -> matrixMul;
!C/ModuleUnload -> testmodule;
!ContextReset -> Context_reset;
</kappa>

It is also a good practice to let the Kappa background process know that no further commands will be queued for execution so that it can clean up and exit:
<kappa>
!Stop;
!Finish;
</kappa>

The Kappa process example just given shows that C kernels may be intermixed with CUDA kernels. Even though they are called C kernels and must be declared as 'extern “C”', the C kernels may be C++ kernels and, if the compiler supports it, may implement OpenMP parallelism. Kappa supports a variety of ways of using C++ kernels that provide different benefits, such as extending the keywords that are recognized or allowing for easier callbacks. Extending the keywords of Kappa is meant to allow extension of Kappa functionality but, of much more importance, to allow for creating keyword commands that support a particular subject field. The callback functionality is meant to allow for easy data access between the Kappa Process and the hosting program.

The synchronous C kernels execute on the background process thread that is usually tied to a CUDA context. This means that the synchronous C kernels have the ability to call all of the Kappa lower level API or CUDA driver functions, CUBLAS, CUFFT or other CUDA related libraries or functions such as OpenGL or Direct3D in addition to being able to call database or other functions and libraries.

A Kappa Process uses the program thread it is invoked on to parse, prepare for execution, and determine dependencies. It creates a separate, background process thread for scheduling and execution. This background process thread is the thread associated with a CUDA context—not the original program thread. Kappa synchronous C kernels, commands, Keyword commands, IOCallbacks, exceptions, and scheduling status handling also run on this background process thread and have access to the CUDA context. These must all return in a reasonably timely fashion so that other scheduling, execution, exception handling, etc. may occur.

The previous process example is not very realistic in the sense that the sizes of the Variable objects, the grids, block shapes, and shared memory are all static numbers. Kappa has full support for dynamic sizing calculations at process execution time: these parameters can be accessed as dynamic Value objects, calculated at execution time, and passed to kernels. It also supports evaluation of values and canceling execution of scheduled dependencies based on these dynamic values. For proper dependency schedule execution, these dynamic values must be stated as dependencies in C and CUDA kernel calls, in the same way that Variable objects are.

The values for the dimensions of the A variable could be calculated as shown in the following example:
// Get BLOCK_SIZE value from configuration
!Value -> BLOCK_SIZE = %USER{BLOCK_SIZE};
This !Value statement creates a new value called BLOCK_SIZE and gets the numeric value for it from a configuration file that contains:
[/Kappa/Settings/USER]
CMODULE_PATH=commandtest/TestModule/.libs/
CMODULE={CMODULE_PATH}libTestModule.so
CUDAMODULE=matrixMul_kernel.cu
BLOCK_SIZE= %{BLOCK_SIZE}
[/Kappa/Translation/USER]
BLOCK_SIZE= 16
The “%USER{BLOCK_SIZE}” can be understood to be a translation (denoted by the '%' or percent sign) in the "/Kappa/Translation/USER" section of a configuration file that contains a label “BLOCK_SIZE”, which in this case has the value 16. “%{BLOCK_SIZE}” and “%/Kappa/Translation/USER{BLOCK_SIZE}” also refer to the same thing, since the default path is "/Kappa/Translation" and the default section is “USER”. (The “%sizeof{float}” that was used in the very first example can now be seen to be looking for a label called “float” in the "/Kappa/Translation/sizeof" section of a configuration file and substituting its content.)
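Assuming the configuration shown above, any of these three equivalent forms could therefore be used to set the BLOCK_SIZE value:
<kappa>
// Explicit section, default path:
!Value -> BLOCK_SIZE = %USER{BLOCK_SIZE};
// Default path and default section:
!Value -> BLOCK_SIZE = %{BLOCK_SIZE};
// Fully qualified path and section:
!Value -> BLOCK_SIZE = %/Kappa/Translation/USER{BLOCK_SIZE};
</kappa>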

The statement:
!Value -> WA =(3 * {BLOCK_SIZE}); /* Matrix A width */
creates a value named WA which uses the {BLOCK_SIZE} USER setting
“BLOCK_SIZE=%{BLOCK_SIZE}”. This shows that configuration values can refer to other configuration values—in this case the “%{BLOCK_SIZE}” configuration translation value that was previously discussed. (If configuration values refer to each other in a circular manner so that they could never be resolved, Kappa will currently give up trying after 128 attempts.) In this example, the WA value multiplies BLOCK_SIZE by three to get a numeric value of 48.

The statement:
!Value -> HA = (5 * #BLOCK_SIZE); /* Matrix A height */
illustrates using the previously defined Value BLOCK_SIZE. Unless it is unambiguous that the name refers to a Value and not a Variable or some other object, a reference to a Value must contain a pound sign, '#', to denote that it is a Value. When defining a value, the name given, such as “HA” in the current example, obviously refers to a value, so the pound sign is unnecessary.

The configuration file shown above also defines the {CMODULE} and {CUDAMODULE} configuration values used in the very first process example.

To clarify the difference between a Value and a Variable: Variable objects in Kappa are blocks of memory that usually hold multidimensional arrays of data but can contain anything that is consistent with their usage by the C and CUDA kernels that use them. Variable objects can be both input and output arguments to the C and CUDA kernels (a Variable can be a CUDA texture, but the current version of CUDA only supports textures as inputs). They are automatically allocated and their data contents copied between host and device memory as needed. They can be host memory allocated with malloc, graphics resources (OpenGL or Direct3D) mapped by CUDA, or, more generally, host or device memory allocated by CUDA and used as GPU memory, module variables, and textures.

Dynamic Value objects are arranged in hierarchical namespaces within a Kappa Process Namespace object and are only allowed as inputs to CUDA kernels while they can be inputs and outputs for C kernels. (Note, however, that Kappa supplies statements to create integer, float, or vector integer (Indices) Value objects from the contents of Variable objects and that command::Keyword or IOCallback functions have no constraints at all on what they can do with Value objects or Variable objects except those imposed by the current CUDA driver API.)

For further information about what a developer can do with Variable objects and Value objects, please see the command::Keyword keyword/CSV source code example. It provides an illustration of how it is possible to extend the current capabilities of Kappa or provide subject area specialization. The keyword/CSV example provides a new keyword for reading csv data (in this example, with the CSV_DATA_TYPE attribute set to float and the file format set to CSV_TAB_DELIMITED) into a newly created variable, csv_variable, and returns the dimensions in the dynamic values, cols and rows:
!CSV CSV_DATA_TYPE=%KAPPA{FLOAT} CSV_TAB_DELIMITED=true -> get(csv_variable,#cols,#rows,'test.csv');
The function name, “get”, is not necessary in the example provided and could be omitted. Alternatively, it could be used by a different version of the CSV keyword to indicate whether to read or write the data from the file.

The CSV example illustrates that dynamic sizing is possible with Kappa. Also, please note that if the CSV command sets a canceled or failed status, then any subsequent scheduled commands that depend on its variable or values will have their execution canceled, even if they are already prepared and queued for execution in the process.
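As a sketch of that cancellation behavior, consider a hypothetical C kernel, normalize, that consumes csv_variable and the dynamic values. (The kernel name is made up for illustration; the dependency brackets list the dynamic values alongside the Variable, as described earlier.)
<kappa>
// Canceled automatically if the !CSV command sets a canceled or failed status:
!C/Kernel MODULE='testmodule' -> normalize(csv_variable,#cols,#rows) [ csv_variable = csv_variable #cols #rows ];
</kappa>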

The final general topics worth mentioning in an overview are:
