CS414 Assignment 7: CUDA Programming + Memory Management

The objective of this assignment is to expose you to a very different type of concurrency found in high performance computing: data parallelism and give you experience with one particular language that is growing in popularity: CUDA. The advent of inexpensive commodity hardware (GPUs) that can be programmed simple C extensions (CUDA) allows the mass market to take advantage of data-parallel computing, and many software vendors are adding GPU acceleration using CUDA.

Due 10pm, Thursday, Nov. 20

Introduction

CUDA

For this assignment, we will be using NVIDIA's CUDA, which allows programmers to use a C-like language to write programs for execution on a GPU. More information can be found in the Programming Guide and API Reference. You may also find this overview article from ACM Queue to be helpful.

We have provided you with an example CUDA program that computes the sum of two vectors. (The preceding link points to the updated version; the old version is still available if you need it, here.) It computes the sum on the GPU and on the CPU and then compares the two result vectors to check for correctness. It also reports performance statistics.

The actual kernel code is very simple; each thread simply sums a single element from each of the two input vectors and writes the result into the output vector.
__global__ void add_vectors_kernel(float *A, float *B, float *C, int N) {
	// Determine which element this thread is computing
	int block_id = blockIdx.x + gridDim.x * blockIdx.y;
	int thread_id = blockDim.x * block_id + threadIdx.x;
	
	// Compute a single element of the result vector (if the element is valid)
	if (thread_id < N) C[thread_id] = A[thread_id] + B[thread_id];
}

[kernel code for the vector addition example]

The code to prepare the GPU for execution and invoke the kernel is actually significantly longer than the kernel itself. It first transfers the input vectors to GPU memory, then launches the kernel, and finally transfers the result vector back to CPU memory.
// Returns the vector sum A + B (computed on the GPU)
float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {

	// Allocate GPU memory for the inputs and the result
	int vector_size = N * sizeof(float);
	float *A_GPU, *B_GPU, *C_GPU;
	cudaMalloc((void **) &A_GPU, vector_size);
	cudaMalloc((void **) &B_GPU, vector_size);
	cudaMalloc((void **) &C_GPU, vector_size);
	
	// Transfer the input vectors to GPU memory
	cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
	cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);
	
	// Determine the number of thread blocks in the x- and y-dimension
	int num_blocks = (int) ((float) (N + threads_per_block - 1) / (float) threads_per_block);
	int max_blocks_per_dimension = 65535;
	int num_blocks_y = (int) ((float) (num_blocks + max_blocks_per_dimension - 1) / (float) max_blocks_per_dimension);
	int num_blocks_x = (int) ((float) (num_blocks + num_blocks_y - 1) / (float) num_blocks_y);
	dim3 grid_size(num_blocks_x, num_blocks_y, 1);
	
	// Execute the kernel to compute the vector sum on the GPU
	add_vectors_kernel <<< grid_size , threads_per_block >>> (A_GPU, B_GPU, C_GPU, N);
	
	// Allocate CPU memory for the result
	float *C_CPU = (float *) malloc(vector_size);
	
	// Transfer the result from the GPU to the CPU
	cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);
	
	// Free the GPU memory
	cudaFree(A_GPU);
	cudaFree(B_GPU);
	cudaFree(C_GPU);
	
	return C_CPU;
}
[host code for the vector addition example]

Once you have downloaded and uncompressed the source files, run make to compile the program. Run the program by typing ./vector_add and optionally provide the number of elements in each vector. For example, running ./vector_add 30000000 might produce the following output:
Vector generation: 2.29549 sec

GPU:      Transfer to GPU: 0.37464 sec
         Kernel execution: 0.00732 sec
        Transfer from GPU: 0.18250 sec
                    Total: 0.58041 sec

CPU: 0.19306 sec

CPU outperformed GPU by 3.01x

All values correct

[example output of vector addition program]

You may notice something interesting about the performance results. Even though the actual computation time (Kernel execution above) on the GPU is about 25 times faster than the computation time on the CPU, the overall execution time on the CPU is about 3 times faster. This is because the execution time of the GPU version is dominated by the overhead of copying data between the CPU and GPU memories. This is an important lesson for CUDA developers: it only makes sense to execute something on the GPU when there is a significant amount of computation being performed on each data element.


Problem 1 - Find the Maximum

Problem: Write a CUDA program that, given an N-element vector, finds the largest element.

Your solution should take as input N and generate a randomized vector V of length N (as in the example program described above). It should then compute the maximum value in V on the CPU and on the GPU. The program should output the two computed maximum values as well as the time taken to find each value.

The following is an example of the expected program output (with N = 100,000,000 in this case):
Vector generation: 7.29615 sec

GPU:      Transfer to GPU: 0.59613 sec
         Kernel execution: 0.06669 sec
        Transfer from GPU: 0.00002 sec
                    Total: 0.68033 sec

CPU: 0.33637 sec

CPU outperformed GPU by 2.02x

GPU max: 224110944.000000
CPU max: 224110944.000000

[example output of maximum finding program]

Your goal is to find the maximum using CUDA as efficiently as you can. For the purposes of this assignment, we will ignore the overhead of copying the vector to the GPU. Thus, your goal is to minimize the rest of the execution time of the GPU solution. In other words, you should try to minimize the sum of the Kernel execution and Transfer from GPU times.

Hints:
  1. Make a copy of the vector_add source directory and modify this as a template for your new program. (Don't forget to give the files suitable names.)
  2. Remember that threads in different thread blocks cannot directly communicate. Start by assuming that the total number of elements in the vector is less than the number of threads in one thread block. After you have a correct solution for this case, try to extend it to support arbitrary vector sizes.
  3. When threads within the same thread block need to communicate, use shared memory rather than global memory. Accessing shared memory is orders of magnitude faster than accessing global memory. Remember also to synchronize all of the threads within a thread block using the __syncthreads() function where necessary.
  4. FLT_MIN (defined in float.h) is the smallest possible single-precision floating point number. This can be useful when you are trying to find the maximum value across an entire thread block but not every thread has a valid value. Similarly, FLT_MAX is the largest possible single-precision floating point number.

Problem 2 - Max, Min, and Mean

Problem: Extend your solution to the previous problem so that it efficiently finds and reports the following statistics:
  1. The maximum element in the vector
  2. The minimum element in the vector
  3. The arithmetic mean of the vector (the sum of all of the vector elements divided by the number of elements in the vector)
The following is an example of the expected program output (with N = 100,000,000 in this case):
Vector generation: 7.28857 sec

GPU:      Transfer to GPU: 0.53311 sec
         Kernel execution: 0.08253 sec
        Transfer from GPU: 0.00006 sec
                    Total: 0.63385 sec

CPU: 0.62369 sec

CPU outperformed GPU by 1.02x

GPU max: 224110944.000000
CPU max: 224110944.000000

GPU min: 0.000000
CPU min: 0.000000

GPU mean: 224110944.000000
CPU mean: 224110944.000000

[example output of maximum, minimum, and mean finding program]

Notice that, because the amount of computation per data element has increased, the CPU and GPU implementations have approximately equal performance. If we were to increase the amount of computation further, perhaps by computing another statistical measure such as the standard deviation, we would expect the GPU implementation to provide better performance than the CPU implementation.

Hints:
  1. Rather than creating one kernel for each of the three statistics, try to compute all of the statistics in a single kernel. Computing the statistics concurrently can significantly improve performance, because the overhead of launching a kernel is non-negligible. For extra credit, do it both ways (separately and together) and measure the difference.
  2. Recall that, even though the addition of real numbers is commutative and associative, in general floating-point addition is not. Thus, it is possible (and even likely) that computing the mean on the GPU will produce a slightly or even significantly different result than on the CPU. In fact, it is likely that the mean value computed by the GPU will be significantly more accurate for very large vector sizes. (These precision issues are one reason why debugging parallel programs is hard!)

Submission Instructions

You will work on the 002a lab machines in linux (devlab1 - devlab13.cs.virginia.edu). I suggest to access the machines remotely via ssh or similar. To spread yourselves out among the machines, I suggest to log into devlabX, where X = first letter of your last name MOD 13. If a machine is down, try another one and let me know.

You will need to get a CS departmental account.

Important: You will need to make sure your shell knows which paths to look in libraries. Make sure your LD_LIBRARY_PATH includes /usr/local/cuda/lib -- depending on your shell, the syntax for doing this will vary. For bash users, one way to do this is:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib
But you have to do this everytime you log in. You may want to add /usr/local/cuda/lib to your .profile or .bashrc. If you are encountering problems with other libraries, the error message usually tells you which paths are missing. You can also run
ldd [executable_file]
to see which libraries your executable uses and which ones it cannot find.

To submit this assignment, tar up all your code (Makefile and *.cu) as well as your README file, and upload this to Collab.

  tar jcvf assign7.tar.bz2 Makefile *.cu README

  [archiving all the files]

The easiest way to get your tar file from the 002 machines is to use a secure ftp program like SecureFX.

In your README file, include the names and UVa ID's of both partners. Also, briefly explain the algorithms you implemented and why you chose them, and any information the TAs will need to grade your assignment. You may also want to discuss the sequence of optimization steps you went through to arrive at your final algorithm. We suggest you start with the simplest solution first and then gradually refine. We'd rather see working code that isn't well optimized than well optimized code that doesn't run.

Don't forget the assigned book problems:

Submit everything to Collab no later than 10:00pm on Thursday, Nov. 20th.


Copyright Michael Boyer and Kevin Skadron, 2008.
Last revised Nov. 18 2008.