CUDA
February 9, 2007
High-Performance Computing Using GPUs
(Uso de placas gráficas em computação de alto desempenho)
Mario Alexandre Gazziro (YAH!)
Advisor: Jan F. W. Slaets
24/09/08
2/9/07 Course Title 2
Part I: Overview
Definition: Introduced in 2006, the Compute Unified Device Architecture (CUDA) is a
combined software and hardware architecture (available for NVIDIA G80 GPUs and above) that enables data-parallel, general-purpose computing on graphics hardware. It offers a C-like programming API with some language extensions.
Key Points: The architecture supports massively multithreaded
applications and provides inter-thread communication and general memory access.
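As a minimal sketch of that C-like API, a complete program might look like the following (the kernel and variable names are illustrative, not from the slides; only the CUDA runtime calls are standard):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Kernel: each thread writes its own global index into the array */
__global__ void fill_indices(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main(void)
{
    const int n = 256;
    int host[256];
    int *dev;

    cudaMalloc((void **)&dev, n * sizeof(int));

    /* Launch 2 blocks of 128 threads each (2 * 128 = 256 threads) */
    fill_indices<<<2, 128>>>(dev, n);

    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("host[0]=%d host[%d]=%d\n", host[0], n - 1, host[n - 1]);

    cudaFree(dev);
    return 0;
}
```

The `__global__` qualifier and the `<<<grid, block>>>` launch syntax are the main language extensions over plain C.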
Why is this topic important?
Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements.
Emerging hardware technologies such as the CUDA architecture can significantly boost the performance of a wide range of applications by increasing compute cycles and bandwidth and reducing latency.
Where would I encounter this?
Gaming
Raytracing
3D Scanners
Computer Graphics
Number Crunching
Scientific Calculation
CUDA SDK sample applications
CUDA vs Intel
NVIDIA GeForce 8800 GTX vs. Intel Xeon E5335 (2 GHz, 8 MB L2 cache)
Grid of thread blocks
The computational grid consists of a grid of thread blocks:
• Each thread executes the kernel
• The application specifies the grid and block dimensions
• Grids can be 1- or 2-dimensional; blocks can be 1-, 2-, or 3-dimensional
• The maximal sizes are fixed hardware limits of the GPU (e.g. 512 threads per block on G80)
• Each block has a unique block ID
• Each thread has a unique thread ID (within its block)
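The block and thread IDs combine into a global element index. A sketch of this mapping for a 2-D grid (kernel and helper names are our own; dimensions are illustrative):

```cuda
/* Each thread computes its global (x, y) position from its block ID,
   the block dimensions, and its thread ID within the block. */
__global__ void global_id_demo(int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  /* column */
    int y = blockIdx.y * blockDim.y + threadIdx.y;  /* row    */
    if (x < width && y < height)
        out[y * width + x] = y * width + x;         /* row-major index */
}

/* Host side: a 2-D grid of 16x16-thread blocks, rounded up so the
   grid covers the whole (width x height) domain. */
void launch(int *dev_out, int width, int height)
{
    dim3 block(16, 16);
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    global_id_demo<<<grid, block>>>(dev_out, width, height);
}
```

The bounds check `if (x < width && y < height)` is needed because the rounded-up grid may spawn more threads than there are elements.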
Elementwise Matrix Addition
The nested for-loops are replaced with an implicit grid
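A sketch of what such a kernel might look like (the name matAdd is illustrative, not from the slides):

```cuda
/* CPU version: nested loops over rows and columns,
       for (i...) for (j...) C[i][j] = A[i][j] + B[i][j];
   GPU version: the loops become the grid; each thread adds one element. */
__global__ void matAdd(const float *A, const float *B, float *C, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i + j * N] = A[i + j * N] + B[i + j * N];
}
```

Each thread's (i, j) pair plays the role of one iteration of the former loop nest.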
Memory model
CUDA exposes all the different types of memory on the GPU: registers, local memory, shared memory, global memory, constant memory, and texture memory.
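An illustrative sketch of how several of these memory spaces appear in code (the kernel and names are our own; the qualifiers are standard CUDA):

```cuda
__constant__ float coeff[16];         /* constant memory: cached, read-only   */
__device__   float global_buf[1024];  /* global memory: large, high latency   */

__global__ void memory_demo(float *out)
{
    __shared__ float tile[128];       /* shared memory: per-block, on-chip    */
    float r = coeff[0];               /* held in a register: per-thread, fast */

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = global_buf[idx];
    __syncthreads();                  /* threads in a block communicate
                                         through shared memory after a sync   */
    out[idx] = tile[threadIdx.x] + r;
}
```

Choosing the right space for each piece of data (e.g. staging reused values in shared memory) is the main lever for CUDA performance tuning.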
Part II: Accelerating MATLAB with CUDA
Case Study: initial calculation for solving a sparse matrix in the method proposed by Professor Guilherme Sipahi (IFSC)
N = 1001;
K(1:N) = rand(1,N);
g1(1:2*N) = rand(1,2*N);
k = 1.3;

tic;
for i = 1:N
    for j = 1:N
        M(i,j) = g1(N+i-j)*(K(i)+k)*(K(j)+k);
    end
end
matlabTime = toc

tic;
M = guilherme_cuda(K,g1);
cudaTime = toc

speedup = matlabTime/cudaTime
Results: speedup of 4.77× using an NVIDIA GeForce 8400M with 128 MB
matlabTime =
10.6880
cudaTime =
2.2406
speedup =
4.7701
>>
The MEX file structure
The main() function is replaced with mexFunction.
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    /* code that handles the interface and calls the computational function */
    return;
}
mexFunction arguments:
- nlhs: The number of lhs (output) arguments.
- plhs: Pointer to an array which will hold the output data; each element is of type mxArray.
- nrhs: The number of rhs (input) arguments.
- prhs: Pointer to an array which holds the input data; each element is of type const mxArray.
MX Functions
The collection of functions used to manipulate mxArrays is called the MX functions, and their names begin with mx. Examples:
• mxArray creation functions: mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateString, mxCreateDoubleScalar.
• Access data members of mxArrays: mxGetPr, mxGetPi, mxGetM, mxGetN.
• Modify data members: mxSetPr, mxSetPi.
• Manage mxArray memory: mxMalloc, mxCalloc, mxFree, mxDestroyArray.
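To make the gateway structure concrete, here is a hypothetical minimal MEX file (entirely our own example, not from the slides) that uses the MX functions above to return the input matrix with every element doubled:

```cuda
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int m, n, i;
    double *in, *out;

    if (nrhs != 1)
        mexErrMsgTxt("One input argument required.");

    m  = mxGetM(prhs[0]);               /* rows              */
    n  = mxGetN(prhs[0]);               /* columns           */
    in = mxGetPr(prhs[0]);              /* real data pointer */

    /* Create the output mxArray and get a pointer to its data */
    plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);
    out = mxGetPr(plhs[0]);

    for (i = 0; i < m * n; i++)
        out[i] = 2.0 * in[i];
}
```

From MATLAB this would be called like any function, e.g. B = double_me(A), once compiled with mex.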
MEX file for CUDA used in case study – Part 1
Compilation instructions under MATLAB:
nvmex -f nvmexopts.bat square_me_cuda.cu -IC:\cuda\include -LC:\cuda\lib -lcufft -lcudart
#include "cuda.h"
#include "mex.h"

/* Kernel to compute elements of the array on the GPU */
__global__ void guilherme_kernel(float* K, float* g1, float* M, int N)
{
    float k = 1.3f;   /* must be float: declaring it int truncates 1.3 to 1 */
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i < N && j < N)
        M[i + j*N] = g1[N + i - j]*(K[i] + k)*(K[j] + k);
}
MEX file for CUDA used in case study – Part 2
/* Gateway function */
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int j, m_0, m_1, m_o, n_0, n_1, n_o;
    double *data1, *data2, *data3;
    float *data1f, *data2f, *data3f;
    float *data1f_gpu, *data2f_gpu, *data3f_gpu;
    mxClassID category;

    if (nrhs != (nlhs+1))
        mexErrMsgTxt("The number of input and output arguments must be the same.");

    /* Find the dimensions of the data */
    m_0 = mxGetM(prhs[0]);
    n_0 = mxGetN(prhs[0]);

    /* Create an input data array on the GPU */
    cudaMalloc((void **) &data1f_gpu, sizeof(float)*m_0*n_0);

    /* Retrieve the input data */
    data1 = mxGetPr(prhs[0]);

    /* Check if the input array is single or double precision */
    category = mxGetClassID(prhs[0]);
    if (category == mxSINGLE_CLASS)
    {
        /* The input array is single precision: it can be sent directly to the card */
        cudaMemcpy(data1f_gpu, data1, sizeof(float)*m_0*n_0, cudaMemcpyHostToDevice);
    }
MEX file for CUDA used in case study – Part 3
    /* Find the dimensions of the data */
    m_1 = mxGetM(prhs[1]);
    n_1 = mxGetN(prhs[1]);

    /* Create an input data array on the GPU */
    cudaMalloc((void **) &data2f_gpu, sizeof(float)*m_1*n_1);

    /* Retrieve the input data */
    data2 = mxGetPr(prhs[1]);

    /* Check if the input array is single or double precision */
    category = mxGetClassID(prhs[1]);
    if (category == mxSINGLE_CLASS)
    {
        /* The input array is single precision: it can be sent directly to the card */
        cudaMemcpy(data2f_gpu, data2, sizeof(float)*m_1*n_1, cudaMemcpyHostToDevice);
    }

    /* Find the dimensions of the output */
    m_o = n_0;
    n_o = n_1;

    /* Create an output data array on the GPU */
    cudaMalloc((void **) &data3f_gpu, sizeof(float)*m_o*n_o);
MEX file for CUDA used in case study – Part 4
    /* Compute the execution configuration: 2-D blocks of 16x16 threads,
       with the grid rounded up to cover the whole output (the kernel
       indexes in two dimensions, so the launch must be 2-D as well) */
    dim3 dimBlock(16, 16);
    dim3 dimGrid((m_o + dimBlock.x - 1)/dimBlock.x,
                 (n_o + dimBlock.y - 1)/dimBlock.y);

    /* Call function on GPU; N is the matrix dimension */
    guilherme_kernel<<<dimGrid, dimBlock>>>(data1f_gpu, data2f_gpu, data3f_gpu, m_o);

    data3f = (float *) mxMalloc(sizeof(float)*m_o*n_o);

    /* Copy result back to host */
    cudaMemcpy(data3f, data3f_gpu, sizeof(float)*n_o*m_o, cudaMemcpyDeviceToHost);

    /* Create an mxArray for the output data */
    plhs[0] = mxCreateDoubleMatrix(m_o, n_o, mxREAL);

    /* Create a pointer to the output data */
    data3 = mxGetPr(plhs[0]);

    /* Convert the single-precision result back to double for MATLAB */
    for (j = 0; j < m_o*n_o; j++)
        data3[j] = (double) data3f[j];

    /* Free device and host scratch memory */
    cudaFree(data1f_gpu);
    cudaFree(data2f_gpu);
    cudaFree(data3f_gpu);
    mxFree(data3f);
}
Part III: Device options
GPU Model   Memory           Threads   Price (R$)
8600 GT     256 MB            3,072      150.00
8600 GT     512 MB            3,072      300.00
8800 GT     512 MB           12,288      800.00
9800 GTX    512 MB (DDR3)    12,288    1,200.00
9800 GX2    1 GB (DDR3)      24,576    2,500.00
References
Gokhale, M. et al. "Hardware Technologies for High-Performance Data-Intensive Computing." IEEE Computer, p. 60, 2008.
Lietsch, S. et al. "A CUDA-Supported Approach to Remote Rendering." Lecture Notes in Computer Science, 2007.
Fujimoto, N. "Faster Matrix-Vector Multiplication on GeForce 8800 GTX." IEEE, 2008.
Book Reference
NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 1.1, 2007.
Questions?
So long, and thanks for all the fish!