Introducing NVIDIA’s Compute Unified Device Architecture (CUDA)
This article, the first in a series, introduces readers to the NVIDIA CUDA architecture, since writing good CUDA programs requires a working knowledge of the underlying architecture.
Jack Dongarra, professor at the University of Tennessee and author of LINPACK, has said, “Graphics Processing Units have evolved to the point where many real-world applications are easily implemented on them, and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.”
Project managers often instruct developers to improve their algorithms so that the compute efficiency of their applications increases. We all know parallel processing is faster, but there was always a doubt whether it would be worth the effort and time. Not any more! Graphics Processing Units (GPUs) have evolved into flexible and powerful processors, which are now programmable using high-level languages, support 32-bit and 64-bit floating-point precision, and do not require programming in assembly. They offer a lot of computational power, and this is the primary reason that developers today are focusing on getting the maximum benefit of this extreme scalability.
In the last few years, mass marketing of multi-core GPUs has brought terascale computing power to laptops and petascale computing power to clusters. A CPU + GPU is a powerful combination, because CPUs consist of a few cores optimised for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. Serial portions of the code run on the CPU, while parallel portions run on the GPU.
The Compute Unified Device Architecture (CUDA) is a parallel programming architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA GPUs that gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs, through variants of industry-standard programming languages. Exploiting data parallelism on the GPU has become significantly easier with newer programming models like OpenACC, which provides developers with simple compiler directives to run their applications in parallel on the GPU.
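To give a flavour of how simple such directives are, here is a sketch of an OpenACC-annotated loop. The function and its names are illustrative, not taken from any particular application; an OpenACC compiler (such as NVIDIA’s `nvc` with `-acc`) maps the iterations onto GPU threads, while an ordinary C compiler simply ignores the pragma and runs the loop serially.

```c
/* A SAXPY-style loop annotated with an OpenACC directive.
   With OpenACC support, iterations are distributed across GPU
   threads; without it, the pragma is ignored and the loop runs
   serially on the CPU with identical results. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The key point is that the loop body is untouched: the directive only tells the compiler that the iterations are independent and may be run in parallel.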
Recently, at the 19th IEEE HiPC conference held in Pune, I met several delegates from academia and industry who wanted to make use of this extreme computing power to run their programs in parallel and get faster results than they would normally get using multi-core CPUs. GPUs are built for compute-intensive, highly parallel computation (which is what graphics rendering is), so more transistors are devoted to processing data rather than to data caching and flow control. You can simply take traditional C code that runs on a CPU and offload the data-parallel sections of the code to the GPU. Functions executed on the GPU are referred to as compute kernels.
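To make “data-parallel sections” concrete: a loop whose iterations do not depend on one another is exactly the kind of code worth offloading. The example below is my own illustration, not from any real application; on the CPU it runs serially, but each iteration could equally become the work of one GPU thread.

```c
/* Every iteration touches a different element and no iteration
   depends on another, so this loop is data parallel: on a GPU,
   iteration i would simply be handled by thread i. */
void scale_array(int n, float factor, float *data)
{
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
}
```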
Each NVIDIA GPU has hundreds of cores, where each core has a floating-point unit, a logic unit, a move/compare unit and a branch unit. Cores are managed by the thread manager, which can spawn and manage thousands of threads per core, and thread switching incurs virtually no overhead.
CUDA is C for parallel processors. You can write a program for one thread, and then instantiate it on many parallel threads, exploiting the inherent data parallelism of your algorithm. CUDA C code can run on any number of processors without recompilation, and you can map CUDA threads to GPU threads or to CPU vectors. CUDA threads express fine-grained data parallelism and virtualise the processors. On the other hand, CUDA thread blocks express coarse-grained parallelism, as blocks hold arrays of GPU threads.
Kernels
CUDA C extends C by allowing the programmer to define C functions, called kernels, which, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. A kernel is executed by a grid, which contains blocks.
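A minimal sketch of a kernel, in the style of NVIDIA’s CUDA C Programming Guide: the `__global__` qualifier marks a function as a kernel, and the `<<<blocks, threads>>>` syntax at the call site supplies the launch configuration. Device memory allocation and host-device copies are omitted here for brevity.

```cuda
// Kernel definition: __global__ marks a function that runs on the GPU.
// Each of the N threads executes the body once, with its own index.
__global__ void VecAdd(float *A, float *B, float *C)
{
    int i = threadIdx.x;    // built-in thread index
    C[i] = A[i] + B[i];
}

// Kernel invocation from host code: a grid of 1 block with N threads.
// A, B and C must point to device memory; allocation and copies are
// omitted in this sketch.
// VecAdd<<<1, N>>>(A, B, C);
```

Calling `VecAdd` this way performs N element-wise additions at once, one per thread, instead of looping N times on the CPU.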
The CUDA logical hierarchy (Figure 2) illustrates the points discussed above with respect to grids, blocks and threads.
A block contains a number of threads. A thread block is a collection of threads that can share data through shared memory and synchronise their execution. (Note that a block is not the same thing as a warp; the hardware executes a block’s threads in groups of 32 called warps.) Threads from different blocks operate independently, and can be used to perform different functions in parallel. Each block and each thread is identified by a ‘built-in’ block index and thread index accessible within the kernel. The launch configuration is determined by the programmer when launching the kernel on the device, by specifying the number of blocks per grid and threads per block. This is probably a lot to take in for someone who has just been introduced to the world of CUDA, but trust me, it becomes much more interesting once you sit down and start programming with CUDA.
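Choosing that launch configuration usually comes down to one idiom: pick a threads-per-block value (multiples of 32 suit the warp size), then derive blocks per grid with a ceiling division so that every element gets a thread. A small plain-C sketch, with names of my own choosing:

```c
/* Derive blocks per grid from the problem size and a chosen
   threads-per-block value, rounding up so all n elements are
   covered by at least one thread. */
int blocks_per_grid(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For 1,000,000 elements with 256 threads per block this yields 3,907 blocks. The last block is then partially idle, which is why kernels conventionally guard their body with a check such as `if (i < n)`.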
Well, I believe that by now you have a basic understanding of the CUDA thread hierarchy and the memory hierarchy. One important point to consider here is that not all applications will scale well on a CUDA device. CUDA is well suited to problems that can be broken down into thousands of smaller chunks, to make use of the massive number of threads the architecture provides. CUDA takes advantage of C, one of the most widely used programming languages, and you do not need to write your entire code in CUDA. When a section is computationally expensive, you can write it as a CUDA snippet and integrate it with your existing code, thus providing the required speedup.
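As a sketch of what such integration can look like (the kernel and wrapper names here are hypothetical, though `cudaMalloc`, `cudaMemcpy` and `cudaFree` are the standard CUDA runtime calls for moving data between host and device):

```cuda
#include <cuda_runtime.h>

// Hypothetical hot spot rewritten as a kernel.
__global__ void scale(float *d, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                 // guard: the last block may overrun n
        d[i] *= factor;
}

// Wrapper callable from existing C code: only this function knows
// about the GPU; the rest of the application is unchanged.
void scale_on_gpu(float *host, int n, float factor)
{
    float *dev;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **)&dev, bytes);                      // device buffer
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // copy in
    int tpb = 256;                                         // threads per block
    scale<<<(n + tpb - 1) / tpb, tpb>>>(dev, factor, n);   // launch
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // copy out
    cudaFree(dev);
}
```

The serial parts of the program stay in ordinary C; only the expensive loop crosses over to the device, exactly as described above.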
NVIDIA has sold more than 100 million CUDA-capable devices since 2006. With massively parallel programming reaching end users and becoming a commodity technology, it is essential for a developer to understand the architecture and its programming.
I will cover the basics of CUDA programming in an upcoming article. Till then, it would be worthwhile to put on your thinking caps and start thinking about algorithms in parallel, on devices that have redefined the world of parallel computing.