Special Lecture: Beyond Manycores Comes Heterogeneous Accelerated Computing:
Mauricio Breternitz Jr, Ph.D., Advanced Software and Analytics Technology Group, Advanced Micro Devices, Austin, TX, on 15/02 at 15:30, IC Auditorium, Room 85 - IC 2.
| What | Lecture |
|---|---|
| When | 15/02/2011 from 15:30 to 17:30 |
| Where | IC Auditorium (Room 85) - IC2 |
The microprocessor industry has recently undergone, and is still absorbing, a transition to multiple execution cores. This change was motivated by exponential increases in power consumption and area costs, which precluded continued growth in clock frequency and single-core complexity. CPU vendors currently place multiple cores (tens) on a single chip, and software vendors are still looking for new parallelization technology.

In about the same time frame, parallel processing was being used efficiently in special applications such as scientific computing and graphics processing. Graphics processing units (GPUs) capable of multiple gigaflops became common and affordable. Once the potential of heterogeneous systems using GPU acceleration was recognized, the search for GPGPU (general-purpose GPU computing) opportunities began. However, the heterogeneous processor (e.g. a GPU) is usually connected to the system via an attached bus (PCI), which introduces extra complexity, latency, and bandwidth limitations that must be overcome. Furthermore, the SIMD execution model requires highly regular parallelism, which is not present in all computations. Still, there are a good number of applications for which this solution is cost-effective.

Traditionally, GPU instruction sets are not exposed beyond the operating-system-specific drivers, and as such they are neither public nor stable across generations. To enable widespread adoption and compatibility across generations, approaches such as CUDA and the open-standards-based OpenCL have emerged. A recent development prescribes a tight integration of scalar execution cores with a highly parallel accelerator via access to shared memory. This organization is called an APU (accelerated processing unit).
The scalar execution cores provide efficient execution for the 'serial portion' of applications, the multiplicity of cores provides efficient speedup in situations in which a limited amount of parallelism has been identified, and the highly parallel accelerator provides cost-effective, low-power performance. APUs enable a wider class of applications beyond the regular, highly parallel computing pattern required for GPU acceleration. The advent of APUs introduces a new set of challenges: the programmer (along with the system runtime) must decide, at each step of the computation, which execution resource provides the most efficient performance. We describe challenges, initial results, and potential research ideas to improve the programmability and efficiency of such systems.
