, 1 min read
Day 1, Workshop Programming of Heterogeneous Systems in Physics
As announced in Workshop Programming of Heterogeneous Systems in Physics, July 2014, I attended this two-day conference in Jena, Germany. Below are speakers and pictures with my personal notes.
- Dipl.-Ing. Hans Pabst from Intel, Programming for the future: scaling forward with cores and vectors.  Notes: Hot spots are usually known, either BLAS or FFT, halo, Intel Xeon Phi very sensitive to unaligned data, 1.5 TB/s attached devices, provided copy of book "Intel Xeon Phi Coprocessor High-Performance Programming", by Jim Jeffers and James Reinders. The book authors thank Hans Pabst for his help. Notes: Hot spots are usually known, either BLAS or FFT, halo, Intel Xeon Phi very sensitive to unaligned data, 1.5 TB/s attached devices, provided copy of book "Intel Xeon Phi Coprocessor High-Performance Programming", by Jim Jeffers and James Reinders. The book authors thank Hans Pabst for his help. Amazon link Amazon link
- Dr.-Ing. Timo Stich from NVIDIA, NVIDIA CUDA.  Notes: There are two groups at NVIDIA, one for Windows, one for Linux, libraries are often memory limited, CUDA 6 unified memory model is not a performance feature but rather to ease the porting of software, one can use CUDA even from Java and JavaScript Notes: There are two groups at NVIDIA, one for Windows, one for Linux, libraries are often memory limited, CUDA 6 unified memory model is not a performance feature but rather to ease the porting of software, one can use CUDA even from Java and JavaScript
- Dr. Karl Rupp, Vienna, An Introduction to OpenCL for Scientific Computing.  Notes: When using MPI on multiple nodes then the interconnect is mostly the bottleneck, you can query devices in OpenCL, there is no global reduction operator in OpenCL, OpenACC is not your best pick if memory movement is important, OpenCL improved a lot with CUDA 5, OpenCL not good at CG at low dimensions $n<10^5$ Notes: When using MPI on multiple nodes then the interconnect is mostly the bottleneck, you can query devices in OpenCL, there is no global reduction operator in OpenCL, OpenACC is not your best pick if memory movement is important, OpenCL improved a lot with CUDA 5, OpenCL not good at CG at low dimensions $n<10^5$
In the evening there was a joint dinner at Berggaststätte Fuchsturm. 