High Performance Computing

Simulation of fluid flows can be prohibitively expensive, especially for high Reynolds number (turbulent) flows, where vortices span dramatically different sizes. To resolve these vortices, especially the smallest ones, the computational grid has to be very fine, resulting in excessive computational cost. Meanwhile, a finer grid requires a smaller time step (through the CFL stability condition), which further increases the overall cost. For example, a direct numerical simulation (DNS) of a turbulent flow at moderate Reynolds number using a single CPU core can take hundreds of years to reach a converged solution (this is not a joke).
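To make the cost growth concrete, here is a back-of-the-envelope calculation (a sketch only; the factor-of-two refinement and the CFL-limited explicit time step are illustrative assumptions, not figures from a specific solver): halving the grid spacing in 3D multiplies the cell count by 8, and the CFL condition roughly halves the allowable time step, so the total work grows by about a factor of 16.

```python
# Back-of-the-envelope cost of refining a 3D grid by a factor r.
# Assumes an explicit scheme whose time step is CFL-limited (illustrative).
def cost_factor(r, dims=3):
    cells = r ** dims   # cell count grows as r^dims
    steps = r           # CFL: time step shrinks in proportion to grid spacing
    return cells * steps  # total work multiplier

print(cost_factor(2))   # halving the spacing in 3D -> 16x more work
print(cost_factor(10))  # a 10x finer grid -> 10,000x more work
```

This fourth-power growth in 3D is why brute-force grid refinement quickly becomes infeasible on a single core.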

High performance computing (HPC) on massively parallel computer clusters is one of the most effective ways to address this challenge. The figure below shows a schematic of the structure of a typical HPC cluster. The compute nodes are interconnected (e.g., through Ethernet or InfiniBand switches) so that they can communicate over a local network. The master node (or login node) is connected to both this local network and the outside world, allowing users to remotely manage jobs that are running (or waiting to run) on the compute nodes. Each node usually has its own local storage, memory, and operating system installation, kept consistent across the cluster. The cluster can also mount external shared file systems with very large capacity for data storage.
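On many clusters, users interact with the compute nodes from the login node through a batch scheduler. The fragment below is a minimal, hypothetical example assuming a SLURM-managed cluster; the job name, node counts, and solver binary (`flow_solver`) are placeholders, not details from any system described here.

```shell
#!/bin/bash
# Hypothetical SLURM batch script submitted from the login node
# with "sbatch job.sh"; resource names and binary are illustrative.
#SBATCH --job-name=flow-sim
#SBATCH --nodes=4              # number of compute nodes
#SBATCH --ntasks-per-node=16   # MPI ranks per node
#SBATCH --time=24:00:00        # wall-clock limit

# Launch the MPI solver across all allocated cores.
srun ./flow_solver input.cfg
```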

My experience with HPC started in 2009 when I was a Ph.D. student at the University of Notre Dame doing research on compressible turbulent flow simulation using a solver called "CDP" that was developed at Stanford University's Center for Turbulence Research. I initially ran test cases on a cluster with 200+ CPU cores, and then moved to much larger supercomputers (with up to hundreds of thousands of cores) located at national computing centers. The pictures below show some of the supercomputers that I have worked on over the years. In addition to being a supercomputer user, I also rebuilt and managed a small cluster with 240 cores at the Computational Aero and Hydrodynamics Laboratory at the George Washington University (see the CAHL Cluster WikiPage for more information).

My experience with parallel program development dates back to 2010. During that time, I developed multiple parallel post-processing programs (for example, parallel interpolation and integration programs on unstructured grids) using Fortran and the Message Passing Interface (MPI) library. Then in 2012, I started developing my own high-order accurate and massively parallel flow solvers that work on dynamic moving unstructured grids in both 2D and 3D. The underlying idea for parallelization is straightforward: we decompose a mesh into small partitions and distribute each partition onto a processor for computation; each processor then exchanges information with its neighboring processors by calling MPI subroutines. For example, the figure on the left shows the partitions of a simple mesh, where each color represents a partition. The figure on the right shows the parallel efficiency of one of my solvers on a mesh with several thousand grid elements. The solver achieves nearly ideal speedup as the scheme order increases. Developing a good parallel program requires in-depth understanding of not only the software but also the hardware.
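The decompose-and-exchange idea can be sketched without a real MPI runtime. The toy below is a serial Python mock-up (the actual solvers described above are Fortran/MPI codes on unstructured meshes, so everything here is a simplifying assumption): it splits a 1D grid of cells into contiguous chunks, one per "rank", gives each rank ghost copies of its neighbors' boundary cells, and performs one 3-point averaging sweep — the step whose ghost-cell reads would become MPI point-to-point calls in a real solver.

```python
def partition(n_cells, n_ranks):
    """Split cell indices 0..n_cells-1 into contiguous chunks, one per rank."""
    base, extra = divmod(n_cells, n_ranks)
    parts, start = [], 0
    for r in range(n_ranks):
        size = base + (1 if r < extra else 0)  # spread the remainder evenly
        parts.append(list(range(start, start + size)))
        start += size
    return parts

def halo_exchange_step(values, parts):
    """One 3-point averaging sweep; each 'rank' reads only its own cells
    plus ghost copies of its neighbors' boundary cells."""
    new = values[:]
    for cells in parts:
        lo, hi = cells[0], cells[-1]
        # Ghost values: what MPI send/receive pairs would deliver from the
        # neighboring ranks (boundary cells just reuse their own value here).
        left_ghost = values[lo - 1] if lo > 0 else values[lo]
        right_ghost = values[hi + 1] if hi < len(values) - 1 else values[hi]
        for i in cells:
            left = left_ghost if i == lo else values[i - 1]
            right = right_ghost if i == hi else values[i + 1]
            new[i] = (left + values[i] + right) / 3.0
    return new

parts = partition(8, 3)   # -> [[0, 1, 2], [3, 4, 5], [6, 7]]
vals = [0.0] * 8
vals[4] = 9.0             # a spike that diffuses across partition boundaries
vals = halo_exchange_step(vals, parts)
```

In a real distributed run each rank would store only its own chunk plus the ghost layer, and the ghost values would arrive via MPI calls rather than array reads; the amount of data crossing partition boundaries, relative to the interior work, is what governs the parallel efficiency mentioned above.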