Matthew O'Connell
A Fault Tolerant Grid Generation Technique
A Dissertation Presented for the Doctor of Philosophy in Computational Engineering, The University of Tennessee at Chattanooga
Matthew D. O'Connell, August 2016
Abstract:
Automatic and parallel mesh generation has been highlighted as a bottleneck for large scale automated Computational Fluid Dynamics analysis. The desire for large scale automated CFD is driven by the growing computational capabilities in large scale supercomputers. Unfortunately, as compute clusters grow in size, they also suffer more failures. Left unchecked, the increased frequency of failures may stymie any efforts to fully utilize these machines. This work aims to tackle one component required for automated large scale engineering analysis by developing a fault tolerant mesh generator. The mesh generator uses a novel communication layer written using the transport layer ZeroMQ and is made fault tolerant through an integrated in-memory checkpoint and recovery strategy. Benefits of using in-memory checkpoints vs traditional in-disk checkpoints are discussed. By relying on in-memory checkpointing, it is demonstrated that the mesh generator to be capable of generating Cartesian meshes in parallel. The generator continues to operate even while the compute cluster it is running suffers failures. The generator is shown to be high performing, including being capable of generating an 8.6 billion element mesh in just over 1 minute while creating multiple in-memory checkpoints.
Click here to access a copy of Matthew's dissertation.