Case Study - Bob

  • Bob is working on a computer-vision project that is a machine learning project.
  • He is tasked with training a neural network on a large dataset.
  • Bob is working on Topaz as GPUs excel at performing the mathematical operations used in machine learning through shared-memory Parallelism.
  • Bob finds that his work is Memory bound as he can fit only a small number of images (‘batch size’) in the VRAM of one GPU on Topaz.
  • He solves this by modifying by using the Horovod Python package. It allows him to train his neural network over multiple GPUs, with each GPU learning on a partition of his dataset.
  • Bob’s final workflow uses a combination of shared-memory (GPU) and distributed-memory (communication between GPU nodes) parallelism.