Fast Neural Machine Translation in C++
The models used for the translation speed benchmarks have been described in the IWSLT paper.
We ran our experiments on an Intel Xeon E5-2620 2.40GHz server with four NVIDIA GeForce GTX 1080 GPUs. We present words-per-second rates for our NMT models using Amun and Nematus, executed on the CPU and GPU. For the CPU version we use 16 threads, translating one sentence per thread, and restrict OpenBLAS to one thread per main Nematus thread. For the GPU version of Nematus we use 5 processes to maximize GPU saturation. As a baseline, the phrase-based model reaches 455 words per second using 16 threads.
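The CPU setup is easy to mirror in code. The sketch below is a minimal illustration, not the actual decoder: `translate()` is a hypothetical stand-in for the real decoding call, and we assume OpenBLAS's `openblas_set_num_threads()` (declared in `cblas.h`) for pinning BLAS-internal threading to one thread per worker.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Declared in OpenBLAS's cblas.h; link with -lopenblas.
extern "C" void openblas_set_num_threads(int num_threads);

// Hypothetical stand-in for the actual decoder call.
std::string translate(const std::string& sentence) { return sentence; }

int main() {
  openblas_set_num_threads(1);  // one BLAS thread per worker thread

  std::vector<std::string> sentences(1000, "ein Satz .");  // dummy input
  std::vector<std::string> outputs(sentences.size());

  const std::size_t kThreads = 16;  // one sentence per thread, 16 threads
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < kThreads; ++t) {
    workers.emplace_back([&, t] {
      // Static round-robin: thread t handles sentences t, t+16, t+32, ...
      for (std::size_t i = t; i < sentences.size(); i += kThreads)
        outputs[i] = translate(sentences[i]);
    });
  }
  for (auto& w : workers) w.join();
}
```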
The CPU-bound execution of Nematus reaches 47 words per second, while the GPU-bound execution reaches 270 words per second. In similar settings, CPU-bound Marian is three times faster than Nematus on the CPU, but three times slower than Moses. With vocabulary selection (systems marked with an asterisk) we can nearly double the speed of Amun on the CPU. The GPU-executed version of Amun is more than three times faster than Nematus and nearly twice as fast as Moses, achieving 865 words per second; with vocabulary selection it reaches 1,192. Even the speed of the CPU version would already allow replacing a Moses-based SMT system with a Marian-based NMT system in a production environment without severely affecting translation throughput.
Amun also features “batched” translation, i.e. multiple sentences are translated at once on a single GPU. Since computation time for matrix products on the GPU grows sub-linearly with matrix size, we can take advantage of this by pushing multiple translations through the neural network at once. For the same models as above and a batch size of 200 (beam size 5), we achieve over 5,000 words per second on one GPU. This scales linearly with the number of GPUs used. As before, an asterisk marks systems with vocabulary filtering. Systems “Single” and “Single*” are the same as the two best systems in the first graph.
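To illustrate why batching pays off, here is a minimal sketch using a single CBLAS GEMM; the dimensions are illustrative placeholders, not the benchmark models' actual sizes. Stacking the decoder states of a batch into one matrix turns many small matrix-vector products into one large matrix product, which is where the sub-linear growth in computation time comes from.

```cpp
#include <cblas.h>
#include <vector>

int main() {
  const int batch = 200;    // sentences decoded in parallel
  const int hidden = 1024;  // decoder state size (illustrative)
  const int vocab = 32000;  // output layer size (illustrative)

  std::vector<float> states(batch * hidden, 0.1f);  // batch x hidden
  std::vector<float> W(hidden * vocab, 0.01f);      // hidden x vocab
  std::vector<float> logits(batch * vocab, 0.0f);   // batch x vocab

  // Unbatched decoding would issue `batch` separate matrix-vector
  // products; batched decoding issues one GEMM over the stacked states.
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              batch, vocab, hidden,
              1.0f, states.data(), hidden,
              W.data(), vocab,
              0.0f, logits.data(), vocab);
}
```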
We also compare training speed between a number of popular toolkits and Marian (v1.0.0). The numbers reported in this section were computed on a single GPU.
We compare models with standard settings and comparable embedding, hidden-layer and batch sizes. The first graph corresponds to the model parameters described in the OpenNMT paper, the second to Nematus default settings for embedding and hidden-layer sizes. In both cases we use a vocabulary size of 32,000 subword units. The models were trained on German-English WMT data. Nematus-array is Nematus run with the new CUDA backend libgpuarray. Blue bars show training speed on an NVIDIA GTX 1080 GPU, green bars on a Titan X with Pascal architecture.
We report training speed in thousands of source tokens per second for different model types trained with Marian v1.5 (see the Marian demonstration paper for more details).
All model types benefit from using more GPUs. Scaling is not linear (dashed lines), but close. The tokens-per-second rate (w/s) for Nematus on the same data on a single GPU is about 2,800 w/s for the shallow model; Nematus does not support multi-GPU training. For identical models, Marian trains about 4 times faster on a single GPU and about 30 times faster on 8 GPUs.
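These factors can be sanity-checked with the numbers quoted above. The following back-of-the-envelope sketch, using only the values from this paragraph, computes the implied throughputs and the 8-GPU scaling efficiency:

```cpp
#include <cstdio>

int main() {
  const double nematus = 2800.0;          // w/s, Nematus on a single GPU
  const double marian1 = 4.0 * nematus;   // ~11,200 w/s on 1 GPU
  const double marian8 = 30.0 * nematus;  // ~84,000 w/s on 8 GPUs

  // Ideal linear scaling would give 8 * marian1; the ratio of the
  // measured to the ideal throughput is the scaling efficiency,
  // close to but below 1 (the dashed lines in the graph).
  double efficiency = marian8 / (8.0 * marian1);
  std::printf("1 GPU: %.0f w/s, 8 GPUs: %.0f w/s, efficiency: %.2f\n",
              marian1, marian8, efficiency);  // efficiency: 0.94
}
```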