
# MT Marathon 2019

Last updated: 24 November 2020

## Introduction

In this tutorial we will learn how to do efficient neural machine translation using the Marian toolkit by optimizing the speed, accuracy and use of resources for training and decoding of NMT models.

No background knowledge about Marian is required, but if you are completely new to neural machine translation and Marian, you may want to take a look at the introductory tutorial to Marian first. We assume that the reader is familiar with basic Linux commands.

The tutorial requires Marian version 1.7.12 or later, which is currently only available in the marian-dev repository. However, most parts also apply to older versions of Marian.

### Installation

Marian can be compiled on machines with NVIDIA GPU devices and CUDA 8.0+ or on CPU-only machines. For this tutorial we need Marian compiled with support for CPU and SentencePiece. This requires installing Intel MKL and Protocol Buffers first.

git clone https://github.com/marian-nmt/marian-dev
cd marian-dev
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on
make -j
cd ../..


In this tutorial we assume that Marian is compiled in your home directory. If everything worked correctly you can display the list of options with:

~/marian-dev/build/marian --help |& less


### Models and data

First we need to download models and data (611MB) that we will use for the tutorial:

wget http://data.statmt.org/romang/marian-examples/marian-tutorial-mtm19.tgz
tar zxvf marian-tutorial-mtm19.tgz
cd marian-tutorial-mtm19


This will give you the following files:

marian-tutorial-mtm19
├ 1_decoding
│   ├ data
│   │   ├ newstest2013.de
│   │   ├ newstest2013.en
│   │   ├ newstest2014.de
│   │   └ newstest2014.en
│   ├ model.npz
│   ├ run-me.sh
│   └ vocab.spm
├ 2_training
│   ├ config.yml
│   ├ run-me.sh
│   └ train.log
├ 3_student
│   ├ data
│   │   ├ newstest2014.bpe.en
│   │   ├ newstest2014.de
│   ├ lex.s2t
│   ├ model.student.base
│   │   └ model.npz
│   ├ model.student.small
│   │   └ model.npz
│   ├ model.student.small.aan.nogate.noffn
│   │   └ model.npz
│   ├ run-me.sh
│   └ vocab.ende.yml
├ detruecase.perl
└ multi-bleu.perl


## 1. Decoding

For this part of the tutorial we will use an RNN model trained on the WMT’14 English-German corpus provided by the Stanford NLP Group pre-processed with a true-caser and SentencePiece-based subword segmentation. The model is in the marian-tutorial-mtm19/1_decoding folder.

Models trained with Marian can be used for translation with the marian-decoder command. The basic usage requires specifying paths to the model and vocabularies:

~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm < data/newstest2014.en > output.de


where newstest2014.en is an input file with pre-processed source sentences and output.de is an output file with translations.

Now we want to calculate the speed and quality of the translation. Marian already prints the total time required to translate the input file (exclusive of loading times of models and vocabularies) at the end of the decoding, for example:

[2019-08-25 13:17:33] Total time: 119.24629s wall
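When comparing many decoder settings, it helps to pull the timing out of the log automatically. The helper below is a minimal sketch: `extract_total_time` is a hypothetical name, and the pattern assumes the log line has exactly the format shown above. Marian normally writes its log to stderr, so in practice you would redirect it to a file (for example with `2> decode.log`) and feed that file to the function.

```shell
# Print the wall-clock seconds from the last "Total time" line on stdin.
extract_total_time() {
    sed -n 's/.*Total time: \([0-9.][0-9.]*\)s wall.*/\1/p' | tail -n 1
}

# Demo on the log line shown above.
echo '[2019-08-25 13:17:33] Total time: 119.24629s wall' | extract_total_time
```

For the sample line this prints `119.24629`.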

To estimate the quality, we need to run a BLEU scorer:

cat output.de | ../multi-bleu.perl data/newstest2014.de


This should display:

BLEU = 23.90, 56.0/29.6/17.7/11.1 (BP=1.000, ratio=1.003, hyp_len=59469, ref_len=59297)
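To collect scores from several runs into a table, the BLEU value can be cut out of the scorer output. This is a small sketch, assuming the output format of multi-bleu.perl shown above; `extract_bleu` is a hypothetical helper name.

```shell
# Print only the corpus-level BLEU value from multi-bleu.perl output on stdin.
extract_bleu() {
    sed -n 's/^BLEU = \([0-9.][0-9.]*\),.*/\1/p'
}

# Demo on the scorer output shown above.
echo 'BLEU = 23.90, 56.0/29.6/17.7/11.1 (BP=1.000, ratio=1.003, hyp_len=59469, ref_len=59297)' | extract_bleu
```

For the sample line this prints `23.90`.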

### Batched decoding

Batched decoding parallelizes the translation of multiple sentences. It generates translations for whole mini-batches at once and significantly increases translation speed, roughly by a factor of 10 or more.

In Marian there are a couple of options that are important for batched translation:

• --mini-batch enables translation with a mini-batch size of N, i.e. translating N sentences at once.
• --mini-batch-words specifies the size of a mini-batch as a number of words rather than a number of sentences.
• --maxi-batch preloads M mini-batches, i.e. M x N sentences.
• --maxi-batch-sort sorts the preloaded sentences by source sentence length, which allows for better packing of the batch.

Another important option is --workspace or -w, which sets the working memory. The default working memory is 512 MB, and Marian will increase it to match requirements during translation, but pre-allocating memory usually makes translation a bit faster.

Last but not least, you can also parallelize translation by running it on multiple GPUs using the --devices option.
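The batching options above can be combined in one command. The following is a sketch, not a tuned configuration: the values 64, 100 and 4000 are illustrative assumptions, `decode_batched` is a hypothetical wrapper name, and the `MARIAN_DECODER` variable is only a convenience for pointing at a different build. It is meant to be run from the 1_decoding folder.

```shell
# Path to the decoder; override MARIAN_DECODER if Marian lives elsewhere.
MARIAN_DECODER="${MARIAN_DECODER:-$HOME/marian-dev/build/marian-decoder}"

# Translate stdin to stdout with batching enabled.
# The values 64, 100 and 4000 (MB) are illustrative, not tuned.
decode_batched() {
    "$MARIAN_DECODER" -m model.npz -v vocab.spm vocab.spm \
        --mini-batch 64 --maxi-batch 100 --maxi-batch-sort src \
        --workspace 4000
}

# Run only when the decoder and the test set are actually present.
if [ -x "$MARIAN_DECODER" ] && [ -f data/newstest2014.en ]; then
    decode_batched < data/newstest2014.en > output.batched.de
fi
```

Changing one option at a time makes it easier to see which of them contributes most to the speed-up.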

Translate newstest2014.en using different settings for batched translation, compare decoding times, and complete the table below.

Which parameter improves the translation speed the most? Are you getting exactly the same outputs from line-by-line and batched translation?

| System | Speed (sec.) |
|---|---|
| Line-by-line translation | |
| Switching to mini-batch of 64 | |
| Increasing workspace to 4GB | |

### Optimizing parameters

There are a few options that impact the translation speed and quality, and beam size is one of the most important. It determines the number of translation hypotheses kept at each step of the search. Exploring too few hypotheses per step may lead to a globally suboptimal translation, while using too large a beam size may increase resource usage and computation time without improving the translation quality much.

Options that can be easily tuned include:

• --beam-size determines the number of translation hypotheses explored at each step of the beam search algorithm. Common beam sizes are 8 to 12, depending on the model.
• --max-length-factor sets maximum target length as source length times factor. The default value is 3.
• --normalize divides translation score by the translation length to the power of $$\alpha$$.
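The effect of length normalization can be seen in a small worked example. This is an illustrative sketch, assuming that the translation score is the sum of token log-probabilities $$\log P(y \mid x)$$ and that the length $$|y|$$ is counted in target tokens; the numbers are made up for the illustration.

$\text{score}_{\alpha}(y) = \frac{\log P(y \mid x)}{|y|^{\alpha}}$

Take a short hypothesis with score $$-6.0$$ and 8 tokens, and a long one with score $$-12.0$$ and 20 tokens. With $$\alpha = 0$$ the scores stay at $$-6.0$$ and $$-12.0$$, so the short hypothesis wins. With $$\alpha = 1$$ they become $$-6.0 / 8 = -0.75$$ and $$-12.0 / 20 = -0.6$$, so the long hypothesis wins. Larger values of $$\alpha$$ therefore favor longer translations.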

Explore different configurations of the beam size and maximum length factor and try to improve the translation time while keeping BLEU at a good level. Next, grid-search the value of the length normalization parameter on newstest2013 and check whether the improvement in BLEU carries over to newstest2014.

What are reasonable values for the beam size? What is the danger of using too small a value for the maximum length factor?

Different beam sizes can be tried with a simple bash loop:

for i in 1 2 4 8 12 24; do
~/marian-dev/build/marian-decoder -b $i [OTHER OPTIONS] < data/newstest2013.en > output.b$i.de
done


Reasonable values to consider for --max-length-factor are in the range of 1 to 3. The normalization parameter $$\alpha$$ can be grid-searched for values around 1.0, usually in the range of 0.2 to 2.
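The grid search over $$\alpha$$ can be sketched in the same style as the beam-size loop. The function names `grid_search_normalize` and `pick_best` and the grid values are assumptions made for this illustration; scoring follows the multi-bleu.perl call used earlier, and the script is meant to be run from the 1_decoding folder.

```shell
MARIAN_DECODER="${MARIAN_DECODER:-$HOME/marian-dev/build/marian-decoder}"

# Read "alpha BLEU" pairs on stdin and print the pair with the highest BLEU.
pick_best() {
    LC_ALL=C sort -k2,2nr | head -n 1
}

# Decode newstest2013 once per alpha and print "alpha BLEU" for each run.
grid_search_normalize() {
    for alpha in "$@"; do
        "$MARIAN_DECODER" -m model.npz -v vocab.spm vocab.spm \
            --normalize "$alpha" \
            < data/newstest2013.en > "output.n$alpha.de"
        bleu=$(../multi-bleu.perl data/newstest2013.de < "output.n$alpha.de" \
            | sed -n 's/^BLEU = \([0-9.][0-9.]*\),.*/\1/p')
        echo "$alpha $bleu"
    done
}

# Run only when the decoder and the development set are actually present.
if [ -x "$MARIAN_DECODER" ] && [ -f data/newstest2013.en ]; then
    grid_search_normalize 0.2 0.6 1.0 1.4 2.0 | pick_best
fi
```

The best value found on newstest2013 should then be checked once on newstest2014, so that the test set is not used for tuning.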

### Back-translation

Generating back-translations often requires translating a large amount of monolingual data (sometimes even hundreds of millions of sentences), so optimizing this process can save you quite a lot of time.

1. If your dataset is large, consider splitting it into smaller chunks (see GNU split) and translating the chunks sequentially. This will not make your translation faster, but it will make it more stable.
2. Use batched translation and adjust your decoding parameters to speed up the translation.
3. Because monolingual data, especially if it comes from web crawling, can contain very long sentences, use a smaller --max-length and turn on --max-length-crop.
4. Use sampled back-translations with --output-sampling.
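The steps above can be sketched as two small shell functions. This is a minimal sketch under stated assumptions: `split_corpus` and `translate_chunks` are hypothetical helper names, `mono.en` in the usage comment is a placeholder file name, and the chunk size and --max-length value are illustrative. Skipping chunks that already have output is a design choice added here so that an interrupted run can be resumed.

```shell
MARIAN_DECODER="${MARIAN_DECODER:-$HOME/marian-dev/build/marian-decoder}"

# Split a monolingual file into chunks of a fixed number of lines.
# split names the chunks <prefix>aa, <prefix>ab, and so on.
split_corpus() {
    # $1 = input file, $2 = lines per chunk, $3 = chunk prefix
    split -l "$2" "$1" "$3"
}

# Translate every chunk in turn, skipping chunks that already have output.
translate_chunks() {
    # $1 = chunk prefix; remaining arguments are passed to the decoder
    prefix=$1
    shift
    for chunk in "$prefix"??; do
        [ -f "$chunk" ] || continue
        [ -s "$chunk.out" ] && continue
        # Write to a temporary file first so that a partial output
        # is never mistaken for a finished chunk.
        "$MARIAN_DECODER" -m model.npz -v vocab.spm vocab.spm "$@" \
            < "$chunk" > "$chunk.tmp" && mv "$chunk.tmp" "$chunk.out"
    done
}

# Usage (placeholder file name and illustrative values):
#   split_corpus mono.en 1000000 mono.chunk.
#   translate_chunks mono.chunk. --max-length 100 --max-length-crop --output-sampling
```

The translated chunks can be concatenated afterwards in the same order, for example with `cat mono.chunk.??.out`.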

To test sampled translations, run several times:

head data/newstest2014.en | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm \
--output-sampling


## 2. Training

Efficient training of an NMT model might mean minimizing the training time or improving convergence. It also means choosing the model architecture that is best suited to your needs.

### Model architecture

Transformer models often offer the best quality, but they are also more difficult to train than RNNs and need more careful setting of training parameters. A good starting point for finding a suitable set of training hyper-parameters is our repository with Marian examples.

Since version 1.7.12, Marian has the --task option, which provides pre-defined settings for transformer-base and transformer-big:

~/marian-dev/build/marian --task transformer-base --dump-config expand


You should keep in mind that those settings are just a starting point and they should be adjusted for your scenario. For example, a low-resource scenario will probably require stronger regularization, etc.

### GPU memory

Training of some model architectures like Transformer may benefit from large mini-batches.

--mini-batch-fit overrides the specified mini-batch size and automatically chooses the largest mini-batch for a given sentence length that fits the specified memory. When --mini-batch-fit is set, memory requirements are guaranteed to fit into the specified workspace. Choosing too small a workspace will result in small mini-batches, which can hinder learning.

In a multi-GPU setting, synchronous training (--sync-sgd) can also serve as a method to increase the effective size of mini-batches.

The option --workspace sets the size of the memory available for the forward and backward steps of the training procedure. This does not include the model and optimizer parameters, which are allocated outside the workspace. Hence you cannot allocate all GPU memory to the workspace.

It is a bit tricky to determine the effective size of a mini-batch with dynamically sized mini-batches. An alternative is to operate in terms of the total workspace size that is used to generate mini-batches for one update, which can be calculated with the following formula:

$\text{Workspace (MB)} \times \text{Number of GPUs} \times \text{Delayed updates}$

Cumulative or delayed gradient updates increase the effective batch size, i.e. the amount of data used at each step of training. They can be enabled with --optimizer-delay N.
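The formula above is plain multiplication, so candidate configurations are easy to check in the shell. In this sketch `total_workspace_mb` is a hypothetical helper name, and the example configurations are illustrative.

```shell
# Total workspace per update in MB: workspace x GPUs x delayed updates.
total_workspace_mb() {
    # $1 = --workspace in MB, $2 = number of GPUs, $3 = --optimizer-delay
    echo $(( $1 * $2 * $3 ))
}

total_workspace_mb 8000 1 6   # one GPU, six delayed updates
total_workspace_mb 6000 4 2   # four GPUs, two delayed updates
```

Both calls print `48000`, which shows that a single GPU with more delayed updates can reach the same total workspace per update as four GPUs with fewer.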

Set up training of a transformer-base model with a total workspace of ca. 48 GB per update, and calculate the average size of mini-batches from the training logs.

cd ../2_training


Start with the following command and set --workspace, --devices and --optimizer-delay options:

~/marian-dev/build/marian --task transformer-base -c config.yml


The optimal values for these options will differ depending on the number of GPUs and available memory per single GPU.
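As one possible configuration, the command could be wrapped as follows. This is a sketch that assumes four GPUs with enough free memory for a 6000 MB workspace each; `train_transformer_base` is a hypothetical wrapper name, and the values should be adapted to your hardware.

```shell
MARIAN="${MARIAN:-$HOME/marian-dev/build/marian}"

# Start training with an explicit workspace, update delay and device list.
train_transformer_base() {
    # $1 = workspace in MB, $2 = optimizer delay, remaining = GPU device ids
    workspace=$1
    delay=$2
    shift 2
    "$MARIAN" --task transformer-base -c config.yml \
        --workspace "$workspace" --optimizer-delay "$delay" --devices "$@"
}

# Four GPUs x 6000 MB x 2 delayed updates = 48000 MB per update.
# Run only when Marian and the training config are actually present.
if [ -x "$MARIAN" ] && [ -f config.yml ]; then
    train_transformer_base 6000 2 0 1 2 3
fi
```

With a single GPU, the same total could be approximated by passing only device 0 and raising the delay accordingly.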

## 3. Teacher-student

Knowledge distillation approaches, like the teacher-student method, are used for training a smaller student model to perform better by learning from a larger teacher model.

In NMT this might mean training a strong (and slow) ensemble of Transformer models as a teacher, then translating the entire source side of the training data with the teacher, and finally training a small (and fast) model on the original source and the translated target data. The student model usually has worse BLEU, but is much faster. A useful side effect of teacher-student learning is improved performance of the student with smaller beam sizes.

More details can be found in our paper on cost-effective and high-quality NMT with Marian.

### CPU decoding

The teacher-student method enables efficient translation on CPU. A couple of things can speed it up more:

• Batched translation as presented in the first part of the tutorial.
• A lexical shortlist, which restricts the output vocabulary to a small subset of translation candidates for each word.
• Auto-tuning of matrix product implementation with --optimize.

The --cpu-threads option enables decoding on the CPU with the given number of threads.

Experiment with different settings for decoding on a single CPU. Check different student models, batched translation, lexical shortlists and auto-tuning.

How does a student model perform with small beam sizes? What is the speed improvement from a lexical shortlist on CPU and GPU? What are the architectures of the student models and how large are they? What is the issue with decoding on multiple CPU threads with model files in .npz format?

The models can be found in the ./3_student folder. They have been trained on true-cased and BPE-segmented data, so they require pre- and post-processing:

cd ../3_student
cat data/newstest2014.bpe.en \
| ~/marian-dev/build/marian-decoder -m model.student.small/model.npz -v vocab.ende.yml vocab.ende.yml \

The shortlist file is 3_student/data/lex.s2t. Model-specific parameters can be read from a model.npz file using the script marian-dev/scripts/contrib/model_info.py.
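The CPU options discussed in this section can be combined as in the sketch below. Several things here are assumptions: `decode_cpu` is a hypothetical wrapper name, the two numbers after the shortlist path (50 50) and the batch sizes are illustrative, and the default shortlist path follows the sentence above, so adjust `SHORTLIST` if the file sits elsewhere in your copy of the data. The output is still BPE-segmented and true-cased, so post-processing is not covered here.

```shell
MARIAN_DECODER="${MARIAN_DECODER:-$HOME/marian-dev/build/marian-decoder}"
# Assumption: path to the lexical shortlist; point it at your lex.s2t.
SHORTLIST="${SHORTLIST:-data/lex.s2t}"

# Translate stdin to stdout on a single CPU thread with a student model.
decode_cpu() {
    # $1 = path to a student model.npz
    "$MARIAN_DECODER" -m "$1" -v vocab.ende.yml vocab.ende.yml \
        --cpu-threads 1 --beam-size 1 \
        --mini-batch 64 --maxi-batch 100 --maxi-batch-sort src \
        --shortlist "$SHORTLIST" 50 50 --optimize
}

# Run only when the decoder and the BPE-segmented test set are present.
if [ -x "$MARIAN_DECODER" ] && [ -f data/newstest2014.bpe.en ]; then
    decode_cpu model.student.small/model.npz \
        < data/newstest2014.bpe.en > output.cpu.bpe.de
fi
```

Removing one option at a time (shortlist, --optimize, batching) and comparing the reported total times shows how much each of them contributes.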